[Source: people.stat.sfu.ca/~wchallen/sites/default/files/stat-403-650-chapter-02-jmp.pdf]

Sampling, Regression, Experimental Design and Analysis for Environmental Scientists, Biologists, and Resource Managers

C. J. Schwarz
Department of Statistics and Actuarial Science, Simon Fraser University
[email protected]

March 5, 2011


Contents

2 Introduction to Statistics
   2.1 TRRGET - An overview of statistical inference
   2.2 Parameters, Statistics, Standard Deviations, and Standard Errors
      2.2.1 A review
      2.2.2 Theoretical example of a sampling distribution
   2.3 Confidence Intervals
      2.3.1 A review
      2.3.2 Some practical advice
      2.3.3 Technical details
   2.4 Hypothesis testing
      2.4.1 A review
      2.4.2 Comparing the population parameter against a known standard
      2.4.3 Comparing the population parameter between two groups
      2.4.4 Type I, Type II and Type III errors
      2.4.5 Some practical advice
      2.4.6 The case against hypothesis testing
      2.4.7 Problems with p-values - what does the literature say?
         Statistical tests in publications of the Wildlife Society
         The Insignificance of Statistical Significance Testing
         Followups
   2.5 Meta-data
      2.5.1 Scales of measurement
      2.5.2 Types of Data
      2.5.3 Roles of data
   2.6 Bias, Precision, Accuracy
   2.7 Types of missing data
   2.8 Transformations
      2.8.1 Introduction
      2.8.2 Conditions under which a log-normal distribution appears
      2.8.3 ln vs log
      2.8.4 Mean vs Geometric Mean
      2.8.5 Back-transforming estimates, standard errors, and ci
         Mean on log-scale back to MEDIAN on anti-log scale
      2.8.6 Back-transforms of differences on the log-scale
      2.8.7 Some additional readings on the log-transform
   2.9 Standard deviations and standard errors revisited
   2.10 Other tidbits
      2.10.1 Interpreting p-values
      2.10.2 False positives vs false negatives
      2.10.3 Specificity/sensitivity/power

©2010 Carl James Schwarz


Chapter 2

Introduction to Statistics

Statistics was spawned by the information age, and has been defined as the science of extracting information from data. Technological developments have demanded methodology for the efficient extraction of reliable statistics from complex databases. As a result, Statistics has become one of the most pervasive of all disciplines.

Theoretical statisticians are largely concerned with developing methods for solving the problems involved in such a process, for example, finding new methods for analyzing (making sense of) types of data that existing methods cannot handle. Applied statisticians collaborate with specialists in other fields in applying existing methodologies to real world problems. In fact, most statisticians are involved in both of these activities to a greater or lesser extent, and researchers in most quantitative fields of enquiry spend a great deal of their time doing applied statistics.

The public and private sectors rely on statistical information for such purposes as decision making, regulation, control, and planning.

Ordinary citizens are exposed to many ‘statistics’ on a daily basis. For example:

• “In a poll of 1089 Canadians, 47% were in favor of the constitutional accord. This result is accurate to within 3 percentage points, 19 times out of 20.”

• “The seasonally adjusted unemployment rate in Canada was 9.3%”.

• “Two out of three dentists recommend Crest.”


What does this all mean?

Our goal is not to make each student a ‘professional statistician’, but rather to give each student a subset of tools with which they can confidently approach many real world problems and make sense of the numbers.

2.1 TRRGET - An overview of statistical inference

Section summary:

1. Distinguish between a population and a sample

2. Why it is important to choose a probability sample

3. Distinguish among the roles of randomization, replication, and blocking

4. Distinguish between an ‘estimate’ or a ‘statistic’ and the ‘parameter’ of interest.

Most studies can be broadly classified into either surveys or experiments.

In surveys, the researcher is typically interested in describing some population - there is usually no attempt to manipulate units within the population. In experiments, units from the population are manipulated in some fashion and a response to the manipulation is observed.

There are four broad phases to the survey or the experiment. These phases define the paradigm of Statistical Inference, and will be illustrated in the context of a political poll of Canadians on some issue, as illustrated in the following diagram.


The four phases are:

1. What is the population of interest and what is the parameter of interest? This formulates the research question - what is being measured and what is of interest.

In this case, the population of interest is likely all eligible voters in Canada and the parameter of interest is the proportion of all eligible voters in favor of the accord.

It is conceivable, but certainly impractical, that every eligible voter could be contacted and their opinion recorded. You would then know the value of the parameter exactly and there would be no need to do any statistics. However, in most real world situations, it is impossible or infeasible to measure every unit in the population. Consequently, a sample is taken.

2. Selecting a sample

We would like our sample to be as representative as possible - how is this achieved? We would like our answer from our sample to be as precise as possible - how is this achieved? And, we may like to modify our sample selection method to take into account known divisions of the population - how is this achieved?

Three fundamental principles of Statistics are randomization, replication, and blocking.

Randomization This is the most important aspect of experimental design and surveys. It ensures that the sample is ‘representative’ of


the population by ensuring that, ‘on average’, the sample will contain a proportion of population units that is about equal to that found in the population, for any variable that may influence the responses of the unit. If an experiment is not randomized or a survey is not randomly collected, it rarely (if ever) provides useful information. Many people confuse ‘random’ with ‘haphazard’. The latter only means that the sample was collected without a plan or thought to ensure that the sample obtained is representative of the population. A truly ‘random’ sample takes a surprising amount of effort to collect! E.g. the Gallup poll uses random digit dialing to select at random from all households in Canada with a phone. Is this representative of the entire voting population? How does the Gallup Poll account for the different patterns of telephone use among genders within a household? A random sample is an example of an equal probability sample. As you will see later in the notes, the assumption of equal probability is not crucial - what is crucial is that every unit in the population have a known probability of selection.

Replication = Sample Size This ensures that the results from the experiment or the survey will be precise enough to be of use. A large sample size does not imply that the sample is representative - only randomization ensures representativeness. Do not confuse replication with repeating the survey a second time. In this example, the Gallup poll interviews about 1100 Canadians. It chooses this number of people to get a certain precision in the results.

Blocking (or stratification) In some experiments or surveys, the researcher knows of a variable that strongly influences the response. In the context of this example, there is a strong relationship between the region of the country and the response. Consequently, precision can be improved by first blocking or stratifying the population into more homogeneous groups. Then a separate randomized survey is done in each and every stratum and the results are combined together at the end. In this example, the Gallup poll often stratifies the survey by region of Canada. Within each region of Canada, a separate randomized survey is performed and the results are then combined appropriately at the end.

3. Data Analysis

Once the survey design is finalized and the survey is conducted, you will have a mass of information - statistics - collected from the population. This must be checked for errors, usually transcribed into machine-readable form, and summarized.


The analysis is dependent upon BOTH the data collected (the sample) and the way the data was collected (the sample selection process). For example, if the data were collected using a stratified sampling design, it must be analyzed using the methods for stratified designs - you can’t simply pretend after the fact that the data were collected using a simple random sampling design.

We will emphasize this point continually in this course - you must match the analysis with the design!

For example, 511 out of 1089 Canadians interviewed were in favor, i.e. 47% of our sample respondents were in favor.

4. Inference back to the Population

Despite an enormous amount of money spent collecting the data, interest really lies in the population, not the sample. The sample is merely a device to gather information about the population.

How should the information from the sample be used to make inferences about the population?

Graphing A good graph is always preferable to a table of numbers or to numerical statistics. A graph should be clear, relevant, and informative. Beware of graphs that try to mislead by design or accident through misleading scales, chart junk, or three dimensional effects. There are a number of good books on effective statistical graphics - these should be consulted for further information.1

Estimation The number obtained from our sample is an estimate of the true, unknown value of the population parameter. How precise is our estimate? Are we within 10 percentage points of the correct answer? A good survey or experiment will report a measure of precision for any estimate. In this example, 511 of 1089 people were in favor of the accord. Our estimate of the proportion of all Canadian voters in favor of the accord is 511/1089 = 47%. These results are ‘accurate to within 3 percentage points, 19 times out of 20’, which implies that we are reasonably confident that the true proportion of voters in favor of the accord is between 47% - 3% = 44% and 47% + 3% = 50%. Technically, this is known as a 95% confidence interval - the details of which will be explored later in this chapter.
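The ‘3 percentage points, 19 times out of 20’ figure can be reproduced with the usual normal-approximation margin of error for a proportion. This is a sketch of the standard textbook calculation, not necessarily the pollster’s exact method:

```python
import math

# Poll results from the text.
n = 1089          # Canadians interviewed
x = 511           # number in favor
p_hat = x / n     # sample proportion, about 47%

# "19 times out of 20" is a 95% confidence level, so z is about 1.96.
margin = 1.96 * math.sqrt(p_hat * (1 - p_hat) / n)

print(f"estimate: {p_hat:.0%}, margin of error: {margin:.1%}")
```

The computed margin is about 0.03, matching the reported ‘3 percentage points’.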

(Hypothesis) Testing Suppose that in last month’s poll (conducted in a similar fashion), only 42% of voters were in favor. Has the support increased? Because each percentage value is accurate to about 3 percentage points, it is possible that in fact there has been no change in support!

1 A “perfect” thesis defense would be to place a graph of your results on the overhead and then sit down to thunderous applause!


It is possible to make a more formal ‘test’ of the hypothesis of no change. Again, this will be explored in more detail later in this chapter.

2.2 Parameters, Statistics, Standard Deviations, and Standard Errors

Section summary:

1. Distinguish between a parameter and a statistic

2. What does a standard deviation measure?

3. What does a standard error measure?

4. How are estimated standard errors determined (in general)?

2.2.1 A review

DDT is a very persistent pesticide. Once applied, it remains in the environment for many years and tends to accumulate up the food chain. For example, birds which eat rodents which eat insects which ingest DDT-contaminated plants can have very high levels of DDT, and this can interfere with reproduction. [This is similar to what is happening in the Great Lakes, where herring gulls have very high levels of pesticides, or what is happening in the St. Lawrence River, where resident beluga whales have such high levels of contaminants that they are considered hazardous waste if they die and wash up on shore.] DDT has been banned in Canada for several years, and scientists are measuring the DDT levels in wildlife to see how quickly it is declining.

The Science of Statistics is all about measurement and variation. If there were no variation, there would be no need for statistical methods. For example, consider a survey to measure DDT levels in gulls on Triangle Island off the coast of British Columbia, Canada. If all the gulls on Triangle Island had exactly the same DDT level, then it would suffice to select a single gull from the island and measure its DDT level.

Alas, the DDT level can vary by the age of the gull, by where it feeds, and a host of other unknown and uncontrollable variables. Consequently, the average DDT level over ALL gulls on Triangle Island seems like a sensible measure of the pesticide load in the population. We recognize that some gulls may have


levels above this average, some gulls below this average, but feel that changes in the average DDT level are indicative of the health of the population.

Population mean and population standard deviation. Conceptually, we can envision a listing of the DDT levels of each and every gull on Triangle Island. From this listing, we could conceivably compute the true population average and compute the (population) standard deviation of the DDT levels. [Of course, in practice these are unknown and unknowable.] Statistics often uses Greek symbols to represent the theoretical values of population parameters. In this case, the population mean is denoted by the Greek letter mu (µ) and the population standard deviation by the Greek letter sigma (σ). The population standard deviation measures the variation of individual measurements about the mean in the population.

In this example, µ would represent the average DDT over all gulls on the island, and σ would represent the variation of values around the population mean. Both of these values are unknown.

Scientists took a random sample (how was this done?) of 10 gulls and found the following DDT levels.

100, 105, 97, 103, 96, 106, 102, 97, 99, 103.

The raw data is also available in a JMP datasheet called ddt.jmp, available from the Sample Program Library at http://www.stat.sfu.ca/~cschwarz/Stat-650/Notes/MyPrograms.

Sample mean and sample standard deviation The sample average and sample standard deviation could be computed from these values using a spreadsheet, calculator, or a statistical package.

Here is the output from JMP:
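The JMP output itself does not survive in this transcript, but a short Python sketch (not the author’s JMP analysis) reproduces the same summary statistics:

```python
import statistics

# DDT levels (ppm) for the 10 sampled gulls, from the text.
ddt = [100, 105, 97, 103, 96, 106, 102, 97, 99, 103]

n = len(ddt)
ybar = statistics.mean(ddt)    # sample mean
s = statistics.stdev(ddt)      # sample standard deviation (n-1 divisor)
se = s / n ** 0.5              # estimated standard error of the mean

print(f"mean = {ybar:.1f} ppm, std dev = {s:.4f} ppm, SE = {se:.4f} ppm")
```

This gives mean = 100.8 ppm, std dev = 3.5214 ppm, and SE = 1.1136 ppm, the values referred to below.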


A different notation is used to represent sample quantities to distinguish them from population parameters. In this case the sample mean, denoted Ȳ and pronounced Y-bar, has the value of 100.8 ppm, and the sample standard deviation, denoted using the letter s, has the value of 3.52 ppm. The sample mean is a measure of the middle of the sample data and the sample standard deviation measures the variation of the sample data around the sample mean.

What would happen if a different sample of 10 gulls was selected? It seems reasonable that the sample mean and sample standard deviation would also change among samples, and we hope that if our sample is large enough, the change in the statistics would not be that large.

Here is the data from an additional 8 samples, each of size 10:

Set   DDT levels in the gulls                   mean   std
1     102 102 103  95 105  97  95 104  98 103   100.4  3.8
2     100 103  99  98  95  98  94 100  90 103    98.0  4.1
3     101  96 106 102 104  95  98 103 108 104   101.7  4.2
4     101 100  99  90 102  99 105  92 100 102    99.0  4.6
5     107  98 101 100 100  98 107  99 104  98   101.2  3.6
6     102 102 101 101  92  94 104 100 101  97    99.4  3.8
7      94 101 100 100  96 101 100  98  94  98    98.2  2.7
8     104 102  97 104  97  99 100 100 109 102   101.4  3.7

Note that the statistics (Ȳ - the sample mean, and s - the sample standard deviation) change from sample to sample. This is not unexpected, as it is highly unlikely that two different samples would give identical results.

What does the variation in the sample mean over repeated samples from the same population tell us? For example, based on the values of the sample mean above, could the true population mean DDT over all gulls be 150 ppm? Could it be 120 ppm? Could it be 101 ppm? Why?

If more and more samples were taken, you would end up with a large number of sample means. A histogram of the sample means over the repeated samples could be drawn. This would be known as the sampling distribution of the sample mean.
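The repeated-sampling idea can also be sketched by simulation. The population below is hypothetical (a normal population with illustrative mean 100 ppm and standard deviation 3.5 ppm, not the actual Triangle Island gulls):

```python
import random
import statistics

random.seed(1)

# Hypothetical gull population: DDT levels around mu = 100 ppm with
# individual standard deviation sigma = 3.5 ppm (illustrative values only).
mu, sigma, n = 100, 3.5, 10

# Take many repeated samples of size n and record each sample mean.
means = [
    statistics.mean(random.gauss(mu, sigma) for _ in range(n))
    for _ in range(10_000)
]

# The spread of the sample means approximates the theoretical
# standard error sigma/sqrt(n), i.e. close to 3.5/sqrt(10) = 1.11.
print(round(statistics.stdev(means), 2))
```

A histogram of `means` would be an approximation to the sampling distribution discussed above.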

The latter result is a key concept of statistical inference and can be quite abstract because, in practice, you never see the sampling distribution. The distribution of individual values over the entire population can be visualized; the distribution of individual values in the particular sample can be examined directly as you have actual data; the hypothetical distribution of a statistic over repeated samples from the population is always present, but remains one


level of abstraction away from the actual data.

Because the sample mean varies from sample to sample, it is theoretically possible to compute a standard deviation of the statistic as it varies over all possible samples drawn from the population. This is known as the standard error (abbreviated SE) of the statistic (in this case it would be the standard error of the sample mean).

Because we have repeated samples in the gull example, we can compute the actual standard deviation of the sample mean over the 9 sample means (from the original sample plus the additional 8 samples). This gives an estimated standard error of 1.40 ppm. This measures the variability of the statistic (Ȳ) over repeated samples from the same population.
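This calculation can be checked directly from the nine sample means listed above:

```python
import statistics

# Sample means: the original sample of 10 gulls plus the additional 8 samples.
means = [100.8, 100.4, 98.0, 101.7, 99.0, 101.2, 99.4, 98.2, 101.4]

# Their standard deviation estimates the standard error of the sample mean.
print(round(statistics.stdev(means), 2))   # about 1.40 ppm
```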

But - unless you take repeated samples from the same population, how can the standard error ever be determined?

For example, refer back to the output from JMP. How was the value of 1.1136 determined for the standard error of the mean?

Now statistical theory comes into play. Every statistic varies over repeated samples. In some cases, it is possible to derive from statistical theory how much the statistic will vary from sample to sample. In the case of the sample mean for a sample selected at random from any population, the se is theoretically equal to:

se(Ȳ) = σ/√n

Note that every statistic will have a different theoretical formula for its standard error and the formula will change depending on how the sample was selected.

But this theoretical standard error depends upon an unknown quantity (the theoretical population standard deviation σ). It seems sensible to estimate the standard error by replacing the value of σ by an estimate - the sample standard deviation s. This gives:

Estimated Std Error Mean = s/√n = 3.5214/√10 = 1.1136 ppm.

This number is an estimate of the variability of Ȳ in repeated samples of the same size selected at random from the same population.

A Summary of the crucial points:

• Parameter The parameter is a numerical measure of the entire population. Two common parameters are the population mean (denoted by µ) and the population standard deviation (denoted by σ). The population


standard deviation measures the variation of individual values over all units in the population. Parameters always refer to the population, never to the sample.

• Statistic or Estimate: A statistic or an estimate is a numerical quantity computed from the SAMPLE. This is only a guess as to the true value of the population parameter. If you took a new sample, your estimate computed from the second sample would be different from the value computed from the first sample. Two common statistics are the sample mean (denoted Ȳ) and the sample standard deviation (denoted s). The sample standard deviation measures the variation of individual values over the units in the sample. Statistics always refer to the sample, never to the population.

• Sampling distribution Any statistic or estimate will change if a new sample is taken. The distribution of the statistic or estimate over repeated samples from the same population is known as the sampling distribution.

• Theoretical standard error: The variability of the estimate over all possible repeated samples from the population is measured by the standard error of the estimate. This is a theoretical quantity and could only be computed if you actually took all possible samples from the population.

• Estimated standard error Now for the hard part - you typically only take a single sample from the population. But, based upon statistical theory, you know the form of the theoretical standard error, so you can use information from the sample to estimate the theoretical standard error. Be careful to distinguish between the standard deviation of individual values in your sample and the estimated standard error of the statistic. The formula for the estimated standard error is different for every statistic and also depends upon the way the sample was selected. Consequently, it is vitally important that the method of sample selection and the type of estimate computed be determined carefully before using a computer package to blindly compute standard errors.

The concept of a standard error is the MOST DIFFICULT CONCEPT to grasp in statistics. The reason that it is so difficult is that there is an extra layer of abstraction between what you observe and what is really happening. It is easy to visualize variation of individual elements in a sample because the values are there for you to see. It is easy to visualize variation of individual elements in a population because you can picture the set of individual units. But it is difficult to visualize the set of all possible samples because typically you only take a single sample, and the set of all possible samples is so large.

As a final note, please do NOT use the ± notation for standard errors. The problem is that the ± notation is ambiguous, and different papers in the same journal and different parts of the same paper use the ± notation for different


meanings. Modern usage is to write phrases such as “the estimated mean DDT level was 100.8 (SE 1.1) ppm.”

2.2.2 Theoretical example of a sampling distribution

Here is a more detailed examination of a sampling distribution where the actual set of all possible samples can be constructed. It shows that the sample mean is unbiased and that its standard error computed from all possible samples matches that derived from statistical theory.

Suppose that a population consisted of five mice and we wish to estimate the average weight based on a sample of size 2. [Obviously, the example is hopelessly simplified compared to a real population and sampling experiment!]

Normally, the population values would not be known in advance (otherwise, why would you need to take a sample?). But suppose that the five mice had weights (in grams) of:

33, 28, 45, 43, 47.

The population mean weight and population standard deviation are found as:

• µ = (33 + 28 + 45 + 43 + 47)/5 = 39.20 g and

• σ = 7.39 g.

The population mean is the average weight over all possible units in the population. The population standard deviation measures the variation of individual weights about the mean, over the population units.

Now there are 10 possible samples of size two from this population. For each possible sample, the sample mean and sample standard deviation are computed as shown in the following table.


Sample units   Sample mean (Ȳ)   Sample std dev (s)
33 28              30.50               3.54
33 45              39.00               8.49
33 43              38.00               7.07
33 47              40.00               9.90
28 45              36.50              12.02
28 43              35.50              10.61
28 47              37.50              13.44
45 43              44.00               1.41
45 47              46.00               1.41
43 47              45.00               2.83
Average            39.20               7.07
Std dev             4.52               4.27
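The table’s key columns can be verified by enumerating all possible samples of size 2 (a sketch in Python; the original notes use JMP):

```python
import statistics
from itertools import combinations

weights = [33, 28, 45, 43, 47]   # the five mouse weights (g)

# All C(5,2) = 10 possible samples of size 2 and their sample means.
samples = list(combinations(weights, 2))
means = [statistics.mean(s) for s in samples]

# The average of the sample means equals the population mean (unbiasedness).
print(statistics.mean(means))               # 39.2

# The spread of the sample means over ALL possible samples is the
# (true) standard error of the sample mean; population divisor is used
# because this is the complete sampling distribution.
print(round(statistics.pstdev(means), 2))   # 4.52
```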

This table illustrates the following:

• this is a theoretical table of all possible samples of size 2. Consequently, it shows the actual sampling distribution for the statistics Ȳ and s. The sampling distribution of Ȳ refers to the variation of Ȳ over all the possible samples from the population. Similarly, the sampling distribution of s refers to the variation of s over all possible samples from the population.

• some values of Ȳ are above the population mean, and some values of Ȳ are below the population mean. We don’t know for any single sample if we are above or below the true value of the population parameter. Similarly, values of s (which is a sample standard deviation) also vary above and below the population standard deviation.

• the average (expected) value of Ȳ over all possible samples is equal to the population mean. We say such estimators are unbiased. This is the hard concept! The extra level of abstraction is here - the statistic computed from an individual sample has a distribution over all possible samples, hence the sampling distribution.

• the average (expected) value of s over all possible samples is NOT equal to the population standard deviation. We say that s is a biased estimator. This is a difficult concept - you are taking the average of an estimate of the standard deviation. The average is taken over possible values of s from all possible samples. The latter is an extra level of abstraction from the raw data. [There is nothing theoretically wrong with using a biased estimator, but most people would prefer to use an unbiased estimator. It turns out that the bias in s decreases very rapidly with sample size and so is not a concern.]

• the standard deviation of Ȳ refers to the variation of Ȳ over all possible samples. We normally call this the standard error of a statistic. [The


term comes from an historical context that is not important at this point.] Do not confuse the standard error of a statistic with the sample standard deviation or the population standard deviation. The standard error measures the variability of a statistic (e.g. Ȳ) over all possible samples. The sample standard deviation measures variability of individual units in the sample. The population standard deviation measures variability of individual units in the population.

• if the previous formula for the theoretical standard error was used in this example, it would fail to give the correct answer:

i.e. se(Ȳ) = 4.52 ≠ σ/√n = 7.39/√2 = 5.22. The reason that this formula didn’t work is that the sample size was an appreciable fraction of the entire population. A finite population correction needs to be applied in these cases. As you will see in later chapters, the se in this case is computed as:

se(Ȳ) = (σ/√n) × √(1 − f) × √(N/(N − 1)) = (7.39/√2) × √(1 − 2/5) × √(5/4) = 4.52

where f = n/N is the sampling fraction.

Refer to the chapter on survey sampling for more details.
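The finite population correction formula above can be checked numerically for the mouse example (a sketch; f = n/N is the sampling fraction):

```python
import math
import statistics

weights = [33, 28, 45, 43, 47]      # the five mouse weights (g)
N = len(weights)                    # population size
n = 2                               # sample size
sigma = statistics.pstdev(weights)  # population std dev, about 7.39 g
f = n / N                           # sampling fraction

# Standard error of the sample mean with the finite population correction.
se = sigma / math.sqrt(n) * math.sqrt(1 - f) * math.sqrt(N / (N - 1))
print(round(se, 2))   # 4.52
```

This matches the 4.52 obtained by enumerating all possible samples, confirming the formula.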

2.3 Confidence Intervals

Section summary:

1. Understand the general logic of why a confidence interval works

2. How to graph a confidence interval for a single parameter

3. How to interpret graphs of several confidence intervals

4. Effect of sample size upon the size of a confidence interval

5. Effect of variability upon the size of a confidence interval

6. Effect of confidence level upon the size of a confidence interval

2.3.1 A review

The basic premise of statistics is that every unit in a population cannot be measured – consequently, a sample is taken. But the statistics from a sample will vary from sample to sample, and it is highly unlikely that the value of the statistic will equal the true, unknown value of the population parameter.


Confidence intervals are a way to express the level of certainty about the true population parameter value based upon the sample selected. The formulae for the various confidence intervals depend upon the statistic used and how the sample was selected, but are all derived from a general unified theory.

The following concepts are crucial and will be used over and over again in what follows:

• Estimate: The estimate is the quantity computed from the SAMPLE. This is only a guess as to the true value of the population parameter. If you took a new sample, the estimate computed from the second sample would be different from the value computed from the first sample. It seems reasonable that if you select your sample carefully, these estimates will sometimes be lower than the true population parameter and sometimes higher.

• Standard error: The variability of the estimate over repeated samples from the population is measured by the standard error of the estimate. It again seems reasonable that if you select your sample carefully, the statistics should be ‘close’ to the true population parameters, and the standard error should provide some information about the closeness of the estimate to the true population parameter.

Refer back to the DDT example considered in the last section. Scientists took a random sample of gulls from Triangle Island (off the coast of Vancouver Island, British Columbia) and measured the DDT levels in 10 gulls. The following values were obtained (ppm):

100, 105, 97, 103, 96, 106, 102, 97, 99, 103.

What does the sample tell us about the true population average DDT level over all gulls on Triangle Island?

We use JMP to compute sample statistics as before:


The sample mean, Y = 100.8 ppm, measures the middle of the sample data, and the sample standard deviation, s = 3.52 ppm, measures the spread of the sample data around the sample mean.
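The same summary statistics can be reproduced outside JMP; a minimal sketch using only the Python standard library:

```python
import statistics

# The Triangle Island sample (ppm), as listed above.
ddt = [100, 105, 97, 103, 96, 106, 102, 97, 99, 103]

ybar = statistics.mean(ddt)   # sample mean
s = statistics.stdev(ddt)     # sample standard deviation (n - 1 divisor)

print(f"mean = {ybar:.1f} ppm, sd = {s:.4f} ppm")  # mean = 100.8 ppm, sd = 3.5214 ppm
```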

Based on this sample information, is it plausible to believe that the average DDT level over ALL gulls could be as high as 150 ppm? Could it be as low as 50 ppm? Is it plausible that it could be as high as 110 ppm? As high as 101 ppm?

Suppose you had the information from the other 8 samples.

Sample set   DDT levels in the gulls                         Mean    Std
    1        102 102 103  95 105  97  95 104  98 103        100.4   3.8
    2        100 103  99  98  95  98  94 100  90 103         98.0   4.1
    3        101  96 106 102 104  95  98 103 108 104        101.7   4.2
    4        101 100  99  90 102  99 105  92 100 102         99.0   4.6
    5        107  98 101 100 100  98 107  99 104  98        101.2   3.6
    6        102 102 101 101  92  94 104 100 101  97         99.4   3.8
    7         94 101 100 100  96 101 100  98  94  98         98.2   2.7
    8        104 102  97 104  97  99 100 100 109 102        101.4   3.7

Based on this new information, what would you believe to be a plausible value for the true population mean?

It seems reasonable that, because the sample means taken over repeated samples from the same population seem to lie between 98 and 102 ppm, this should provide some information about the true population value. For example, if you saw in the 8 additional samples that the sample means varied between 90 and 110 ppm, would your plausible interval change?

Again statistical theory comes into play. A very famous and important (for statisticians!) theorem, the Central Limit Theorem, gives the theoretical sampling distribution of many statistics for most common sampling methods.


In this case, the Central Limit Theorem states that the sample mean from a simple random sample from a large population should have an approximate normal distribution with se(Y) = σ/√n. The se of Y measures the variability of Y around the true population mean when different samples of the same size are taken. Note that the variability of the sample mean is LESS than that of individual observations – does this make sense?
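The claim that the sample mean varies far less than individual observations can be checked by simulation. A sketch, assuming (for illustration only) a normal population with mean 100 ppm and sd 3.5 ppm:

```python
import random
import statistics

random.seed(42)  # reproducible

n, reps = 10, 10000
# Draw many samples of size n from a hypothetical population and record each mean.
means = [statistics.mean(random.gauss(100, 3.5) for _ in range(n))
         for _ in range(reps)]

sd_of_means = statistics.stdev(means)
print(round(sd_of_means, 3))   # close to 3.5 / sqrt(10), i.e. about 1.107
```

The standard deviation of the simulated sample means is close to σ/√n, not σ – exactly what the Central Limit Theorem predicts.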

Using the properties of a Normal distribution, there is a 95% probability that Y will vary within about ±2se of the true mean (why?). Conversely, there should be about a 95% probability that the true mean is within ±2se of Y! This is the crucial step in statistical reasoning.

Unfortunately, σ – the population standard deviation – is unknown, so we can’t find the se of Y. However, it seems reasonable to assume that s, the sample standard deviation, is a reasonable estimator of σ, the population standard deviation. So s/√n should be a reasonable estimator of σ/√n. This is what is reported in the above output, and we have that the Estimated Std Error Mean = s/√n = 3.5214/√10 = 1.1136 ppm. This number is an estimate of how variable Y is around the true population mean in repeated samples of the same size from the same population.

Consequently, it seems reasonable that there should be about a 95% probability that the true mean is within ±2 estimated se of the sample mean, and we state that an approximate 95% confidence interval is computed as: Y ± 2(estimated se), or 100.8 ± 2(1.1136) = 100.8 ± 2.2272 = (98.6 → 103.0) ppm.
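The rough ±2 se interval can be written out in a few lines; a sketch continuing from the gull sample above:

```python
import statistics

ddt = [100, 105, 97, 103, 96, 106, 102, 97, 99, 103]
n = len(ddt)
ybar = statistics.mean(ddt)
se = statistics.stdev(ddt) / n ** 0.5   # estimated standard error, about 1.1136

lo, hi = ybar - 2 * se, ybar + 2 * se   # approximate 95% interval
print(f"({lo:.1f} -> {hi:.1f}) ppm")    # (98.6 -> 103.0) ppm
```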

It turns out that we also have to account for the fact that s is only an estimate of σ (s can also vary from sample to sample), and so the estimated se may not equal the theoretical standard error. Consequently, the multiplier (2) has to be increased slightly to account for this.

JMP does this automatically in the above output, and the interval of (98.3 → 103.3) is slightly wider.

For large samples (typically greater than 30), the multiplier is very close to 2 (actually has the value of 1.96), and there is virtually no difference in the intervals, because then s is a very good estimator of σ and no additional correction is needed.

We say that we are 95% confident the true population mean (whatever it is) is somewhere in the interval (98.6 → 103.0) ppm. What does this mean? We are pretty sure that the true mean DDT is not 110 ppm, nor is it 90 ppm. But we don’t really know if it is 99 or 102 ppm. Plausible values for the true mean DDT for ALL gulls are any values in the range 98.6 → 103.0 ppm.


Note that the interval is NOT an interval for the individual values in the population but rather for the true population mean µ. Also, it is not a confidence interval for the sample mean (which you know to be 100.8) but rather for the unknown population mean µ. These two points are the most common misinterpretations of confidence intervals.

This interval can be graphed in JMP using diamonds (use the Analyze->Distribution platform).

The confidence interval is shown as the upper and lower limits of the diamond.

Notice that the upper and lower bars of a box-plot and the upper and lower limits of the confidence intervals are telling you different stories. Be sure that you understand the difference!

Many packages and published papers don’t show confidence intervals, but rather simply show the mean and then either ±1se or ±2se from the mean as approximate 68% or 95% confidence intervals, such as below:


There really isn’t any reason to plot ±1se, as these are approximate 68% confidence limits, which seems kind of silly. The reason this type of plot persists is because it is the default option in Excel, which has the largest collection of bad graphs in the world. [A general rule of thumb – DON’T USE EXCEL FOR STATISTICS!]

What are the likely effects of changing sample sizes, different amounts of variability, and different levels of confidence upon the confidence interval width?

It seems reasonable that a larger sample size should be ‘more precise’, i.e. have less variation over repeated samples from the same population. This implies that a confidence interval based on a larger sample should be narrower for the same level of confidence, i.e. a 95% confidence interval from a sample with n = 100 should be narrower than a 95% confidence interval from a sample with n = 10 when taken from the same population.

Also, if the elements in a population are more variable, then the variation of the sample mean should be larger and the corresponding confidence interval should be wider.


And, why stop at a 95% confidence level – why not find a 100% confidence interval? In order to be 100% confident, you would have to sample the entire population – not practical in most cases. Again, it seems reasonable that interval widths will increase with the level of confidence, i.e. a 99% confidence interval will be wider than a 95% confidence interval.

How are several groups of means compared if all were selected using a random sample? For now, one simple way to compare several groups is through the use of side-by-side confidence intervals. For example, consider a study that looked at the change in weight of animals when given one of three different drugs (A, D, or placebo). Here is a side-by-side confidence interval plot. These can be generated using the Analyze->Fit Y-by-X platform in JMP (make sure the X variable is nominal or ordinal scale, and then choose the Means Diamonds option from the special pop-up menu). For example, consider the dataset DrugLBI.JMP in the SAMPLE DATA directory of JMP. Make sure that the drug column is nominal scale and the lbs column (weight change in pounds) is continuous scale, and then use the Analyze->Fit Y-by-X platform.


Or, confidence interval diamonds can also be displayed.


What does this show? Because the 95% confidence intervals for drug A and drug D have considerable overlap, there doesn’t appear to be much of a difference in the population means (the same value could be common to both groups). However, there is not much overlap between the Placebo and the other drugs. The population means may differ. [Note the distinction between the sample and population means in the above discussion.]

As another example, consider the following graph of barley yields for three years, along with 95% confidence intervals drawn on the graphs. The data are from a study of crop yields downwind of a coal-fired generating plant that started operation in 1985. What does this suggest?


Because the 95% confidence intervals for 1984 and 1980 overlap considerably, there really isn’t much evidence that the true mean yields differ. However, because the 95% confidence interval for 1988 does not overlap those of the other two groups, there is good evidence that the population mean in 1988 is smaller than in the previous two years.

In general, if the 95% confidence intervals of two groups do not overlap, then there is good evidence that the group population means differ. If there is considerable overlap, then the population means of both groups might be the same.
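This overlap check is mechanical once each group's mean and se are known. A sketch with made-up summary numbers (the barley data themselves are only shown in the graph, so the values below are hypothetical):

```python
# Hypothetical (mean, standard error) summaries for two groups of yields:
groups = {"1980": (62.1, 1.4), "1988": (55.3, 1.2)}

def rough_ci(mean, se, mult=2):
    """Approximate 95% confidence interval as mean +/- 2 se."""
    return (mean - mult * se, mean + mult * se)

ci_a = rough_ci(*groups["1980"])   # (59.3, 64.9)
ci_b = rough_ci(*groups["1988"])   # (52.9, 57.7)

# Two intervals overlap when each one's lower end lies below the other's upper end.
overlap = ci_a[0] <= ci_b[1] and ci_b[0] <= ci_a[1]
print(overlap)   # False -> good evidence the population means differ
```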

2.3.2 Some practical advice

• In order for confidence intervals to have any meaning, the data must be collected using a probability sampling method. No amount of statistical wizardry will give valid inference for data collected in a haphazard fashion. Remember, haphazard does not imply random selection.

• If you consult statistical textbooks, they are filled with many hundreds of formulae for confidence intervals under many possible sampling designs. The formulae for confidence intervals are indeed different for various estimators and sampling designs – but they are all interpreted in a similar fashion.

• A rough and ready rule of thumb is that a 95% confidence interval is found as estimate ± 2se and a 68% confidence interval is found as estimate ± 1se. Don’t worry too much about the exact formulae – if a study doesn’t show clear conclusions based on these rough rules, then using more exact methods won’t improve things.

• The crucial part is finding the se. This depends upon the estimator and sampling design – pay careful attention that the computer package you


are using, and the options within the computer package, match the actual data collection methods. I can’t emphasize this too much! This is the most likely spot where you may inadvertently use an inappropriate analysis!

• Confidence intervals are sensitive to outliers because both the sample mean and standard deviation are sensitive to outliers.

• If the sample size is small, then you must also make a very strong assumption about the population distribution. This is because the central limit theorem only works for large samples. Recent work using bootstrap and other resampling methods may provide an alternative approach.

• The confidence interval only tells you the uncertainty in knowing the true population parameter because you only measured a sample from the population2. It does not cover potential imprecision caused by nonresponse, under-coverage, measurement errors, etc. In many cases, these can be orders of magnitude larger – particularly if the data were not collected according to a well-defined plan.

2.3.3 Technical details

The technical details of a confidence interval for the population mean, when the sample is collected using a random sample from a normal population, are presented here.

The exact formula for a confidence interval for a single mean, when the data are collected using a simple random sample from a population with normally distributed data, is:

Y ± tn−1 × se   or   Y ± tn−1 × s/√n

where tn−1 refers to values from a t-distribution with (n − 1) degrees of freedom. Values of the t-distribution are tabulated in the tables located at http://www.stat.sfu.ca/~cschwarz/Stat-650//Notes/PDF/Tables.pdf. For the above example for gulls on Triangle Island, n = 10, so the multiplier for a 95% confidence interval is t9 = 2.2622, and the confidence interval was found as: 100.8 ± 2.2622(1.1136) = 100.8 ± 2.5192 = (98.28 → 103.32) ppm, which matches the results provided by JMP.

Note that different sampling schemes may not use a t-distribution, and most certainly will have different degrees of freedom for the t-distribution.

2This is technically known as the sampling error


This formula is useful when the raw data are not given, and only the summary statistics (typically the sample size, the sample mean, and the sample standard deviation) are given and a confidence interval needs to be computed.
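Working from summary statistics alone is a one-liner; a sketch using the gull summaries and the multiplier t9 = 2.2622 quoted from the tables above:

```python
n, ybar, s = 10, 100.8, 3.5214   # summary statistics from the gull sample
t_mult = 2.2622                  # t-value, 9 df, 95% confidence (from the tables)

se = s / n ** 0.5
lo, hi = ybar - t_mult * se, ybar + t_mult * se
print(f"({lo:.2f} -> {hi:.2f}) ppm")   # (98.28 -> 103.32) ppm
```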

What is the effect of sample size? If the above formula is examined, the primary place where the sample size n comes into play is the denominator of the standard error. So as n increases, the se decreases. This is sensible because as the sample size increases, Y should be less variable (and usually closer to the true population mean). Consequently, the width of the interval decreases. The confidence level doesn’t change – we would still be roughly 95% confident, but the interval is smaller. The sample size also affects the degrees of freedom, which affects the t-value, but this effect is minor compared to the change in the se.

What is the effect of increasing the confidence level? If you wanted to be 99% confident, the t-value from the table increases. For example, the t-value for 9 degrees of freedom increases from 2.262 for a 95% confidence interval to 3.25 for a 99% confidence interval. In general, a higher confidence level will give a wider confidence interval.
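The widening can be seen numerically; a sketch using the two t-values quoted above (2.262 for 95%, 3.25 for 99%, both with 9 df):

```python
se = 3.5214 / 10 ** 0.5      # estimated se from the gull sample

width_95 = 2 * 2.262 * se    # full width of the 95% interval
width_99 = 2 * 3.25 * se     # full width of the 99% interval
print(round(width_95, 2), round(width_99, 2))   # 5.04 7.24
```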

2.4 Hypothesis testing

Section summary:

1. Understand the basic paradigm of hypothesis testing

2. Interpret p-values correctly

3. Understand Type I, Type II, and Type III errors

4. Understand the limitation of hypothesis testing

2.4.1 A review

Hypothesis testing is an important paradigm of Statistical Inference, but it has its limitations. In recent years, emphasis has moved away from formal hypothesis testing toward more inferential statistics (e.g. confidence intervals), but hypothesis testing still has an important role to play.

There are two common hypothesis testing situations encountered in ecology:

• Comparing the population parameter against a known standard. For example, environmental regulations may specify that the mean contaminant


loading in water must be less than a certain fixed value.

• Comparing the population parameter among 2 or more groups. For example, is the average DDT loading the same for male and female birds?

The key steps in hypothesis testing are:

• Formulate the hypothesis of NO CHANGE in terms of POPULATION parameters.

• Collect data using a good sampling design or a good experimental design, paying careful attention to the RRRs.

• Using the data, compute the difference between the sample estimate and the standard, or the difference in the sample estimates among the groups.

• Evaluate if the observed change (or difference) is consistent with NO EFFECT. This is usually summarized by a p-value.

2.4.2 Comparing the population parameter against a known standard

Again consider the example of gulls on Triangle Island introduced in previous sections.

Of interest is the mean DDT level in the gulls. Let µ represent the average DDT over all gulls on the island. The value of this population parameter is unknown because you would have to measure ALL gulls, which is logistically impossible to do.

Now suppose that the value of 98 ppm is a critical value for the health of the species. Is there evidence that the current mean level is different from 98 ppm?

Scientists took a random sample (how was this done?) of 10 gulls and found the following DDT levels.

100, 105, 97, 103, 96, 106, 102, 97, 99, 103.

The data were entered in JMP, and the Analyze->Distribution platform was used to get the summary statistics:


First examine the 95% confidence interval presented above. The confidence interval excludes the value of 98 ppm, so one is fairly confident that the population mean DDT level differs from 98 ppm. Furthermore, the confidence interval gives information about what the population mean DDT level could be. Note that the hypothesized value of 98 ppm is just outside the 95% confidence interval.

A hypothesis test is much more ‘formal’ and consists of several steps:

1. Formulate hypotheses. This is a formal statement of two alternatives. The null hypothesis (denoted as H0 or H) indicates the state of ignorance or no effect. The alternate hypothesis (denoted as H1 or A) indicates the effect that is to be detected if present.

Both the null and alternate hypotheses can be formulated before any data are collected and are always formulated in terms of the population parameter. They are NEVER formulated in terms of the sample statistics, as these would vary from sample to sample.

In this case, the null and alternate hypotheses are:

H: µ = 98, i.e. the mean DDT level for ALL gulls is 98 ppm.

A: µ ≠ 98, i.e. the mean DDT level for ALL gulls is not 98 ppm.

This is known as a two-sided test because we are interested in whether the mean is either greater than or less than 98 ppm.3

2. Collect data. Again it is important that the data be collected using probability sampling methods and the RRRs. The form of the data collection will influence the next step.

3. Compute a test-statistic and p-value. The test-statistic is computed from the data and measures the discrepancy between the observed data

3It is possible to construct what are known as one-sided tests, where interest lies ONLY in whether the population mean exceeds 98 ppm, or ONLY in whether the population mean is less than 98 ppm. These are rarely useful in ecological work.


and the null hypothesis, i.e. how far is the observed sample mean of 100.8 ppm from the hypothesized value of 98 ppm?

At this point, use JMP to do a test of the population mean against a KNOWN value:

Specify the known standard for the POPULATION mean of 98:


You should obtain the output:

The output starts with the hypothesized value (98 ppm) followed by the estimate (the sample mean) of 100.8 ppm. From the earlier output,


we know that the se of the sample mean is 1.11. JMP also reports the standard deviation (the deviation of individual values of DDT), but this really is NOT of interest. I would have much preferred it to report the se of the mean.

How discordant is the sample mean of 100.8 ppm with the hypothesized value of 98 ppm? One discrepancy measure is known as a T-ratio and is computed as:

T = (estimate − hypothesized value) / estimated se = (100.8 − 98) / (3.52136/√10) = 2.5145

This implies the estimate is about 2.5 se away from the null hypothesis value of 98. This T-ratio is labelled as the Test Statistic in the output.
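The T-ratio can be reproduced directly from the summary statistics; a minimal sketch:

```python
n, ybar, s = 10, 100.8, 3.52136   # gull sample summaries
mu0 = 98                          # hypothesized population mean (ppm)

se = s / n ** 0.5                 # estimated standard error of the mean
T = (ybar - mu0) / se             # discrepancy measured in standard errors
print(round(T, 4))                # 2.5145
```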

Note that there are many measures of discrepancy of the estimate with the null hypothesis – JMP also provides a ‘non-parametric’ statistic suitable when the assumption of normality in the population may be suspect; this is not covered in this course.

How is this measure of discordance between the sample mean (100.8) and the hypothesized value of 98 assessed? The unusualness of the test statistic is measured by finding the probability of observing the current test statistic assuming the null hypothesis is true. In other words, if the hypothesis were true (and the true population mean is 98 ppm), what is the probability of finding a sample mean of 100.8 ppm?4

This is denoted the p-value. Notice that the p-value is attached to the data – it measures the probability of the sample mean given the hypothesis is true. Probabilities CANNOT be attached to hypotheses – it would be incorrect to say that there is a 3% chance that the hypothesis is true. The hypothesis is either true or false; it can’t be “partially” true!5

In this case, there are three possible p-values depending upon the form of the alternate hypothesis. Because we are interested in whether the mean DDT value is greater than or less than 98 ppm, the appropriate p-value is the one denoted ‘Prob > |t|’. Our p-value is found to be 0.0331.

4. Make a decision. How are the test statistic and p-value used? The basic paradigm of hypothesis testing is that unusual events provide evidence against the null hypothesis. Logically, rare events “shouldn’t happen” if the null hypothesis is true. This logic can be confusing! We will discuss it more in class.

In our case, the p-value of 0.0331 indicates there is an approximate 3.3% chance of observing a sample mean that differs this much from the hypothesized value of 98 if the null hypothesis were true.

4In actual fact, the probability is computed of finding a value of 100.8 or more distant from the hypothesized value of 98. This will be explained in more detail later in the notes.

5This is similar to asking a small child if they took a cookie. The truth is either “yes” or “no”, but often you will get the response of “maybe”, which really doesn’t make much sense!


Is this unusual? There are no fixed guidelines for the degree of unusualness expected before declaring a result to be unusual. Many people use a 5% cut-off value, i.e. if the p-value is less than 0.05, then this is evidence against the null hypothesis; if the p-value is greater than 0.05, then this is not evidence against the null hypothesis. [This cut-off value is often called the α-level.] If we adopt this cut-off value, then our observed p-value of 0.0331 is evidence against the null hypothesis, and we find that there is evidence that the true mean DDT level is different from 98 ppm.

The plot at the bottom of the output presented by JMP is helpful in trying to understand what is going on. [No equivalent plot is readily available in R.] It tries to give a measure of how unusual the sample mean of 100.8 is relative to the hypothesized value of 98. If the hypothesis were true, and the true population mean was 98, then you would expect the sample means to be clustered around the value of 98. The bell-shaped curve shows the distribution of the SAMPLE MEANS if repeated samples are taken from the same population. It is centered over the true population mean (98) with a variability measured by the se of 1.11. The small vertical tick mark just under the value of 101 represents the observed sample mean of 100.8. You can see that the observed sample mean of 100.8 is somewhat unusual compared to the population value of 98. The shaded areas in the two tails represent the probability of observing the value of 100.8 (or worse) for the sample mean if the hypothesis were true, and represent the p-value.

If you repeated the same steps with a hypothesized value of 80, you would see that the observed sample mean of 100.8 is extremely unusual relative to the population value of 80:


The p-value is < .0001, indicating that the observed sample mean of 100.8 is very unusual relative to the hypothesis. We have evidence against the hypothesized value of 80.

Again, look at the graph produced by JMP. If the hypothesis were true, i.e. the true population mean (µ) was 80 ppm, then you would expect to see most of the sample means clustered around the value of 80 (the curve shown). The actual value of 100.8 is very unusual (if the hypothesis were true) – the vertical tick mark is far from what would be expected.

Conversely, repeat the steps with a hypothesized value of 100:


Now we would expect the sample means to cluster around the value of 100, as shown by the curve. The observed value of 100.8 for the sample mean is not very unusual at all. The p-value is .49, which is quite large – there is no evidence that the observed sample mean is unusual relative to the hypothesized mean of 100.

Technical details

The example presented above is a case of testing a population mean against a known value when the population values have a normal distribution and the data are selected using a simple random sample.

• The null and alternate hypotheses are written as:

H: µ = µ0

A: µ ≠ µ0

where µ0 = 98 is the hypothesized value.

The test statistic

T = (estimate − hypothesized value) / estimated se = (Y − µ0) / (s/√n)


then has a t-distribution with n − 1 degrees of freedom. The observed value of the test statistic is

T0 = (100.8 − 98) / (3.52136/√10) = 2.5145

The p-value is computed by finding

Prob(|T| > |T0|)

and is found to be 0.0331.
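The two-sided p-value can be checked without statistical software by integrating the t9 density numerically. A sketch using only the standard library (Simpson's rule over the far tail; the cut-off of 60 and step count are our choices for illustration):

```python
import math

def t_pdf(x, df):
    """Density of Student's t-distribution with df degrees of freedom."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return c * (1 + x * x / df) ** (-(df + 1) / 2)

def upper_tail(t0, df, upper=60.0, steps=20000):
    """P(T > t0) by Simpson's rule; an upper limit of 60 is far enough for small df."""
    h = (upper - t0) / steps
    total = t_pdf(t0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(t0 + i * h, df)
    return total * h / 3

T0, df = 2.5145, 9
p_two_sided = 2 * upper_tail(T0, df)   # matches JMP's 'Prob > |t|'
print(round(p_two_sided, 4))           # about 0.0331
```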

• In some very rare cases, the population standard deviation σ is known. In these cases, the true standard error is known, and the test-statistic is compared to a normal distribution. This is extremely rare in practice.

• The assumption of normality of the population values can be relaxed if the sample size is sufficiently large. In those cases, the central limit theorem indicates that the distribution of the test-statistic is known regardless of the underlying distribution of population values.

2.4.3 Comparing the population parameter between two groups

A common situation in ecology is to compare a population parameter in two (or more) groups. For example, it may be of interest to investigate if the mean DDT levels in male and female birds could be the same.

There are now two population parameters of interest. The mean DDT level of male birds is denoted as µm, while the mean DDT level of female birds is denoted as µf. These would be the mean DDT levels of all birds of the respective sex – once again these cannot be measured, as not all birds can be sampled.

The hypotheses of interest are:

• H : µm = µf or H : µm − µf = 0

• A : µm ≠ µf or A : µm − µf ≠ 0

Again note that the hypotheses are in terms of the POPULATION parameters. The alternate hypothesis indicates that a difference in means in either direction is of interest, i.e. we don’t have an a priori belief that male birds have a smaller or larger population mean compared to female birds.

A random sample is taken from each of the populations using the RRR.


The data are entered into JMP as:

Notice there are now two columns. One column identifies the group membership of each bird (the sex) and is nominal or ordinal in scale. The second column gives the DDT reading for each bird.6

The Analyze->Fit Y-by-X platform is used to compare the population means of the two groups:

6The columns can be in any order. As well, the data can be in any order, and male and female birds can be interspersed.


Specify the variable that defines the groups (sex) as the X variable and the measured response (the DDT reading) as the Y variable:

This produces a side-by-side dot-plot:


Next, compute simple summary statistics for each group:


The individual sample means and se for each sex are reported, along with 95% confidence intervals for the population mean DDT of each sex. The 95% confidence intervals for the two sexes have virtually no overlap, which implies that a single plausible value common to both sexes is unlikely to exist.

Because we are interested in comparing the two population means, it seems sensible to estimate the difference in the means. This can be done for this experiment using a statistical technique called (for historical reasons) a “t-test”.7

This gives the output:

The first part of the output estimates the difference in the population means. Because each sample mean is an unbiased estimator for the corresponding population mean, it seems reasonable that the difference in sample means should be unbiased for the difference in population means.

Unfortunately, many packages do NOT provide information on the order in which the difference in means was computed. Most packages order the groups alphabetically, but this can often be changed. The estimated difference in means is -4.4 ppm. The difference is negative, indicating that the sample mean DDT for the males is less than the sample mean DDT for the females. As usual, a measure of precision (the se) should be reported for each estimate. The se for the difference in means is 1.14 (refer to later chapters on how the se is computed),

7The t-test requires a simple random sample from each group.


and the 95% confidence interval for the difference in population means is (−7.1, −1.72). Because the 95% confidence interval for the difference in population means does NOT include the value of 0, there is evidence that the mean DDT for all males could be different from the mean DDT for all females.

The graph on the right is interpreted in a similar fashion. If the hypothesis were true, then the differences in sample means should cluster around zero (why?). The curve shows the distribution of the differences in sample means if the hypothesis were true. We see that the observed difference of -4.4 (indicated by the vertical tick mark) is fairly unusual – it is quite far from the bulk of the distribution that should be centered around zero.

The t-ratio is again a measure of how far the difference in sample means is from the hypothesized value of 0 difference, and is found as the observed difference divided by the se of the difference. The p-value of .0058 indicates that the observed difference in sample means of -4.4 would be quite unusual if the hypothesis were true. Because the p-value is quite small, there is strong evidence against the hypothesis of no difference.
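The same style of comparison can be sketched in Python. The DDT readings below are invented for illustration (the actual data are not reproduced here), and Welch's version of the two-sample t-test is used so that equal variances in the two groups need not be assumed:

```python
from scipy import stats

# Hypothetical DDT readings (ppm) for the two sexes; illustrative only.
males   = [10.2, 11.5,  9.8, 12.0, 10.9, 11.1]
females = [15.1, 14.8, 16.3, 15.9, 14.5, 16.0]

# Estimated difference in means (males - females): a negative value
# means the male sample mean is below the female sample mean.
diff = sum(males) / len(males) - sum(females) / len(females)

# Welch two-sample t-test of H: equal population means.
result = stats.ttest_ind(males, females, equal_var=False)
print(diff, result.statistic, result.pvalue)
```

A small p-value here, as in the JMP output above, is evidence against the hypothesis that the two population means are equal.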

The comparison of means (and other parameters) will be explored in more detail in future chapters.

2.4.4 Type I, Type II and Type III errors

Hypothesis testing can be thought of as analogous to a courtroom trial. The null hypothesis is that the defendant is “innocent” while the alternate hypothesis is that the defendant is “guilty”. The role of the prosecutor is to gather evidence that is inconsistent with the null hypothesis. If the evidence is sufficiently unusual (under the assumption of innocence), the null hypothesis is disbelieved.

Obviously the criminal justice system is not perfect. Occasionally mistakes are made (innocent people are convicted or guilty parties are not convicted).

The same types of errors can also occur when testing scientific hypotheses. For historical reasons, the two possible types of errors that can occur are labeled as Type I and Type II errors:

• Type I error Also known as a false positive. A Type I error occurs when evidence against the null hypothesis is erroneously found when, in fact, the hypothesis is true. How can this occur? Well, the p-value measures the probability that the data could have occurred, by chance, if the null hypothesis were true. We usually conclude that the evidence is strong against the null hypothesis if the p-value is small, i.e. a rare event. However, rare events do occur, and perhaps the data is just one of these rare events. The Type I error rate can be controlled by the cut-off value used to decide if the evidence against the hypothesis is sufficiently strong. If you believe that the evidence is strong enough when the p-value is less than the α=.05 level, then you are willing to accept a 5% chance of making a Type I error.

• Type II error Also known as a false negative. A Type II error occurs when you believe that the evidence against the null hypothesis is not strong enough, when, in fact, the hypothesis is false. How can this occur? The usual reason for a Type II error is that the sample size is too small to make a good decision. For example, suppose that the confidence interval for the gull example extended from 50 to 150 ppm. There is no evidence that any value in the range of 50 to 150 is inconsistent with the null hypothesis.

There are two types of correct decision:

• Power. The power of a hypothesis test is the ability to conclude that the evidence is strong enough against the null hypothesis when in fact it is false, i.e. the ability to detect that the null hypothesis is false. This is controlled by the sample size.

• Specificity. The specificity of a test is the ability to correctly find no evidence against the null hypothesis when it is true.

In any experiment, it is never known if one of these errors or a correct decision has been made. The Type I and Type II errors and the two correct decisions can be placed into a summary table:


                          Action Taken
                          p-value < α: evidence           p-value > α: no evidence
                          against the null hypothesis     against the null hypothesis
True      Null            Type I error = false positive   Correct decision. Also known
state     hypothesis      error. Controlled by the        as the specificity of the
of        true            α-level used to decide if the   test.
nature                    evidence is strong enough
                          against the null hypothesis.
          Null            Correct decision. Known as      Type II error = false negative
          hypothesis      the power or sensitivity of     error. Controlled by sample
          false           the test. Controlled by the     size, with a larger sample
                          sample size, with a larger      size leading to fewer Type II
                          sample size having greater      errors.
                          power to detect a false null
                          hypothesis.

In the context of a monitoring design to determine if there is an environmental impact due to some action, the above table reduces to:

                          Action Taken
                          p-value < α: evidence against   p-value > α: no evidence
                          the null hypothesis. Impact     against the null hypothesis.
                          apparently observed.            Impact apparently not observed.
True      Null            Type I error = false positive   Correct decision. No
state     hypothesis      error. An environmental         environmental impact
of        true: no        impact is “detected” when,      detected.
nature    environmental   in fact, none occurred.
          impact
          Null            Correct decision.               Type II error = false negative
          hypothesis      Environmental impact            error. Environmental impact
          false:          detected.                       not detected.
          environmental
          impact exists

Usually, a Type I error is more serious (convicting an innocent person; falsely detecting an environmental impact and fining an organization millions of dollars), and so we want good evidence before we conclude against the null hypothesis. We measure the strength of the evidence by the p-value. Typically, we want the p-value to be less than about 5% before we believe that the evidence is strong enough against the null hypothesis, but this can be varied depending on the problem. If the consequences of a Type I error are severe, the evidence must be very strong before action is taken, so the α level might be reduced to .01 from the “usual” .05.

Most experimental studies tend to ignore power (and Type II error) issues. However, these are important – for example, should an experiment be run that only has a 10% chance of detecting an important effect? What are the consequences of failing to detect an environmental impact? What is the price tag of letting a species go extinct without detecting it? We will explore issues of power and sample size in later chapters.
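Power is easy to approximate by simulation. The sketch below uses an illustrative effect size and standard deviation (these numbers are assumptions, not taken from any example in the text) to estimate the power of a two-sided one-sample t-test at two sample sizes:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=1)

def simulated_power(n, effect, sd, alpha=0.05, n_sim=2000):
    """Fraction of simulated experiments whose p-value falls below alpha
    when the true mean really does differ from the hypothesized value."""
    hits = 0
    for _ in range(n_sim):
        sample = rng.normal(loc=effect, scale=sd, size=n)
        p = stats.ttest_1samp(sample, popmean=0.0).pvalue
        hits += p < alpha
    return hits / n_sim

# With only 5 observations the test rarely detects the effect;
# with 40 observations it usually does.
low = simulated_power(n=5, effect=1.0, sd=2.0)
high = simulated_power(n=40, effect=1.0, sd=2.0)
print(low, high)
```

Running the same simulation at several sample sizes is a simple way to choose n before collecting data.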

What is a Type III error? This is more whimsical, as it refers to a correct answer to the wrong question! Too often, researchers get caught up in their particular research project and spend much time and energy in obtaining an answer, but the answer is not relevant to the question of interest.

2.4.5 Some practical advice

• The p-value does NOT measure the probability that the null hypothesis is true. It measures the probability of observing the sample data assuming the null hypothesis were true. You cannot attach a probability statement to the null hypothesis, in the same way you can’t be 90% pregnant! The hypothesis is either true or false – there is no randomness attached to a hypothesis. The randomness is attached to the data.

• A rough rule of thumb is that there is sufficient evidence against the hypothesis if the observed test statistic is more than 2 se away from the hypothesized value.

• The p-value is also known as the observed significance level. In the past, you chose a prespecified significance level (known as the α level) and if the p-value was less than α, you concluded against the null hypothesis. For example, α is often set at 0.05 (denoted α = 0.05). If the p-value < α = 0.05, then you concluded that the evidence was strong against the null hypothesis; otherwise you concluded that the evidence was not strong enough against the null hypothesis. Scientific papers often reported results using a series of asterisks, e.g. “*” meant that a result was statistically significant at α = .05; “**” meant that a result was statistically significant at α = .01; “***” meant that a result was statistically significant at α = .001. This practice reflects a time when it was quite impossible to compute exact p-values, and only tables were available. In this modern era, there is no excuse for failing to report the exact p-value. All scientific papers should report the actual p-value for a test so that the reader can use their own personal significance level.

• Some ‘traditional’ and recommended nomenclature for the results of hypothesis testing:

p-value                Traditional                     Recommended
p-value < 0.05         Reject the null hypothesis.     There is strong evidence
                                                       against the null hypothesis.
.05 < p-value < .15    Barely fail to reject the       Evidence is equivocal and we
                       null hypothesis.                need more data.
p-value > .15          Fail to reject the null         There is no evidence against
                       hypothesis.                     the null hypothesis.

However, the point at which we conclude that there is sufficient evidence against the null hypothesis (the α level, which was .05 above) depends upon the situation at hand and the consequences of wrong decisions (see later in this chapter).

• It is not good form to state things like:

– accept the null hypothesis

– accept the alternate hypothesis

– the null hypothesis is true

– the null hypothesis is false.

The reason is that you haven’t ‘proved’ the truthfulness or falseness of the hypothesis; rather, you have or do not have sufficient evidence that contradicts it. It is for the same reason that jury trials return verdicts of ‘guilty’ (evidence against the hypothesis of innocence) or ‘not guilty’ (insufficient evidence against the hypothesis of innocence). A jury trial does NOT return an ‘innocent’ verdict.

• If there is evidence against the null hypothesis, a natural question to ask is ‘well, what values of the parameter are plausible given this data?’ This is exactly what a confidence interval tells you. Consequently, I usually prefer to find confidence intervals, rather than doing formal hypothesis testing.

• Carrying out a statistical test of a hypothesis is straightforward with many computer packages. However, using tests wisely is not so simple. Hypothesis testing demands the RRR. Any survey or experiment that doesn’t follow the three basic principles of statistics (randomization, replication, and blocking) is basically useless. In particular, non-randomized surveys or experiments CANNOT be used in hypothesis testing or inference. Be careful that ‘random’ is not confused with ‘haphazard’. Computer packages do not know how you collected the data. It is your responsibility to ensure that your brain is engaged before putting the package in gear. Each test is valid only in circumstances where the method of data collection adheres to the assumptions of the test. Some hesitation about the use of significance tests is a sign of statistical maturity.

• Beware of outliers or other problems with the data. Be prepared to spend a fair amount of time examining the raw data for spurious points.

2.4.6 The case against hypothesis testing

In recent years, there has been much debate about the usefulness of hypothesis testing in scientific research (see the next section for a selection of articles). There are a number of “problems” with the uncritical use of hypothesis testing:

• Sharp null hypothesis The value of 98 ppm as a hypothesized value seems rather arbitrary. Why not 97.9 ppm or 98.1 ppm? Do we really think that the true DDT value is exactly 98.000000000 ppm? Perhaps it would be more reasonable to ask “How close is the actual mean DDT in the population to 98 ppm?”

• Choice of α The choice of α-level (i.e. the 0.05 significance level) is also arbitrary. The value of α should reflect the costs of Type I errors, i.e. the costs of false positive results. In a murder trial, the cost of sending an innocent person to the electric chair is very large – we require a very large burden of proof, i.e. the p-value must be very small. On the other hand, the cost of an innocent person paying for a wrongfully issued parking ticket is not very large; a lesser burden of proof is required, i.e. a higher p-value can be used to conclude that the evidence is strong enough against the hypothesis. A similar analysis should be made for any hypothesis testing situation, but rarely is done. The tradeoffs between Type I and II errors, power, and sample size are rarely discussed in this context.

• Sharp decision rules Traditional hypothesis testing says that if the p-value is less than α, you should conclude that there is sufficient evidence against the null hypothesis, and if the p-value is greater than α there is not enough evidence against the null hypothesis. Suppose that α is set at .05. Should different decisions be made if the p-value is 0.0499 or 0.0501? It seems unlikely that extremely minor differences in the p-value should lead to such dramatic differences in conclusions.

• Obvious tests In many cases, hypothesis testing is used when the evidence is obvious. For example, why would you even bother testing if the true mean is 50 ppm? The data clearly show that it is not.

c©2010 Carl James Schwarz 45

Page 47: Sampling, Regression, Experimental Design and Analysis for ...people.stat.sfu.ca/~wchallen/sites/default/files/stat-403-650-chapter-02-jmp.pdf · Sampling, Regression, Experimental

CHAPTER 2. INTRODUCTION TO STATISTICS

• Interpreting p-values P-values are prone to misinterpretation as they measure the plausibility of the data assuming the null hypothesis is true, not the probability that the hypothesis is true. There is also the confusion between selecting the appropriate p-value for one- and two-sided tests.

Refer to the Ministry of Forests’ publication Pamphlet 30 on interpreting the p-value, available at http://www.stat.sfu.ca/~cschwarz/Stat-650/MOF/index.html.

• Effect of sample size P-values are highly affected by sample size. With sufficiently large sample sizes, every effect is statistically significant but may be of no biological interest.

• Practical vs statistical significance. Just because you find evidence against the null hypothesis (e.g. p-value < .05) does not imply that the effect is very large. For example, if you were to test if a coin were fair and were able to toss it 1,000,000 times, you would find strong evidence against the null hypothesis of fairness if the observed proportion of heads was 50.2%. But for all intents and purposes, the coin is fair enough for real use. Statistical significance is not the same as practical significance. Other examples of this trap are the numerous studies that show cancerous effects of certain foods. Unfortunately, the estimated increase in risk from these studies is often less than 1/100 of 1%!

The remedy for confusing statistical and practical significance is to ask for a confidence interval for the actual parameter of interest. This will often tell you the size of the purported effect.
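The coin example can be checked directly; this is a minimal sketch using scipy's exact binomial test (available in recent scipy versions), with illustrative counts of 50.2% heads in a million tosses:

```python
from scipy import stats

# A million tosses with 50.2% heads: the excess over 50% is only
# 0.2 percentage points, yet the two-sided p-value is tiny.
n_tosses = 1_000_000
heads = 502_000

result = stats.binomtest(heads, n=n_tosses, p=0.5)
effect = heads / n_tosses - 0.5   # practical size of the departure

print(result.pvalue, effect)
```

The p-value screams "not fair" while the estimated effect, two-tenths of a percentage point, is practically negligible.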

• Failing to detect a difference vs no effect. Just because an experiment fails to find evidence against the null hypothesis (e.g. p-value > .05) does not mean that there is no effect! A Type II error – a false negative error – may have been committed. These usually occur when experiments are too small (i.e. inadequate sample size) to detect effects of interest.

The remedy for this is to ask for the power of the test to detect the effect of practical interest, or failing that, ask for the confidence interval for the parameter. Typically power will be low, or the confidence interval will be so wide as to be useless.

• Multiple testing. In some experiments, hundreds of statistical tests are performed. However, remember that the p-value represents the chance that this data could have occurred given that the hypothesis is true. So a p-value of 0.01 implies that this event could have occurred in about 1% of cases EVEN IF THE NULL IS TRUE. So finding one or two significant results out of hundreds of tests is not surprising!

There are more sophisticated analyses available to control this problem, called ‘multiple comparison techniques’, which are covered in more advanced classes.
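This is easy to demonstrate by simulation; a minimal sketch, with an arbitrary choice of 200 t-tests run on pure noise:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=7)

# 200 one-sample t-tests on pure noise: the null hypothesis is true
# in every test, yet roughly 5% of p-values fall below 0.05 by chance.
n_tests = 200
pvals = np.array([
    stats.ttest_1samp(rng.normal(size=20), popmean=0.0).pvalue
    for _ in range(n_tests)
])
false_positives = int((pvals < 0.05).sum())

# A Bonferroni correction compares each p-value to alpha / n_tests,
# which controls the chance of even one false positive.
bonferroni_hits = int((pvals < 0.05 / n_tests).sum())
print(false_positives, bonferroni_hits)
```

The Bonferroni correction shown is only the simplest of the multiple comparison techniques mentioned above.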

c©2010 Carl James Schwarz 46

Page 48: Sampling, Regression, Experimental Design and Analysis for ...people.stat.sfu.ca/~wchallen/sites/default/files/stat-403-650-chapter-02-jmp.pdf · Sampling, Regression, Experimental

CHAPTER 2. INTRODUCTION TO STATISTICS

On the other hand, a confidence interval for the population parameter gives much more information. The confidence interval shows you how precise the estimate is, and the range of plausible values that are consistent with the data collected.

Modern statistical methodology is placing more and more emphasis upon the use of confidence intervals rather than a blind adherence to hypothesis testing.

2.4.7 Problems with p-values – what does the literature say?

There were two influential papers in the Wildlife Society publications that have affected how people view the use of p-values. Copies of these publications are available in the supplemental reading package.

Statistical tests in publications of the Wildlife Society

Cherry, S. (1998). Statistical tests in publications of the Wildlife Society. Wildlife Society Bulletin, 26, 947-954. http://www.jstor.org/stable/3783574.

The 1995 issue of the Journal of Wildlife Management has > 2400 p-values. I believe that is too many. In this article, I will argue that authors who publish in the Journal and in the Wildlife Society Bulletin are overusing and misunderstanding hypothesis tests. They are conducting too many unnecessary tests, and they are making common mistakes in carrying out and interpreting the results of the tests they conduct. A major cause of the overuse of testing in the Journal and the Bulletin seems to be the mistaken belief that testing is necessary in order for a study to be valid or scientific.

• What are the problems in the analysis of habitat availability?

• What additional information do confidence intervals provide that significance levels do not provide?

• When is the assumption of normality critical in testing if the means of two populations are equal?

• What does Cherry recommend in lieu of hypothesis testing?


The Insignificance of Statistical Significance Testing

Johnson, D. H. (1999). The Insignificance of Statistical Significance Testing. Journal of Wildlife Management, 63, 763-772. http://dx.doi.org/10.2307/3802789, or online at http://www.npwrc.usgs.gov/resource/methods/statsig/index.htm.

Despite their wide use in scientific journals such as The Journal of Wildlife Management, statistical hypothesis tests add very little value to the products of research. Indeed, they frequently confuse the interpretation of data. This paper describes how statistical hypothesis tests are often viewed, and then contrasts that interpretation with the correct one. He discusses the arbitrariness of p-values, conclusions that the null hypothesis is true, power analysis, and distinctions between statistical and biological significance. Statistical hypothesis testing, in which the null hypothesis about the properties of a population is almost always known a priori to be false, is contrasted with scientific hypothesis testing, which examines a credible null hypothesis about phenomena in nature. More meaningful alternatives are briefly outlined, including estimation and confidence intervals for determining the importance of factors, decision theory for guiding actions in the face of uncertainty, and Bayesian approaches to hypothesis testing and other statistical practices.

This is a very nice, readable paper that discusses some of the problems with hypothesis testing. As in the Cherry paper above, Johnson recommends that confidence intervals be used in place of hypothesis testing.

So why are confidence intervals not used as often as they should be? Johnson gives several reasons:

• hypothesis testing has become a tradition;

• the advantages of confidence intervals are not recognized;

• there is some ignorance of the procedures available;

• major statistical packages do not include many confidence interval estimates;

• sizes of parameter estimates are often disappointingly small even though they may be very significantly different from zero;

• the wide confidence intervals that often result from a study are embarrassing;


• some hypothesis tests (e.g., chi-square contingency table) have no uniquely defined parameter associated with them; and

• recommendations to use confidence intervals often are accompanied by recommendations to abandon statistical tests altogether, which is unwelcome advice.

These reasons are not valid excuses for avoiding confidence intervals in favour of hypothesis tests in situations for which parameter estimation is the objective.

Followups

In

Robinson, D. H. and Wainer, H. W. (2002). On the past and future of null hypothesis significance testing. Journal of Wildlife Management, 66, 263-271. http://dx.doi.org/10.2307/3803158.

the authors argue that there is some benefit to p-values in wildlife management, but then

Johnson, D. H. (2002). The role of hypothesis testing in wildlife science. Journal of Wildlife Management, 66, 272-276. http://dx.doi.org/10.2307/3803159.

counters many of these arguments. Both papers are very easy to read and are highly recommended.

2.5 Meta-data

Meta-data are data about data, i.e. how have they been collected, what are the units, what do the codes used in the dataset represent, etc. It is good practice to store the meta-data as close as possible to the raw data. For example, some computer packages (e.g. JMP) allow the user to store information about each variable and about the data table.

In some cases, data can be classified into broad classifications called scales or roles.


2.5.1 Scales of measurement

Data comes in various sizes and shapes, and it is important to know about these so that the proper analysis can be used on the data.

Some computer packages (e.g. JMP) use the scales of measurement to determine appropriate analyses of the data. For example, as you will see later in the course, if the response variable (Y) is interval scaled and the explanatory variable (X) is nominally scaled, then an ANOVA-type analysis comparing means is performed. If both the Y and X variables are nominally scaled, then a χ2-type analysis comparing proportions is performed.

There are usually 4 scales of measurement that must be considered:

1. Nominal Data

• the data are simply classifications

• the data have no ordering

• the data values are arbitrary labels

• e.g. sex using codes m and f, or codes 0 and 1. Note that just because a numeric code is used for sex, the variable is still nominally scaled. The practice of using numeric codes for nominal data is discouraged (see below).

2. Ordinal Data

• the data can be ordered but differences between values cannot be quantified

• e.g. political parties on left to right spectrum given labels 0, 1, 2

• e.g. Likert scales, rank on a scale of 1..5 your degree of satisfaction

• e.g. restaurant ratings

• e.g. size of animals as small, medium, large or coded as 1, 2, 3. Again, numeric codes for ordinal data are discouraged (see below).

3. Interval Data

• the data can be ordered, have a constant scale, but have no natural zero

• this implies that differences between data values are meaningful, but ratios are not

• there are really only two common interval scaled variables, e.g. temperature (°C, °F) and dates. For example, 30 °C − 20 °C = 20 °C − 10 °C, but 20 °C is not twice as hot as 10 °C!


4. Ratio Data

• data can be ordered, have a constant scale, and have a natural zero

• e.g. height, weight, age, length

Some packages (e.g. JMP) make no distinction between interval and ratio data, calling them both ‘continuous’ scaled. However, this is, technically, not quite correct.

Only certain operations can be performed on certain scales of measurement. The following list summarizes which operations are legitimate for each scale. Note that you can always apply operations from a ‘lesser scale’ to any particular data, e.g. you may apply nominal, ordinal, or interval operations to an interval scaled datum.

• Nominal Scale. You are only allowed to examine if a nominal scale datum is equal to some particular value or to count the number of occurrences of each value. For example, gender is a nominal scale variable. You can examine if the gender of a person is F or count the number of males in a sample. Taking the average of nominally scaled data is not sensible (e.g. the average sex is not sensible).

In order to avoid problems with computer packages trying to take averages of nominal data, it is recommended that alphanumeric codes be used for nominally scaled data, e.g. use M and F for sex rather than 0 or 1. Most packages can accept alphanumeric data without problems.

• Ordinal Scale. You are also allowed to examine if an ordinal scale datum is less than or greater than another value. Hence, you can ‘rank’ ordinal data, but you cannot ‘quantify’ differences between two ordinal values. For example, political party is an ordinal datum with the NDP to the left of the Conservative Party, but you can’t quantify the difference. Another example is preference scores, e.g. ratings of eating establishments where 10=good and 1=poor, but the difference between an establishment with a 10 ranking and an 8 ranking can’t be quantified.

Technically speaking, averages are not really allowed for ordinal data, e.g. taking the average of small, medium, and large as data values doesn’t make sense. Again, alphanumeric codes are recommended for ordinal data. Some care should be taken with ordinal data and alphanumeric codes, as many packages sort values alphabetically, and so the ordering of large, medium, small may not correspond to the ordering desired. JMP allows the user to specify the ordering of values in the Column Information of each variable. A simple trick to get around this problem is to use alphanumeric codes such as 1.small, 2.medium, 3.large as the data values, as an alphabetic sort then keeps the values in proper order.
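In packages that support them, ordered categorical types remove the need for such coding tricks. A sketch in Python's pandas library (an assumption for illustration; the text's examples use JMP):

```python
import pandas as pd

# Declare the ordering explicitly instead of relying on alphabetic codes.
sizes = pd.Series(pd.Categorical(
    ["large", "small", "medium", "small"],
    categories=["small", "medium", "large"],
    ordered=True,
))

# Sorting and comparisons now respect small < medium < large.
print(list(sizes.sort_values()))
print(sizes.min(), sizes.max())
```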

©2010 Carl James Schwarz


CHAPTER 2. INTRODUCTION TO STATISTICS

• Interval Scale. You are also allowed to quantify the difference between two interval scale values but there is no natural zero. For example, temperature scales are interval data with 25 °C warmer than 20 °C and a 5 °C difference has some physical meaning. Note that 0 °C is arbitrary, so that it does not make sense to say that 20 °C is twice as hot as 10 °C.

Values for interval scaled variables are recorded using numbers so that averages can be taken.

• Ratio Scale. You are also allowed to take ratios among ratio scaled variables. Physical measurements of height, weight, and length are typically ratio variables. It is now meaningful to say that 10 m is twice as long as 5 m. This ratio holds true regardless of which scale the object is being measured in (e.g. meters or yards). This is because there is a natural zero.

Values for ratio scaled variables are recorded as numbers so that averages can be taken.
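The alphanumeric-coding trick for ordinal data mentioned above can be sketched in a few lines (a hypothetical illustration; any package that sorts strings alphabetically behaves the same way):

```python
# Hypothetical illustration: a plain alphabetic sort scrambles ordinal
# labels, but prefixing each label with a digit preserves the intended
# small < medium < large ordering.
plain = ["small", "medium", "large"]
coded = ["1.small", "2.medium", "3.large"]

print(sorted(plain))  # alphabetic: ['large', 'medium', 'small'] -- wrong order
print(sorted(coded))  # ['1.small', '2.medium', '3.large'] -- intended order
```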

2.5.2 Types of Data

Data can also be classified by its type. This is less important than the scale of measurement, as it usually does not imply a certain type of analysis, but can have subtle effects.

Discrete data Only certain specific values are valid; points between these values are not valid. For example, counts of people (only integer values allowed), the grade assigned in a course (F, D, C-, C, C+, . . . ).

Continuous data All values in a certain range are valid. For example, height, weight, length, etc. Note that some packages label interval or ratio data as continuous. This is not always the case.

Continuous but discretized Continuous data cannot be measured to infinite precision. It must be discretized, and consequently is technically discrete. For example, a person’s height may be measured to the nearest cm. This can cause problems if the level of discretization is too coarse. For example, what would happen if a person’s height was measured to the nearest meter?

As a rule of thumb, if the discretization is less than 5% of the typical value, then a discretized continuous variable can be treated as continuous without problems.


2.5.3 Roles of data

Some computer packages (e.g. JMP) also make distinctions about the role of a variable.

Label A variable whose value serves as an identification of each observation - usually for plotting.

Frequency A variable whose value indicates how many occurrences of this observation occur. For example, rather than having 100 lines in a data set to represent 100 females, you could have one line with a count of 100 in the Frequency variable.

Weight This is rarely used. It indicates the weight that this observation is to have in the analysis. Usually used in advanced analyses.

X Identifies a variable as a ‘predictor’ variable. [Note that the use of the term ‘independent’ variable is somewhat old fashioned and is falling out of favour.] This will be more useful when actual data analysis is started.

Y Identifies a variable as a ‘response’ variable. [Note that the use of the term ‘dependent’ variable is somewhat old fashioned and is falling out of favour.] This will be more useful when actual data analysis is started.

2.6 Bias, Precision, Accuracy

The concepts of Bias, Precision and Accuracy are often used interchangeably in non-technical writing and speaking. However these have very specific statistical meanings and it is important that these be carefully differentiated.

The first important point about these terms is that they CANNOT be applied in the context of a single estimate from a single set of data. Rather, they are measurements of the performance of an estimator over repeated samples from the same population. Recall that a fundamental idea of statistics is that repeated samples from the same population will give different estimates, i.e. estimates will vary as different samples are selected.8

Bias is the difference between the average value of the estimator over repeated sampling from the population and the true parameter value. If the estimates from repeated sampling vary above and below the true population parameter value so that the average over all possible samples equals the true parameter value, we say that the estimator is unbiased.

8 The standard error of an estimator measures this variation over repeated samples from the same population.

There are two types of bias - systemic and statistical. Systemic Bias is caused by problems in the apparatus or the measuring device. For example, if a scale systematically gave readings that were 10 g too small, this would be a systemic bias. Or if snorkelers in a stream survey consistently see only 50% of the available fish, this would also be an example of systemic bias. Statistical bias is related to the choice of sampling design and estimator. For example, the usual sample statistics in a simple random sample give unbiased estimates of means, totals, and variances, but not of standard deviations. The ratio estimator of survey sampling (refer to later chapters) is also biased.

There is no way from the data at hand to detect systemic biases - the researcher must examine the experimental apparatus and design very carefully. For example, if repeated surveys were made by snorkeling over sections of streams, estimates may be very reproducible (i.e. very precise) but could be consistently WRONG, i.e. divers only see about 60% of the fish (i.e. biased). Systemic Bias is controlled by careful testing of the experimental apparatus etc. In some cases, it is possible to calibrate the method using "known" populations – e.g. mixing a solution of a known concentration and then having your apparatus estimate the concentration.

Statistical biases can be derived from statistical theory. For example, statistical theory can tell you that the sample mean of a simple random sample is unbiased for the population mean; that the sample VARIANCE is unbiased for the population variance; but that the sample standard deviation is a biased estimator for the population standard deviation. [Even though the sample variance is unbiased, the sample standard deviation is a NON-LINEAR function of the variance (i.e. square-rooted) and non-linear functions don’t preserve unbiasedness.] The ratio estimator is also biased for the population ratio. In many cases, the statistical bias can be shown to essentially disappear with reasonably large sample sizes.
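A small simulation (a sketch, with made-up population values) illustrates both facts: over repeated samples the sample variance averages out to the true variance, while the sample standard deviation averages out noticeably below the true standard deviation when the sample size is small:

```python
import random
import statistics

# Sketch (assumed setup): repeatedly draw simple random samples of size 5
# from a normal population with sd = 10 (variance = 100), then average the
# sample variance and sample standard deviation over the replicates.
random.seed(1)
TRUE_SD = 10.0
n, reps = 5, 20000

var_sum = sd_sum = 0.0
for _ in range(reps):
    sample = [random.gauss(50.0, TRUE_SD) for _ in range(n)]
    var_sum += statistics.variance(sample)  # unbiased for sigma^2
    sd_sum += statistics.stdev(sample)      # biased low for sigma

print(var_sum / reps)  # close to 100
print(sd_sum / reps)   # noticeably below 10 (about 9.4 for n = 5)
```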

Precision of an estimator refers to how variable the repeated estimates will be over repeated sampling from the same population. Again recall that every different sample from the same population will lead to a different estimate. If these estimates have very little variation over repeated samples, we say that the estimate is precise. The standard error (SE) of the estimator measures the variation of the estimator over repeated sampling from the same population.

The precision of an estimator is controlled by the sample size. In general, a larger sample size leads to more precise estimates than a smaller sample size.

The precision of an estimator is also determined by statistical theory. For example, the precision (standard error) of a sample mean selected using a simple random sample from a large population is found using mathematics to be equal to

SE = pop std dev/√n

A common error is to use this latter formula for all estimators that look like a mean – however the formula for the standard error of any estimator depends upon the way the data are collected (i.e. a simple random sample, a cluster sample, a stratified sample, etc.), the estimator of interest (e.g. different formulae are used for standard errors of means, proportions, totals, slopes, etc.) and, in some cases, the distribution of the population values (e.g. do elements from the population come from a normal distribution, or a Weibull distribution, etc.).
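For the simple random sample case, the formula can be checked by simulation (a sketch with assumed values: population sd 8 and sample size 16, so the theoretical SE is 2):

```python
import random
import statistics

# Sketch (assumed values): population sd = 8, sample size n = 16, so the
# theoretical standard error of the mean is 8 / sqrt(16) = 2. The
# empirical sd of many sample means over repeated sampling should agree.
random.seed(2)
sigma, n, reps = 8.0, 16, 20000

theory_se = sigma / n ** 0.5
means = [statistics.fmean(random.gauss(100.0, sigma) for _ in range(n))
         for _ in range(reps)]
empirical_se = statistics.stdev(means)

print(theory_se)               # 2.0
print(round(empirical_se, 2))  # close to 2.0
```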

Finally, accuracy is a combination of precision and bias. It measures the “average distance” of the estimator from the population parameter. Technically, one measure of the accuracy of an estimator is the Root Mean Square Error (RMSE) and is computed as

RMSE = √((Bias)² + (SE)²)

A precise, unbiased estimator will be accurate, but not all accurate estimators will be unbiased.
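The RMSE formula is easy to compute directly; the numbers below are made up solely to illustrate how bias and SE combine:

```python
# Root Mean Square Error from bias and standard error; the input
# values here are invented for illustration only.
def rmse(bias, se):
    return (bias ** 2 + se ** 2) ** 0.5

print(rmse(0.0, 2.0))  # unbiased estimator: accuracy equals the SE -> 2.0
print(rmse(3.0, 4.0))  # bias 3 and SE 4 combine to an RMSE of 5.0
```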

The relationship between bias, precision, and accuracy can be viewed graphically as shown below. Let * represent the true population parameter value (say the population mean), and periods (.) represent values of the estimator (say the sample mean) over repeated samples from the same population.

Precise, Unbiased, Accurate Estimator

                 * Pop mean
   ----------------------------------
              .. . ..                   Sample means

Imprecise, Unbiased, less accurate estimator

                 * Pop mean
   ----------------------------------
         ... .. . .. ...                Sample means

Precise, Biased, but accurate estimator

                 * Pop mean
   ----------------------------------
                    ...                 Sample means

Imprecise, Biased, less accurate estimator

                 * Pop mean
   ----------------------------------
                 .. ... ...             Sample means

Precise, Biased, less accurate estimator

                 * Pop mean
   ----------------------------------
                            ...         Sample means

Statistical theory can tell if an estimator is statistically unbiased, its precision, and its accuracy if a probabilistic sample is taken. If data is collected haphazardly, the properties of an estimator cannot be determined. Systemic biases caused by poor instruments cannot be detected statistically.


2.7 Types of missing data

Missing data happens frequently. There are three types of missing data and an important step in any analysis is to think of the mechanisms that could have caused the missing data.

First, data can be Missing Completely at Random (MCAR). In this case, the missing data is unrelated to the response variable or to any other variable in the study. For example, in field trials, a hailstorm destroys a test plot. It is unlikely that the hailstorm location is related to the response variable of the experiment or any other variable of interest to the experiment. In medical trials, a patient may leave the study because they win the lottery. It is unlikely that this is related to anything of interest in the study.

If data are MCAR, most analyses proceed unchanged. The design may be unbalanced, and the estimates have poorer precision than if all data were present, but no biases are introduced into the estimates.

Second, data can be Missing at Random (MAR). In this case, the missingness is unrelated to the response variable, but may be related to other variables in the study. For example, suppose that in a drug study involving males and females, some females must leave the study because they became pregnant. Again, as long as the missingness is not related to the response variable, the design is unbalanced and the estimates have poorer precision, but no biases are introduced into the estimates.

Third, and the most troublesome case, is Informative Missing. Here the missingness is related to the response. For example, a trial was conducted to investigate the effectiveness of fertilizer on the regrowth of trees after clear cutting. The added fertilizer increased growth, which attracted deer, which ate all the regrowth!9

The analyst must also carefully distinguish between values of 0 and missing values. They are NOT THE SAME! Here is a little example to illustrate the perils of missing data related to 0-counts. The Department of Fisheries and Oceans has a program called ShoreKeepers which allows community groups to collect data on the ecology of the shores of oceans in a scientific fashion that could be used in later years as part of an environmental assessment study. As part of the protocol, volunteers randomly place 1 m² quadrats on the shore and count the number of species of various organisms. Suppose the following data

9 There is an urban legend about an interview with an opponent of compulsory seat belt legislation who compared the lengths of stays in hospitals of auto accident victims who were or were not wearing seat belts. People who wore seat belts spent longer, on average, in hospitals following the accident than people not wearing seat belts. The opponent felt that this was evidence for not making seat belts compulsory!


were recorded for three quadrats:

Quadrat  Species  Count
Q1       A          5
Q1       C         10
Q2       B          5
Q2       C          5
Q3       A          5
Q3       B         10

Now based on the above data, what is the average density of species A? At first glance, it would appear to be (5 + 5)/2 = 5 per quadrat. However, there was no data recorded for species A in Q2. Does this mean that the density of species A was not recorded because people didn’t look for species A, or the density was not recorded because the density was 0? In the first instance, the value of A is Missing at Random in Q2 and the true estimated density of species A is indeed 5. In the second case, the missingness is informative, and the true estimated density is (5 + 0 + 5)/3 = 3.33 per quadrat.
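The quadrat calculation can be sketched directly; the dictionary below mirrors the table above, with species A simply absent from Q2:

```python
# Counts of species A by quadrat, as recorded: nothing stored for Q2.
counts_A = {"Q1": 5, "Q3": 5}
quadrats = ["Q1", "Q2", "Q3"]

# Treating the blank as "not looked for" (missing at random):
seen = [counts_A[q] for q in quadrats if q in counts_A]
mar_density = sum(seen) / len(seen)        # (5 + 5) / 2 = 5.0

# Treating the blank as a true count of zero (informative missing):
full = [counts_A.get(q, 0) for q in quadrats]
true_density = sum(full) / len(full)       # (5 + 0 + 5) / 3 = 3.33

print(mar_density, round(true_density, 2))  # 5.0 3.33
```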

The above example may seem simplistic, but many database programs are set up in this fashion to “save storage space” by NOT recording zero counts. Unfortunately, one cannot distinguish between a missing value implying that the count was zero, or a missing value indicating that the data was not collected. Even worse, many database queries could erroneously treat the missing data as missing at random and not as zeros, giving wrong answers to averages!

For example, the Breeding Bird Survey is an annual survey of birds that follows specific routes and records the number of each type of species encountered. According to the documentation about this survey,10 only the non-zero counts are stored in the database and some additional information, such as the number of actual routes run, is required to impute the missing zeroes:

“Since only non-zero counts are included in the database, the complete list of years a route is run allows the times in which the species wasn’t seen to be identified and included in the data analysis.”

If this extra step were not done, then you would face exactly the same problem as in the quadrat sampling example above.

The moral of the story is that 0 is a valid value and should be recorded as such! Computer storage costs are declining so quickly that the “savings” from not recording 0’s soon vanish when people can’t or don’t remember to adjust for the unrecorded 0 values.

10 http://www.cws-scf.ec.gc.ca/nwrc-cnrf/migb/stat_met_e.cfm


If your experiment or survey has informative missing values, you could have a serious problem in the analysis and expert help should be consulted.

2.8 Transformations

2.8.1 Introduction

Many of the procedures in this course have an underlying assumption that the data from each group is normally distributed with a common variance. In some cases this is patently false, e.g. the data are highly skewed with variances that change, often with the mean.

The most common method to fix this problem is a transformation of the data, and the most common transformation in ecology is the logarithmic transform, i.e. analyze log(Y) rather than Y. Other transformations are possible - these will not be discussed in this course, but the material below applies equally well to these other transformations.

If you are unsure of the proper transformation, there are a number of methods that can assist, including the Box-Cox transform and an application of Taylor’s Power Law. These are beyond the scope of this course.

The logarithmic transform is often used when the data are positive and exhibit a pronounced long right tail. For example, the following are plots of (made-up) data before and after a logarithmic transformation:


There are several things to note in the two graphs.

• The distribution of Y is skewed with a long right tail, but the distribution of log(Y) is symmetric.

• The mean is to the right of the median in the original data, but the mean and median are the same in the transformed data.

• The standard deviation of Y is large relative to the mean (cv = std dev/mean = 131/421 = 31%) whereas the standard deviation is small relative to the mean on the transformed data (cv = std dev/mean = .3/6.0 = 5%).

• The box-plots show a large number of potential “outliers” in the original data, but only a few on the transformed data. It can be shown that in the case of a log-normal distribution, about 5% of observations are more than 3 standard deviations from the mean, compared to a normal distribution with less than 1/2 of 1% of such observations.

The form of the Y data above occurs quite often in ecology and is often called a log-normal distribution, given that a logarithmic transformation seems to “normalize” the data.


2.8.2 Conditions under which a log-normal distribution appears

Under what conditions would you expect to see a log-normal distribution? Normal distributions often occur when the observed variable is the “sum” of underlying processes. For example, heights of adults (within a sex) are fit very closely by a normal distribution. The height of a person is determined by the “sum” of the heights of the shin, thigh, trunk, neck, head and other portions of the body. A famous theorem of statistics (the Central Limit Theorem) says that data that are formed as the “sum” of other data will tend to have a normal distribution.

In some cases, the underlying processes act multiplicatively. For example, the distribution of household income is often log-normal. You can imagine that factors such as level of education, motivation, and parental support act to “multiply” income rather than simply adding a fixed amount of money. Similarly, data on animal abundance often has a log-normal distribution because factors such as survival act multiplicatively on the populations.

2.8.3 ln vs log

There is often much confusion about the form of the logarithmic transformation. For example, many calculators and statistical packages differentiate between the common logarithm (base 10, or log) and the natural logarithm (base e, or ln). Even worse, many packages actually use log to refer to natural logarithms and log10 to refer to common logarithms. IT DOESN’T MATTER which transformation is used as long as the proper back-transformation is applied. When you compare the actual values after these transformations, you will see that ln(Y) = 2.3 log10(Y), i.e. the log-transformed values differ by a fixed multiplicative constant. When the anti-logs are applied this constant will “disappear”.

In accordance with common convention in statistics and mathematics, the use of log(Y) will refer to the natural or ln(Y) transformation.
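The base change is easy to verify numerically (a sketch; the value 421 is arbitrary):

```python
import math

# ln(Y) = ln(10) * log10(Y): the two logarithms differ only by the
# constant ln(10), roughly 2.3, and each back-transform undoes its
# own logarithm, so the choice of base washes out.
y = 421.0
assert math.isclose(math.log(y), math.log(10) * math.log10(y))

print(round(math.log(10), 2))  # 2.3
print(math.exp(math.log(y)))   # back-transform of ln recovers y
print(10 ** math.log10(y))     # back-transform of log10 recovers y
```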

2.8.4 Mean vs Geometric Mean

The simple mean of Y is called the arithmetic mean (or simply the mean) and is computed in the usual fashion.

The anti-log of the mean of the log(Y) values is called the geometric mean. The geometric mean of a set of data is ALWAYS less than the mean of the original data. In the special case of log-normal data, the geometric mean will be close to the MEDIAN of the original data.

For example, look at the data above. The mean of Y is 421. The mean of log(Y) is 5.999 and exp(5.999) = 403, which is close to the median of the original data.
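The relationship is easy to check on any right-skewed data set; the values below are made up for illustration and are not the data from the figures:

```python
import math
import statistics

# Made-up right-skewed data: the geometric mean (anti-log of the mean of
# the logs) falls below the arithmetic mean and lands near the median.
y = [150, 220, 300, 400, 420, 500, 900, 1500]

arith_mean = statistics.fmean(y)
geo_mean = math.exp(statistics.fmean(math.log(v) for v in y))
med = statistics.median(y)

print(round(arith_mean, 1))  # 548.8
print(round(geo_mean, 1))    # smaller than the arithmetic mean
print(med)                   # 410.0 -- the geometric mean is closer to this
```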

This implies that when reporting results, you will need to be a little careful about how the back-transformed values are interpreted.

It is possible to go from the mean on the log-scale to the mean on the anti-log scale and vice-versa. For log-normal data,11 it turns out that

Ȳ_antilog ≈ exp(Ȳ_log + s²_log/2)

and

Ȳ_log ≈ log(Ȳ_antilog) − s²_antilog/(2 Ȳ²_antilog)

In this case:

Ȳ_antilog ≈ exp(5.999 + (.3)²/2) = 422

and

Ȳ_log ≈ log(421.81) − 131.2²/(2 × 421.8²) = 5.996
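Plugging the quoted summary statistics into the two approximations (a sketch; small disagreements are just rounding in the quoted values):

```python
import math

# Summary statistics quoted in the text: mean and sd on the log scale,
# and mean and sd on the original (anti-log) scale.
mean_log, sd_log = 5.999, 0.3
mean_antilog, sd_antilog = 421.81, 131.2

approx_mean_antilog = math.exp(mean_log + sd_log ** 2 / 2)
approx_mean_log = math.log(mean_antilog) - sd_antilog ** 2 / (2 * mean_antilog ** 2)

print(round(approx_mean_antilog))  # about 422
print(round(approx_mean_log, 3))   # about 5.996
```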

Unfortunately, the formula for the standard deviations is not as straightforward. There is a somewhat complicated formula available in many reference books, but a close approximation is that:

s_antilog ≈ s_log × exp(Ȳ_log)

and

s_log ≈ s_antilog / Ȳ_antilog

For the data above we see that:

s_antilog ≈ .3 × exp(5.999) = 121

and

s_log ≈ 131.21/421.81 = .311

which is close, but not exactly on the money.

11 Other transformations will have a different formula.


2.8.5 Back-transforming estimates, standard errors, and ci

Once inference is made on the transformed scale, it is often nice to back-transform and report results on the original scale.

For example, a study of turbidity (measured in NTU) on a stream in BC gave the following results on the log-scale:

Statistic              Value
Mean on log scale       5.86
Std Dev on log scale    0.96
SE on log scale         0.27
upper 95% ci Mean       6.4
lower 95% ci Mean       5.3

How should these be reported on the original NTU scale?

Mean on log-scale back to MEDIAN on anti-log scale

The simplest back-transform goes from the mean on the log-scale to the MEDIAN on the anti-log scale. The distribution is often symmetric on the log-scale so the mean, median, and mode on the log-scale all coincide. However, when you take anti-logs, the upper tail gets larger much faster than the lower tail and the anti-log transform re-introduces skewness into the back-transformed data. Hence, the center point on the log-scale gets back-transformed to the median on the anti-log scale.

The estimated MEDIAN (or GEOMETRIC MEAN) on the original scale is found by the back-transform of the mean on the log-scale, i.e.

median_antilog = exp(mean_log-scale)

estimated median = exp(5.86) = 350 NTU.

The 95% confidence interval for the MEDIAN is found by doing a simple back-transformation on the 95% confidence interval for the mean on the log-scale, i.e. from exp(5.3) = 200 to exp(6.4) = 602 NTUs. Note that the confidence interval on the back-transformed scale is no longer symmetric about the estimate.

There is no direct back-transformation of the standard error from the log-scale to the original scale, but an approximate standard error on the back-transformed scale is found as se_antilog = se_log × exp(5.86) = 95 NTUs.
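These back-transforms can be sketched with the rounded summary values from the table (so the computed numbers may differ slightly from those quoted in the text, which were presumably based on unrounded limits):

```python
import math

# Rounded log-scale summary statistics from the turbidity table.
mean_log, se_log = 5.86, 0.27
lcl_log, ucl_log = 5.3, 6.4

median_ntu = math.exp(mean_log)        # estimated median, about 351 NTU
ci_ntu = (math.exp(lcl_log), math.exp(ucl_log))
se_ntu = se_log * math.exp(mean_log)   # approximate SE, about 95 NTU

print(round(median_ntu))
print(tuple(round(v) for v in ci_ntu))
print(round(se_ntu))
```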

If the MEAN on the anti-log scale is needed, recall from the previous section that

Mean_antilog ≈ exp(Mean_log + std dev²_log/2)

Mean_antilog ≈ exp(Mean_log) × exp(std dev²_log/2)

Mean_antilog ≈ Median × exp(std dev²_log/2)

Mean_antilog ≈ Median × exp(.96²/2) = Median × 1.58.

Hence multiply the median, standard error of the median, and limits of the 95% confidence interval all by 1.58.

2.8.6 Back-transforms of differences on the log-scale

Some care must be taken when back-transforming differences on the log-scale. The general rule of thumb is that a difference on the log-scale corresponds to a log(ratio) on the original scale. Hence a back-transform of a difference on the log-scale corresponds to a ratio on the original scale.12

For example, here are the results from a study to compare turbidity before and after remediation was completed on a stream in BC.

Statistic        Value on log-scale
Difference            −0.8303
Std Err Dif            0.3695
Upper CL Dif          −0.0676
Lower CL Dif          −1.5929
p-value                0.0341

A difference of −.83 units on the log-scale corresponds to a ratio of exp(−.83) = .44 in the NTU on the original scale. In other words, the median NTU after remediation was .44 times that of the median NTU before remediation. Or, the median NTU before remediation was exp(.83) = 2.29 times that of the median NTU after remediation. Note that 2.29 = 1/.44.

The 95% confidence intervals are back-transformed in a similar fashion. In this case the 95% confidence interval on the RATIO of median NTUs lies between exp(−1.59) = .20 and exp(−.067) = .93, i.e. the median NTU after remediation was between .20 and .93 of the median NTU before remediation.
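The ratio back-transforms can be sketched directly from the table's log-scale values:

```python
import math

# Log-scale difference (after minus before) and its 95% CI from the table.
diff_log = -0.8303
lcl_log, ucl_log = -1.5929, -0.0676

ratio = math.exp(diff_log)     # median after / median before
inverse = math.exp(-diff_log)  # median before / median after
ci_ratio = (math.exp(lcl_log), math.exp(ucl_log))

print(round(ratio, 2))                       # 0.44
print(round(inverse, 2))                     # 2.29
print(tuple(round(v, 2) for v in ci_ratio))  # (0.2, 0.93)
```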

12 Recall that log(Y/Z) = log(Y) − log(Z)


If necessary you could also back-transform the standard error to get a standard error for the ratio on the original scale, but this is rarely done.

2.8.7 Some additional readings on the log-transform

Here are some additional readings on the use of the log-transform, taken from the WWW. The URL is presented at the bottom of each page.


Stats: Log transformation 2/18/05 10:51

http://www.cmh.edu/stats/model/linear/log.asp

Dear Professor Mean, I have some data that I need help with analysis. One suggestion is that I use a log transformation. Why would I want to do this? -- Stumped Susan

Dear Stumped

Think of it as employment security for us statisticians.

Short answer

If you want to use a log transformation, you compute the logarithm of each data value and then analyze the resulting data. You may wish to transform the results back to the original scale of measurement.

The logarithm function tends to squeeze together the larger values in your data set and stretches out the smaller values. This squeezing and stretching can correct one or more of the following problems with your data:

1. Skewed data
2. Outliers
3. Unequal variation

Not all data sets will suffer from these problems. Even if they do, the log transformation is not guaranteed to solve these problems. Nevertheless, the log transformation works surprisingly well in many situations.

Furthermore, a log transformation can sometimes simplify your statistical models. Some statistical models are multiplicative: factors influence your outcome measure through multiplication rather than addition. These multiplicative models are easier to work with after a log transformation.

If you are unsure whether to use a log transformation, here are a few things you should look for:

1. Is your data bounded below by zero?
2. Is your data defined as a ratio?
3. Is the largest value in your data more than three times larger than the smallest value?


Squeezing and stretching

The logarithm function squeezes together big data values (anything larger than 1). The bigger the data value, the more the squeezing. The graph below shows this effect.

The first two values are 2.0 and 2.2. Their logarithms, 0.69 and 0.79, are much closer. The second two values, 2.6 and 2.8, are squeezed even more. Their logarithms are 0.96 and 1.03.

The logarithm also stretches small values apart (values less than 1). The smaller the values, the more the stretching. This is illustrated below.



The values of 0.4 and 0.45 have logarithms (-0.92 and -0.80) that are further apart. The values of 0.20 and 0.25 are stretched even further. Their logarithms are -1.61 and -1.39, respectively.

Skewness

If your data are skewed to the right, a log transformation can sometimes produce a data set that is closer to symmetric. Recall that in a skewed right distribution, the left tail (the smaller values) is tightly packed together and the right tail (the larger values) is widely spread apart.

The logarithm will squeeze the right tail of the distribution and stretch the left tail, which produces a greater degree of symmetry.

If the data are symmetric or skewed to the left, a log transformation could actually make things worse. Also, a log transformation is unlikely to be effective if the data has a narrow range (if the largest value is not more than three times bigger than the smallest value).

Outliers

If your data has outliers on the high end, a log transformation can sometimes help. The squeezing of large values might pull that outlier back in closer to the rest of the data. If your data has outliers on the low end, the log transformation might actually make the outlier worse, since it stretches small values.

Unequal variation

Many statistical procedures require that all of your subject groups have comparable variation. If your data has unequal variation, then some of your tests and confidence intervals may be invalid. A log transformation can help with certain types of unequal variation.

A common pattern of unequal variation is when the groups with the large means also tend to have large standard deviations. Consider housing prices in several different neighborhoods. In one part of town, houses might be cheap, and sell for 60 to 80 thousand dollars. In a different neighborhood, houses might sell for 120 to 180 thousand dollars. And in the snooty part of town, houses might sell for 400 to 600 thousand dollars. Notice that as the neighborhoods got more expensive, the range of prices got wider. This is an example of data where groups with large means tend to have large standard deviations.


Stats: Log transformation 2/18/05 10:51


http://www.cmh.edu/stats/model/linear/log.asp

With this pattern of variation, the log transformation can equalize the variation. The log transformation will squeeze the groups with the larger standard deviations more than it will squeeze the groups with the smaller standard deviations. The log transformation is especially effective when the size of a group's standard deviation is directly proportional to the size of its mean.
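A small numeric sketch shows the equalizing effect when each group's standard deviation is proportional to its mean. The group means and the common relative spread below are invented purely for illustration:

```python
import math
import statistics

# Three groups built from the same *relative* spread, so each group's
# standard deviation is proportional to its mean (sd/mean is constant).
relative = [0.8, 0.9, 1.0, 1.1, 1.2]
groups = {m: [m * r for r in relative] for m in (70, 150, 500)}

raw_sds = {m: statistics.stdev(v) for m, v in groups.items()}
log_sds = {m: statistics.stdev([math.log10(x) for x in v])
           for m, v in groups.items()}

for m in groups:
    print(f"mean {m}: raw sd = {raw_sds[m]:6.2f}, log10 sd = {log_sds[m]:.4f}")
# The raw sds grow with the mean; the log10 sds are the same for every group.
```

On the log scale each group is just a shifted copy of the others (log(m*r) = log(m) + log(r)), so the log-scale standard deviations are identical.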

Multiplicative models

There are two common statistical models: additive and multiplicative. An additive model assumes that factors that change your outcome measure change it by addition or subtraction. An example of an additive model would be when we increase the number of mail-order catalogs sent out by 1,000, and that adds an extra $5,000 in sales.

A multiplicative model assumes that factors that change your outcome measure change it by multiplication or division. An example of a multiplicative model would be when an inch of rain takes half of the pollen out of the air.

In an additive model, the changes that we see are the same size, regardless of whether we are on the high end or the low end of the scale. Extra catalogs add the same amount to our sales regardless of whether our sales are big or small. In a multiplicative model, the changes we see are bigger at the high end of the scale than at the low end. An inch of rain takes a lot of pollen out on a high-pollen day but proportionately less pollen out on a low-pollen day.

If you remember your high school algebra, you'll recall that the logarithm of a product is equalto the sum of the logarithms.

Therefore, a logarithm converts multiplication/division into addition/subtraction. Another way to think about this: in a multiplicative model, large values imply large changes and small values imply small changes. The stretching and squeezing of the logarithm levels out the changes.
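This identity is easy to verify directly; the sketch below also shows that equal percentage changes become equal differences on the log scale:

```python
import math

# The log of a product is the sum of the logs.
assert math.isclose(math.log(3 * 7), math.log(3) + math.log(7))

# A 25% increase is the same *difference* on the log scale,
# whether it happens at the low end or the high end.
low_step = math.log10(0.25) - math.log10(0.20)   # 0.20 -> 0.25
high_step = math.log10(500) - math.log10(400)    # 400  -> 500
print(round(low_step, 4), round(high_step, 4))   # both 0.0969
```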

When should you consider a log transformation?

There are several situations where a log transformation should be given special consideration.

Is your data bounded below by zero? When your data are bounded below by zero, you often have problems with skewness. The bound of zero prevents outliers on the low end and constrains the left tail of the distribution to be tightly packed. Also, groups with means close to zero are more constrained (hence less variable) than groups with means far away from zero.

It does matter how close you are to zero. If your mean is within a standard deviation or two of zero, then expect some skewness. After all, the bell-shaped curve, which spreads out about three standard deviations on either side, would crash into zero and cause a traffic jam in the left tail.

Is your data defined as a ratio? Ratios tend to be skewed by their very nature. They also tend to have models that are multiplicative.

Is the largest value in your data more than three times larger than the smallest value? The


relative stretching and squeezing of the logarithm only has an impact if your data has a wide range. If the maximum of your data is not at least three times as big as your minimum, then the logarithm can't squeeze and stretch your data enough to have any useful impact.
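These rules of thumb could be collected into a small screening helper. This is only a sketch; the function name, thresholds, and example values are my own, not part of the original page:

```python
import statistics

def log_transform_reasons(values, is_ratio=False):
    """Collect the rules of thumb (from the text above) that suggest
    considering a log transformation for this data."""
    if min(values) <= 0:
        return []  # the log is undefined for zero or negative values
    reasons = []
    if is_ratio:
        reasons.append("data defined as a ratio")
    if max(values) / min(values) > 3:
        reasons.append("largest value more than 3x the smallest")
    if statistics.mean(values) < 2 * statistics.stdev(values):
        reasons.append("mean within two sds of zero: skewness likely")
    return reasons

print(log_transform_reasons([0.02, 0.1, 0.4], is_ratio=True))
print(log_transform_reasons([4.0, 5.0, 6.0]))  # narrow range: nothing flagged
```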

Example

The DM/DX ratio is a measure of how rapidly the body metabolizes certain types of medication. A patient is given a dose of dextromethorphan (DM), a common cough medication. The patient's urine is collected for four hours, and the concentrations of DM and DX (a metabolite of dextromethorphan) are measured. The ratio of DM concentration to DX is a measure of how well the CYP 2D6 metabolic pathway functions. A ratio less than 0.3 indicates normal metabolism; larger ratios indicate slow metabolism.

Genetics can influence CYP 2D6 metabolism. In this set of 206 patients, we have 15 with no functional alleles and 191 with one or more functional alleles.

The DM/DX ratio is a good candidate for a log transformation since it is bounded below by zero. It is also obviously a ratio. The standard deviation for these data (0.4) is much larger than the mean (0.1).

Finally, the largest value is several orders of magnitude bigger than the smallest value.

Skewness

The boxplots below show the original (untransformed) data for the 15 patients with no functional alleles. The graph also shows the log-transformed data. Notice that the untransformed data show quite a bit of skewness. The lower whisker and the lower half of the box are packed tightly, while the upper whisker and the upper half of the box are spread widely.

The log-transformed data, while not perfectly symmetric, do tend to have a better balance between the lower half and the upper half of the distribution.



Outliers

The graph below shows the untransformed and log-transformed data for the subset of patients with exactly two functional alleles (n=119). The original data have two outliers which are almost 7 standard deviations above the mean. The log-transformed data are not perfect, and perhaps there is now an outlier on the low end. Nevertheless, the worst outlier is still within 4 standard deviations of the mean. The influence of outliers is much less extreme with the log-transformed data.

Unequal variation

When we compute standard deviations for the patients with no functional alleles and the patients with one or more functional alleles, we see that the former group has a much larger standard deviation. This is not too surprising. The patients with no functional alleles are further from the lower bound and thus have much more room to vary.

After a log transformation, the standard deviations are much closer.



Summary

Stumped Susan wants to understand why she should use a log transformation for her data. Professor Mean explains that a log transformation is often useful for correcting problems with skewed data, outliers, and unequal variation. This works because the log function squeezes the large values of your data together and stretches the small values apart. The log transformation is also useful when you believe that factors have a multiplicative effect. You should consider a log transformation when your data are bounded below by zero, when your data are defined as a ratio, and/or when the largest value in your data is at least three times as big as the smallest value.


Further reading

Keene ON. The log transformation is special. Stat Med 1995; 14(8): 811-9.




Confidence Intervals Involving Logarithmically Transformed Data 2/18/05 10:52


http://www.tufts.edu/~gdallal/ci_logs.htm

Confidence Intervals Involving Data to Which a Logarithmic Transformation Has Been Applied

These data were originally presented in Simpson J, Olsen A, and Eden J (1975), "A Bayesian Analysis of a Multiplicative Treatment Effect in Weather Modification," Technometrics, 17, 161-166, and subsequently reported and analyzed by Ramsey FL and Schafer DW (1997), The Statistical Sleuth: A Course in Methods of Data Analysis. Belmont, CA: Duxbury Press. They involve an experiment performed in southern Florida between 1968 and 1972. An aircraft was flown through a series of clouds and, at random, seeded some of them with massive amounts of silver iodide. Precipitation after the aircraft passed through was measured in acre-feet.

The distribution of precipitation within each group (seeded or not) is positively skewed (long-tailed to the right). The group with the higher mean has a proportionally larger standard deviation as well. Both characteristics suggest that a logarithmic transformation be used to make the data more symmetric and homoscedastic (more equal spread). The second pair of box plots bears this out. This transformation will tend to make CIs more reliable; that is, the level of confidence is more likely to be what is claimed.

                          N    Mean     Std. Deviation   Median
Rainfall    Not Seeded    26   164.6    278.4            44.2
            Seeded        26   442.0    650.8            221.6

                          N    Mean     Std. Deviation   Geometric Mean
LOG_RAIN    Not Seeded    26   1.7330   0.7130           54.08
            Seeded        26   2.2297   0.6947           169.71

95% Confidence Interval for the Mean Difference, Seeded - Not Seeded (logged data)

                              Lower    Upper
Equal variances assumed       0.1046   0.8889
Equal variances not assumed   0.1046   0.8889

Researchers often transform data back to the original scale when a logarithmic transformation has been applied to a set of data. Tables might include geometric means, which are the anti-logs of the mean of the logged data. When data are positively skewed, the geometric mean is invariably less than the arithmetic mean. This leads to questions of whether the geometric mean has any interpretation other than as the anti-log of the mean of the log-transformed data.


The geometric mean is often a good estimate of the original median. The logarithmic transformation is monotonic; that is, data are ordered the same way in the log scale as in the original scale. If a is greater than b, then log(a) is greater than log(b). Since the observations are ordered the same way in both the original and log scales, the observation in the middle in the original scale is also the observation in the middle in the log scale; that is,

the log of the median = the median of the logs

If the log transformation makes the population symmetric, then the population mean and median are the same in the log scale. Whatever estimates the mean also estimates the median, and vice versa. The mean of the logs estimates both the population mean and median in the log-transformed scale. If the mean of the logs estimates the median of the logs, then its anti-log, the geometric mean, estimates the median in the original scale!
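The chain of reasoning (the log preserves order, so the median commutes with it; the geometric mean is the antilog of the mean of the logs) can be checked on a toy data set that is symmetric on the log scale:

```python
import math
import statistics

data = [1, 10, 100, 1000, 10000]       # symmetric on the log10 scale
logs = [math.log10(x) for x in data]   # [0, 1, 2, 3, 4]

# The log of the median equals the median of the logs.
assert statistics.median(logs) == math.log10(statistics.median(data))

# The geometric mean is the antilog of the mean of the logs;
# for log-symmetric data it recovers the original median.
geometric_mean = 10 ** statistics.mean(logs)
print(geometric_mean)  # 100.0
```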

The median rainfall for the seeded clouds is 221.6 acre-feet. In the picture, the solid line between the two histograms connects the median in the original scale to the mean in the log-transformed scale.

One property of the logarithm is that "the difference between logs is the log of the ratio"; that is, log(x) - log(y) = log(x/y). The confidence interval from the logged data estimates the difference between the population means of log-transformed data; that is, it estimates the difference between the logs of the geometric means. However, the difference between the logs of the geometric means is the log of the ratio of the geometric means. The anti-logarithms of the endpoints of this confidence interval give a confidence interval for the ratio of the geometric means itself. Since the geometric mean is sometimes an estimate of the median in the original scale, it follows that a confidence interval for the ratio of the geometric means is approximately a confidence interval for the ratio of the medians in the original scale.

In the (common) log scale, the mean difference between seeded and unseeded clouds is 0.4967. Our best estimate of the ratio of the median rainfall of seeded clouds to that of unseeded clouds is 10^0.4967 = 3.14. Our best estimate of the effect of cloud seeding is that it produces 3.14 times as much rain on average as not seeding.

Even when the calculations are done properly, the conclusion is often misstated.

The difference 0.4967 does not mean seeded clouds produce 0.4967 acre-feet more rain than unseeded clouds. It is also improper to say that seeded clouds produce 0.4967 log-acre-feet more than unseeded clouds.

The 3.14 means 3.14 times as much. It does not mean 3.14 times more (which would be 4.14 times as much). It does not mean 3.14 acre-feet more. It is a ratio and has to be described that way.

The 95% CI for the population mean difference (Seeded - Not Seeded) is (0.1046, 0.8889). For reporting purposes, this CI should be transformed back to the original scale. A CI for a difference in the log scale becomes a CI for a ratio in the original scale.

The antilogarithms of the endpoints of the confidence interval are 10^0.1046 = 1.27 and 10^0.8889 = 7.74.
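The arithmetic can be reproduced directly from the table values (base-10 logs):

```python
# Means of the logged data, taken from the table above.
log_mean_diff = 2.2297 - 1.7330        # = 0.4967 (Seeded - Not Seeded)
ratio = 10 ** log_mean_diff            # ratio of geometric means
ci_ratio = tuple(10 ** x for x in (0.1046, 0.8889))

print(round(ratio, 2))                       # 3.14
print(tuple(round(x, 2) for x in ci_ratio))  # (1.27, 7.74)
```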


Thus, the report would read: "The geometric mean of the amount of rain produced by a seeded cloud is 3.14 times as much as that produced by an unseeded cloud (95% CI: 1.27 to 7.74 times as much)." If the logged data have a roughly symmetric distribution, you might go so far as to say, "The median amount of rain...is approximately..."

Comment: The logarithm is the only transformation that produces results that can be cleanly expressed in terms of the original data. Other transformations, such as the square root, are sometimes used, but it is difficult to restate their results in terms of the original data.

Copyright © 2000 Gerard E. Dallal. Last modified: Mon Sep 30 2002 14:15:42.


CHAPTER 2. INTRODUCTION TO STATISTICS

2.9 Standard deviations and standard errors re-visited

The use of standard deviations and standard errors in reports and publications can be confusing. Here are some typical questions asked by students about these two concepts.

I am confused about why different graphs in different publications display the mean ± 1 standard deviation; the mean ± 2 standard deviations; the mean ± 1 se; or the mean ± 2 se. When should each graph be used?

What is the difference between a box-plot; ± 2 se; and ± 2 standard deviations?

The foremost distinction between the use of standard deviations and standard errors can be made as follows:

Standard deviations should be used when information about INDIVIDUAL observations is to be conveyed; standard errors should be used when information about the precision of an estimate is to be conveyed.
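The distinction is easy to see in a computation: the standard deviation describes the spread of the sample values themselves, while the standard error (sd/√n under simple random sampling) describes the precision of the estimated mean. The data values below are invented for illustration:

```python
import math
import statistics

values = [12.1, 14.3, 9.8, 11.5, 13.0, 10.7]  # hypothetical measurements
n = len(values)
sd = statistics.stdev(values)     # spread of INDIVIDUAL observations
se = sd / math.sqrt(n)            # precision of the estimated mean
print(f"sd = {sd:.3f}, se = {se:.3f}")
```

Collecting more data leaves the sd roughly unchanged but steadily shrinks the se.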

There are, in fact, several common types of graphs that can be used to display the distribution of the INDIVIDUAL data values. Common displays from "closest to raw data" to "based on summary statistics" are:

• dot plots

• stem and leaf plots

• histograms

• box plots

• mean ± 1 std dev. NOTE this is NOT the same as the estimate ± 1 se

• mean ± 2 std dev. NOTE this is NOT the same as the estimate ± 2 se

The dot plot is a simple plot of the actual raw data values (e.g. that seen in JMP when the Analyze->Fit Y-by-X platform is invoked). It is used to check for

© 2010 Carl James Schwarz


outliers and other unusual points. Often jittering is used to avoid overprinting any duplicate data points. It is useful for up to about 100 data points. Here is an example of a dot plot of air quality data in several years:

Stem and leaf plots and histograms are similar. Both first start by creating 'bins' representing ranges of the data (e.g. 0→4.9999, 5→9.9999, 10→14.9999, etc.). Then the number of data points in each bin is tabulated. The display shows the count or the frequency in each bin. The general shape of the data is examined (e.g. is it symmetrical, or skewed, etc.).

Here are two examples of histograms and stem-and-leaf charts:


The box-plot is an alternate method of displaying the individual data values. The box portion displays the 25th, 50th, and 75th percentiles [13] of the data. The

[13] The pth percentile in a data set is the value such that at least p% of the data are less than the percentile, and at least (100-p)% of the data values are greater than the percentile. For example, the median = .5 quantile = 50th percentile is the value such that at least 50% of the data values are below the median and at least 50% of the data values are above the median. The 25th percentile = .25 quantile = 1st quartile is the value such that at least 25%


definition of the extent of the whiskers depends upon the statistical package, but they generally stretch to show the "typical" range to be expected from the data. Outliers may be "indicated" in some plots.

The box-plot is an alternative (and in my opinion a superior) display to a graph showing the mean ± 2 standard deviations because it conveys more information. For example, a box plot will show if the data are symmetric (25th, 50th, and 75th percentiles roughly equally spaced) or skewed (the median much closer to one of the 25th or 75th percentiles). The whiskers show the range of the INDIVIDUAL data values. Here is an example of side-by-side box plots of the air quality data:

The mean ± 1 STD DEV shows a range where you would expect about 68% of the INDIVIDUAL data VALUES, assuming the original data came from a normally distributed population. The mean ± 2 STD DEV shows a range where you would expect about 95% of INDIVIDUAL data VALUES, assuming the original data came from a normally distributed population. The latter two plots are NOT RELATED to confidence intervals! This plot might be useful when the intent is to show the variability encountered in the sampling or the presence of outliers, etc. It is unclear why many journals still accept graphs with ± 1 standard deviation, as most people are interested in the range of the data collected, so ± 2 standard deviations would be more useful. Here is a plot of the

of the data values are less than the value and at least 75% of the data values are greater than this value.


mean ± 1 standard deviation (plots of the mean ± 2 standard deviations are not available in JMP):

I generally prefer the use of dot plots and box plots, as these are much more informative than stem-and-leaf plots, histograms, or the mean ± some multiple of standard deviations.
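The 68% and 95% figures quoted above for mean ± 1 and ± 2 standard deviations are simply normal-distribution coverage probabilities, which can be computed from the error function:

```python
import math

def normal_coverage(k):
    """P(mean - k*sd <= X <= mean + k*sd) for a normally distributed X."""
    return math.erf(k / math.sqrt(2))

print(round(normal_coverage(1), 3))  # 0.683
print(round(normal_coverage(2), 3))  # 0.954
```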

Then there are displays showing the precision of estimates. Common displays are:

• mean ± 1 SE

• mean ± 2 SE

• lower and upper bounds of confidence intervals

• diamond plots

These displays do NOT have anything to do with the sample values; they are trying to show the location of plausible values for the unknown population parameter, in this case, the population mean. A standard error measures how variable an estimate would likely be if repeated samples/experiments from the same population were performed. Note that a se says NOTHING about the actual sample values! For example, it is NOT correct to say that a 95% confidence interval contains 95% of INDIVIDUAL data values.
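A quick calculation makes the point: as the sample size grows, mean ± 2 sd (describing individual values) stays put, while mean ± 2 se (describing the precision of the mean) shrinks toward zero width. The sd value here is arbitrary:

```python
import math

sd = 10.0  # illustrative population standard deviation
for n in (4, 25, 100):
    se = sd / math.sqrt(n)
    print(f"n={n:3d}: half-width of mean±2sd = {2*sd:.1f}, of mean±2se = {2*se:.1f}")
```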


The mean ± 1 SE display is not very informative, as it corresponds to an approximate 68% confidence interval. The mean ± 2 SE corresponds to an approximate 95% confidence interval IN THE CASE OF SIMPLE RANDOM SAMPLING. Here is a plot of the mean ± 1 se; a plot of the mean ± 2 se is not available directly in JMP except as a bar above and below the mean.


Alternatively, JMP can plot confidence interval diamonds.


Graphs showing ± 1 or 2 standard errors are showing the range of plausible values for the underlying population mean. It is unclear why many journals still publish graphs with ± 1 se, as this corresponds to an approximate 68% confidence interval. I think that a 95% confidence interval, corresponding to ± 2 se, would be more useful.

Caution. Often these graphs (e.g. created by Excel) use the simple formula for the se of the sample mean collected under a simple random sample even if the underlying design is more complex! In this case, the graph is in error and should not be interpreted!

Both the confidence interval and the diamond plots (if computed correctly for a particular sampling design and estimator) correspond to a 95% confidence interval.

2.10 Other tidbits

2.10.1 Interpreting p-values

I have a question about p-values. I'm confused about the wording used when they explain the p-value. They say 'with p=0.03, in 3 percent of experiments like this we would observe sample means as different as or more different than the ones we got, if in fact the null hypothesis were true.' The part that gets me is the 'as different as or more different than'. I think I'm just having problems putting it into words that make sense to me. Do you have another way of saying it?

The p-value measures the ‘unusualness’ of the data assuming that the nullhypothesis is true. The ‘confusing’ part is how to measure unusualness.

For example: is a person 7 ft (about 2 m) tall unusually tall? Yes, because only a small fraction of people are AS TALL OR TALLER.

Now if the hypothesis is two-sided, both small and large values of the sample mean (relative to the hypothesized value) are unusual. For example, suppose that the null hypothesis is that the mean amount in bottles of pop is 250 mL. We would be very surprised if the sample mean was very small (e.g. 150 mL) or very large (e.g. 350 mL).

That is why, for a two-sided test, the unusualness is ‘as different or moredifferent’. You aren’t just interested in the probability of getting exactly 150


or 350, but rather in the probability that the sample mean is < 150 or > 350 (analogous to the probability of being 7 ft tall or taller).
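For a normal-theory test of a mean, this "as different or more different" probability is P(|Z| >= |observed z|), which a short function can compute. The sample numbers below are made up to match the bottling example:

```python
import math

def two_sided_p(sample_mean, hyp_mean, sd, n):
    """Normal-theory two-sided p-value: P(|Z| >= |observed z|) under H0."""
    z = (sample_mean - hyp_mean) / (sd / math.sqrt(n))
    return math.erfc(abs(z) / math.sqrt(2))

# H0: mean fill = 250 mL; hypothetical sample of 25 bottles
p = two_sided_p(sample_mean=252.0, hyp_mean=250.0, sd=5.0, n=25)
print(round(p, 4))  # 0.0455
```

Both tails are counted: a sample mean of 248 mL would give exactly the same p-value as 252 mL.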

2.10.2 False positives vs false negatives

What is the difference between a false positive and a false negative?

A false positive (Type I error) occurs if you conclude that the evidence against the hypothesis of interest is strong when, in fact, the hypothesis is true. For example, in a pregnancy test, the null hypothesis is that the person is NOT pregnant. A false positive reading would occur if the test indicates a pregnancy when in fact the person is not pregnant. A false negative (Type II error) occurs if you find insufficient evidence against the null hypothesis when, in fact, the hypothesis is false. In the case of a pregnancy test, a false negative would occur if the test indicates not pregnant when, in fact, the person is pregnant.

2.10.3 Specificity/sensitivity/power

Please clarify specificity/sensitivity/power of a test. Are they thesame?

The power and sensitivity are two terms for the ability to find sufficient evidence against the null hypothesis when, in fact, the null hypothesis is false. For example, a pregnancy test with 99% sensitivity correctly identifies a pregnancy 99% of the time when in fact the person is pregnant.

The specificity of a test indicates the ability to NOT find evidence against the null hypothesis when the null hypothesis is true, i.e. the ability to avoid a Type I error. A pregnancy test would have high specificity if it rarely declares a pregnancy for a non-pregnant person.
