
Descriptive Statistics (2)

Date post: 10-Nov-2015
Upload: valeed

Introduction to Biostats

Biostatistics

The word statistics derives from the Latin status, meaning information useful to the state, e.g., the sizes of populations and armed forces.

Statistics refers to numerical data relating to an aggregate of facts. The term also covers the procedures and techniques used to collect, process and analyze data in order to make inferences and reach decisions in the face of uncertainty.

Important characteristics of Biostatistics:

It deals with uncertainties in population groups and events.

It deals with data subject to random variation, such as the height of children.

The study design and data collection procedures have to be correct to obtain meaningful statistics.

Biostatistics can be divided into:

1. Descriptive: Deals with the concepts and methods for summarizing and describing the important aspects of numerical data.

2. Inferential: Deals with procedures for making inferences about the characteristics of a large population by using a part of the data, called the sample.

Definitions

Population: the set of measurements of interest to the sample collector.

Sample: any subset of measurements selected from the population.

Element/Unit: an entity on which measurements are obtained.

Observation: the set of measurements obtained for each element.

Data: facts and figures collected, summarised and analysed.

Data set: the set of different variables in a particular study.

Statistical analyses need variability; otherwise there is nothing to study. Statistics is concerned mainly with variables.

Variation is important!

Any type of observation which can take different values for different people, times, places, species, etc. is called a

VARIABLE

E.g., height, weight, uric acid level, X-ray findings, parity, social class, etc.

A mathematical constant takes a fixed value, e.g., the ratio of the circumference of a circle to its diameter is a constant, approximately 3.141592654, for circles of all sizes.

Types of variables

A QUALITATIVE variable is one which does not take a numerical value. It describes a characteristic, e.g., gender, survival or death, place of birth, colour of eyes, etc.

A QUANTITATIVE variable takes a numerical value, e.g., height, blood pressure, lung capacity, exact age, parity, number of cases in a study, completed family size, age last birthday, etc.

Variable
    Quantitative (measurement)
        Continuous (real-valued), e.g. height
        Discrete (count data), e.g. number of admissions
    Qualitative or categorical
        Ordinal (ordered), e.g. response to treatment
        Nominal (not ordered), e.g. ethnic group

Categorical Variables

Cannot be measured numerically

Categories must not overlap and must cover all possibilities

For example, it makes no sense to compute the mean of gender.

Categorical Nominal Variables

Named categories
No implied order among categories
Examples:
Gender: Male/Female
Blood Groups: O, A, B, AB
Ethnic Group: Chinese, Malay, Indian, Jordanian
Eye color: brown/black/blue/green/mixed

Ordering is arbitrary, and no information is gained or lost by changing the order.

Categorical Ordinal Variables

Same as nominal but with ordered categories
Differences between categories may not be considered equal
Examples:
Grading: excellent, satisfactory, unsatisfactory
Pain severity: no pain, slight pain, moderate pain, severe pain

Many health care variables are ordinal in nature: much improved, somewhat improved, same, worse, dead; stage 1, 2, 3, 4 cancer. Use the difference test: is the difference between stage 1 and stage 2 the same as the difference between stage 3 and stage 4? If not, then the data are ordinal.

Ordinal data can be named categories or numerical. An example of numerical ordinal data is the ranks of students in a class based on grades: one could rank them and take the median of the ranks, and the grade at that rank would be the median grade for the class. (Think about this; the median for ordinal data needs a review.)

Quantitative Variables

Can be measured numerically
Examples:
weight
# of admissions to the hospital
concentration of chlorine
Can be discrete or continuous

Discrete Numerical Variables

Integers that correspond to a count
Can assume only whole numbers
Examples:
# of bacterial colonies on a plate
# of missing teeth
# of accidents in a time period
# of illnesses in a time period

Interval and ratio data were left out: interval data, such as IQ, have equal distances between values and a non-meaningful zero, whereas ratio data have a meaningful zero. Since interval and ratio data are analyzed the same way statistically, I feel this is too much detail.

Ranks with equal intervals are interval data and can use the mean.

Discrete numerical data are like interval data. (Need to check this and find out what kind of tests can be used for this.)

Continuous Data

Continuous data are measured
Can take any value within a defined range
Limitations imposed by the measuring stick

Examples: blood pressure, height, weight, time

Why Does it Matter?

Categorical and quantitative variables are statistically summarized and presented in different ways.

Variable Type    Data Presentation
Quantitative     Graphs, Tables
Categorical      Charts, Tables

TYPES of DATA

Qualitative data = Categorical data
Quantitative data = Numerical data

Qualitative/Categorical Data

There are two types of categorical data: nominal data and ordinal data.

NOMINAL DATA

In NOMINAL DATA, the variables are divided into named categories. These categories, however, cannot be ordered one above another (as they are not greater or less than each other).

Example:
NOMINAL DATA      CATEGORIES
Sex/Gender:       male, female
Marital status:   single, married, widowed, separated, divorced

ORDINAL DATA

In ORDINAL DATA, the variables are also divided into a number of categories, but they can be ordered one above another, from lowest to highest or vice versa.

Example:
ORDINAL DATA               CATEGORIES
Level of knowledge:        good, average, poor
Opinion on a statement:    fully agree, agree, disagree, totally disagree

Numerical Data

We speak of NUMERICAL DATA if the VARIABLES are expressed in numbers. They can be examined through:
Frequency distribution
Percentages, proportions, ratios and rates
Figures, etc.

Numerical data may be discrete or continuous.

Discrete numerical data are counts, which can be expressed only as whole numbers, e.g., number of people, parity, number of males/females in a family, etc.

Continuous numerical data are measures which can take any value between two whole numbers, e.g., weight, height, uric acid levels, etc.

SCALES OF MEASUREMENT

There are four scales (or levels) at which we measure:

__________________________________________________________
Level      Scale      Characteristic
__________________________________________________________
Lowest     Nominal    naming
           Ordinal    ordering
           Interval   equal interval without absolute zero
Highest    Ratio      equal interval with absolute zero
__________________________________________________________

Data summarization

Measures of Central Location
Measures of Dispersion
Measures of Shapes

[Figure: distribution of number of people by age, marking central location and spread]

Measures of Central Location

Definition: a single value that represents (is a good summary of) an entire distribution of data

Also known as:
Measure of central tendency
Measure of central position

Common measures
Arithmetic mean
Median
Mode

Raw data set: Ages of students in a class (years)

27, 30, 28, 31, 28, 36, 29, 37, 29, 34, 30, 30, 27, 30, 28, 31, 32, 30, 29, 29

To best understand the concepts and calculations of position, we will walk through the following dataset. Here we have a raw dataset of the ages, in years, of students in a class.

Order the data set from the lowest value to the highest value, then add observation numbers.

Obs   Age      Obs   Age
1     27       11    30
2     27       12    30
3     28       13    30
4     28       14    30
5     28       15    31
6     29       16    31
7     29       17    32
8     29       18    34
9     29       19    36
10    30       20    37

We add observation numbers for ease of calculating position values.

Mode

Definition: Mode is the value that occurs most frequently

The mode is the simplest measure of central location. It requires no calculations. It is simply the value that occurs most frequently.

Method for identification
1. Arrange data into a frequency distribution or histogram, showing the values of the variable and the frequency with which each value occurs.
2. Identify the value that occurs most often.

From the raw data we can create a frequency distribution.

Age     Frequency
27      2
28      3
29      4
30      5    <- Mode
31      2
32      1
33      0
34      1
35      0
36      1
37      1
Total   20

The mode is the value that occurs most frequently. From the frequency distribution you can see that age 30 is the most common age (occurring 5 times), so the mode of this distribution is 30.

[Figure: histogram of frequency by age (years), 27 to 37, marking the most frequent value of the variable; Mode = 30]

It is even easier to pick out the mode from a histogram. A histogram is basically a graphical representation of the frequency distribution. The tallest column represents the mode. Here, the mode is at age 30.

Finding Mode from Length of Stay Data

0, 2, 3, 4, 5, 5, 6, 7, 8, 9, 9, 9, 10, 10, 10, 10, 10, 11, 12, 12, 12, 13, 14, 16, 18, 18, 19, 22, 27, 49

Here are some data from a study of hospitalized patients. Each number represents the number of nights that one of 30 patients spent in the hospital. Take a minute to look at the data. Can you identify the mode?

Most values occur once, but there are
2 5s
3 9s
5 10s
3 12s
2 18s
Mode = 10
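The two-step method (build a frequency distribution, then pick the most frequent value) can be sketched in Python. This is only an illustration: the helper name `modes` is an assumption made here, and returning an empty list when every value is equally frequent is just one way to encode "no mode".

```python
from collections import Counter

# Nights of stay for the 30 hospitalized patients, copied from the text.
nights = [0, 2, 3, 4, 5, 5, 6, 7, 8, 9, 9, 9, 10, 10, 10, 10, 10, 11, 12, 12,
          12, 13, 14, 16, 18, 18, 19, 22, 27, 49]


def modes(values):
    """Return all values tied for the highest frequency (illustrative helper)."""
    freq = Counter(values)                  # step 1: frequency distribution
    counts = set(freq.values())
    if len(freq) > 1 and len(counts) == 1:
        return []                           # every value equally frequent: no mode
    top = max(counts)
    return sorted(v for v, c in freq.items() if c == top)  # step 2: most frequent


print(modes(nights))                 # [10]
print(modes(nights[:-1] + [149]))    # [10]
```

Replacing the 49-night stay with 149 leaves the result unchanged, because the mode depends only on which value is most frequent.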

Here are the data in a typical Epi Info 6 frequency distribution. You can easily see that the number 10 has the highest frequency, so the mode is 10.

Finding Mode from Histogram

The mode can be seen quite clearly with a histogram: the histogram shows the mode at a glance.

Mode Sensitive to Outliers?

As we will see later, some measures of central location are sensitive to (are affected by) outliers or extreme values. What if the patient who had stayed 49 days had, in fact, stayed 149 days? Would that affect the mode? [Answer: No.] So the mode is NOT sensitive to outliers.

[Figure: two population histograms, one labelled Unimodal Distribution and one labelled Bimodal Distribution]

Depending on the variable, there can be one or multiple modes in a dataset.

Can the class think of any examples?

Can you think of any distributions that may not have a mode? [Answer: when there is the same number of observations at each value.]

Mode Properties / Uses

So let's summarize what you should know about modes. The mode is:
Easiest measure to understand, explain, identify
Always equals an original value
Insensitive to extreme values (outliers)
The mode is a perfectly fine descriptive measure (what is the most common or popular value?), but it has poor statistical properties: we don't do calculations based on the mode.
May be more than one mode
May be no mode
Does not use all the data

Median

Definition: Median is the middle value; also, the value that splits the distribution into two equal parts
50% of observations are below the median
50% of observations are above the median

Method for identification
1. Arrange observations in order
2. Find middle position as (n + 1) / 2
3. Identify the value at the middle

Median: Odd Number of Values

n = 19 (observations 1 to 19 of the ordered ages, 27 through 36)

Median observation = (n + 1) / 2 = (19 + 1) / 2 = 20 / 2 = 10
Median age = 30 years

When there is an odd number of values, such as the 19 values shown here, the middle value of the dataset is the median.

An easy way to find the middle value is to add 1 to the total number of values and divide by 2. Here, we would take 19 + 1 = 20, divided by 2 = 10. Therefore the middle value is the 10th observation = 30 years.

Median: Even Number of Values

n = 20

Median observation = (n + 1) / 2 = (20 + 1) / 2 = 21 / 2 = 10.5
Median age = average value between 10th and 11th observations = (30 + 30) / 2 = 30 years

When there is an even number of values, such as this series of 20 values, the median is the average of the 2 middle values. This is shown by calculating the median observation as 10.5, half way between 10 and 11. Therefore the median value is the average of the 10th and 11th observations = (30 + 30) / 2 = 30 years.

Median at 50% = 10

Often, we have larger datasets and we analyze the data with a computer. Remember that the median represents the half-way point of the data. So the easiest way of looking for the median when you have a frequency distribution like this is to see where 50% would fall in the cumulative frequency column. Here, 9 hospital days includes 40.0% of the data, and 10 hospital days includes 56.7% of the data, so 50% falls in the 10 hospital day category.

Find Median of Length of Stay Data; Is Median Sensitive to Outliers?

0, 2, 3, 4, 5, 5, 6, 7, 8, 9, 9, 9, 10, 10, 10, 10, 10, 11, 12, 12, 12, 13, 14, 16, 18, 18, 19, 22, 27, 49

0, 2, 3, 4, 5, 5, 6, 7, 8, 9, 9, 9, 10, 10, 10, 10, 10, 11, 12, 12, 12, 13, 14, 16, 18, 18, 19, 22, 27, 149

Look at the top box. Here are the length of stay data again. Remember, there are 30 observations. Where is the middle of the distribution? (n + 1) / 2 = (30 + 1) / 2 = 31 / 2 = 15.5, or between the 15th and 16th observations. Both equal 10, so the median equals 10.

Now look at the bottom box. Does the change in the last length of stay affect the median? No. So the median is not sensitive to one or a few outliers or extreme values.

Median Properties / Uses

So let's summarize what you should know about medians. The median:
Does not use all the data in the distribution, only 1 or 2 values in the middle.
So it is insensitive to extreme values (outliers).
Like the mode, the median is a good descriptive measure, but has poor statistical properties. Therefore, the median is not commonly used for additional statistical manipulations.
Because the median is indifferent to values in the tails of a distribution, it is the measure of choice for skewed data. We will discuss this again later.
Finally, the median equals an original value if n is odd, but it is the average of 2 values if n is even.
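The (n + 1) / 2 position rule can be written out as a short Python sketch. The function name `median_by_position` is an assumption made for this illustration, and the input is assumed to be a non-empty list of numbers.

```python
import math

# Nights of stay for the 30 hospitalized patients, copied from the text.
nights = [0, 2, 3, 4, 5, 5, 6, 7, 8, 9, 9, 9, 10, 10, 10, 10, 10, 11, 12, 12,
          12, 13, 14, 16, 18, 18, 19, 22, 27, 49]


def median_by_position(values):
    """Median via the middle position (n + 1) / 2; assumes a non-empty list."""
    ordered = sorted(values)                    # 1. arrange observations in order
    pos = (len(ordered) + 1) / 2                # 2. middle position (1-based)
    lower = ordered[math.floor(pos) - 1]        # 3. value(s) at the middle
    upper = ordered[math.ceil(pos) - 1]
    return (lower + upper) / 2                  # same value twice when n is odd


print(median_by_position(nights))                 # 10.0
print(median_by_position(nights[:-1] + [149]))    # 10.0
```

With n = 30 the position is 15.5, so the 15th and 16th observations are averaged; changing the largest value from 49 to 149 does not move either of them.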

Quartiles

Definition: Quartiles are the values that split the distribution into four equal parts

25% of observations are below the first quartile (Q1)
25% of observations are between Q1 and Q2 (median)
25% of observations are between Q2 (median) and Q3
25% of observations are above Q3

[Consider skipping or deleting the Quartile slides - rcd]

The median was the value that split the distribution into 2 equal parts. The quartiles are the values that split the distribution into 4 equal parts.

The first quartile, designated as Q1, has 25% of the observations below it, and 75% above it.
The second quartile, Q2, is the median. It has 50% of the observations below it, and 50% above it.
The third quartile, Q3, has 75% of the observations below it, and only 25% above it.

One way to find the quartiles is to first find the median. Q1 is the median of the values between the lowest value and the median. Q3 is the median of the values between the median and the highest value of the dataset.

Quartiles of the ordered ages (n = 20)

Q1 observation = round[(n + 1) / 4] = round[(20 + 1) / 4] = round(21 / 4) = round(5.25) ~ 5th obs; Q1 age = 28
Q2 observation = 10.5 (median); Q2 age = 30
Q3 observation = round[3(n + 1) / 4] = round[3(20 + 1) / 4] = round(3(21) / 4) = round(15.75) ~ 16th obs; Q3 age = 31

There are 20 values. We already know that the 2nd quartile is the median, therefore Q2 = 30.

To find the 1st quartile, compute (n + 1) / 4 and round to the nearest whole number. Here, the 1st quartile observation = round[(20 + 1) / 4] = round(21 / 4) = round(5.25) ~ 5th observation, so Q1 = 28.

To find the 3rd quartile, compute 3(n + 1) / 4 and again round to the nearest whole number. The 3rd quartile observation = round[3(20 + 1) / 4] = round(63 / 4) = round(15.75) ~ 16th observation, so Q3 = 31.
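The rounding rule described above can be sketched in Python as follows. The function name `quartile_by_rounding` is an assumption made for this illustration, and the sketch handles only Q1 and Q3; statistical packages usually interpolate between observations instead of rounding, so their quartiles may differ slightly.

```python
import math

# Ages of the 20 students, in the raw order given in the text.
ages = [27, 30, 28, 31, 28, 36, 29, 37, 29, 34,
        30, 30, 27, 30, 28, 31, 32, 30, 29, 29]


def quartile_by_rounding(values, k):
    """Q1 (k=1) or Q3 (k=3): take the observation nearest to position k(n+1)/4."""
    if k not in (1, 3):
        raise ValueError("this sketch handles only Q1 and Q3")
    ordered = sorted(values)
    pos = k * (len(ordered) + 1) / 4        # e.g. 5.25 for Q1 when n = 20
    nearest = math.floor(pos + 0.5)         # round to the nearest whole number
    return ordered[nearest - 1]             # positions are 1-based


print(quartile_by_rounding(ages, 1))   # 28  (5th observation)
print(quartile_by_rounding(ages, 3))   # 31  (16th observation)
```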

Percentiles

Definition: the values of the variable that split the distribution into 100 equal parts

35% of observations are below the 35th percentile

65% of observations are above the 35th percentile

Percentiles are a good way of describing data in general.

Values (Age)   Freq   Percent (Freq/Total)   Cumulative Percent
27             2      10%                    10%
28             3      15%                    25%    <- 25th Percentile
29             4      20%                    45%
30             5      25%                    70%
31             2      10%                    80%
32             1      5%                     85%
34             1      5%                     90%    <- 90th Percentile
36             1      5%                     95%
37             1      5%                     100%
Total          20     100%

We've transformed the Observation/Age table into a table that is easier for looking at percentiles. See that, for example, there are 3 people in the class who are 28. Therefore, when we create the 2nd table, we can put in one row the value (age), the frequency of the value (age 28 has a frequency of 3), the percentage of that value among the total population (a frequency of 3 out of a total population of 20 = 15% of the class is age 28), and the cumulative percent of the population up to that value. For example, 10% of the class is age 27 and 15% of the class is age 28. Therefore, 25% of the class is ages 27 to 28. Another way to state this is to use percentiles.

The percentiles are the values of the variable which split the distribution into 100 equal parts. The cumulative percent is based out of 100%. So in this population, another way to describe age 28 is to say it is the cut-point of the 25th percentile. This implies that 25% of the values are at or below age 28, and 75% of the values are greater than age 28.
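The cumulative percent table can be rebuilt from the raw ages with a short Python sketch. The function names `cumulative_percent_table` and `percentile_cut_point` are assumptions made for this illustration; the cut-point is read off as the first value whose cumulative percent reaches the requested percentile, which mirrors how the table is used in the text.

```python
from collections import Counter

# Ages of the 20 students, in the raw order given in the text.
ages = [27, 30, 28, 31, 28, 36, 29, 37, 29, 34,
        30, 30, 27, 30, 28, 31, 32, 30, 29, 29]


def cumulative_percent_table(values):
    """Rows of (value, frequency, percent, cumulative percent)."""
    freq = Counter(values)
    n = len(values)
    rows, running = [], 0
    for value in sorted(freq):
        running += freq[value]
        rows.append((value, freq[value], 100 * freq[value] / n, 100 * running / n))
    return rows


def percentile_cut_point(values, pct):
    """First value whose cumulative percent reaches pct."""
    for value, _, _, cumulative in cumulative_percent_table(values):
        if cumulative >= pct:
            return value


for row in cumulative_percent_table(ages):
    print(row)

print(percentile_cut_point(ages, 25))   # 28
print(percentile_cut_point(ages, 90))   # 34
```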

How would you describe the 90th percentile? 34 is the cut-point of the 90th percentile. Therefore 90% of the class is age 34 or younger, and only 10% of the class is older than age 34.

Arithmetic Mean

Arithmetic mean = average value

Now let's move on to the arithmetic mean. The arithmetic mean is what is commonly called the average.

Method for identification
1. Sum up all of the values
2. Divide the sum by the number of observations (n)

For the 20 ordered ages:

n = 20
$\sum x_i = 605$

$$\bar{x} = \frac{\sum x_i}{n} = \frac{605}{20} = 30.25$$

The Arithmetic Mean (Mean) is the measure of central location calculated by dividing the sum of the values in a dataset by the number of values in the dataset. It is often used interchangeably with the word average.

The mean best reflects the typical value of a dataset when there are few outliers and/or the dataset is generally symmetrical.

Here there are 20 values: n = 20
The sum of the 20 values is equal to 605
The mean is the value found by dividing: 605 / 20 = 30.25
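As a quick check of this arithmetic, the same calculation can be sketched in Python, once from the raw ages and once from the frequency distribution. The variable names are assumptions made for this illustration.

```python
from collections import Counter

# Ages of the 20 students, in the raw order given in the text.
ages = [27, 30, 28, 31, 28, 36, 29, 37, 29, 34,
        30, 30, 27, 30, 28, 31, 32, 30, 29, 29]

n = len(ages)            # number of observations
total = sum(ages)        # 1. sum up all of the values
mean = total / n         # 2. divide the sum by n

# The same mean from the frequency distribution: sum(value * frequency) / n.
freq = Counter(ages)
mean_from_freq = sum(age * count for age, count in freq.items()) / n

print(n, total, mean)    # 20 605 30.25
print(mean_from_freq)    # 30.25
```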

Finding the Mean Length of Stay Data

0, 2, 3, 4, 5, 5, 6, 7, 8, 9, 9, 9, 10, 10, 10, 10, 10, 11, 12, 12, 12, 13, 14, 16, 18, 18, 19, 22, 27, 49

Sum = 360
n = 30
Mean = 360 / 30 = 12

Centering Property of Mean

Mean = 12

 0 - 12 = -12      9 - 12 = -3     12 - 12 =  0
 2 - 12 = -10      9 - 12 = -3     13 - 12 =  1
 3 - 12 =  -9     10 - 12 = -2     14 - 12 =  2
 4 - 12 =  -8     10 - 12 = -2     16 - 12 =  4
 5 - 12 =  -7     10 - 12 = -2     18 - 12 =  6
 5 - 12 =  -7     10 - 12 = -2     18 - 12 =  6
 6 - 12 =  -6     10 - 12 = -2     19 - 12 =  7
 7 - 12 =  -5     11 - 12 = -1     22 - 12 = 10
 8 - 12 =  -4     12 - 12 =  0     27 - 12 = 15
 9 - 12 =  -3     12 - 12 =  0     49 - 12 = 37

Column sums: -71, -17, 88; total = 0

An interesting property of the mean is that it is the balance point of the distribution. If you calculate the difference between each value and the mean, and sum up those differences, you get a total of zero! That does not happen with any other value.

Mean Uses All Data, So Sensitive to Outliers

[Figure: two histograms of nights of stay, the top one with Mean = 12.0 and the bottom one with Mean = 15.3]

Look at the top distribution. This is the distribution of hospital stays for the 30 people in the study. The mean equals 12. But, as we have discussed before, what if the person who stayed 49 days had actually stayed 149 days? The mean increases dramatically, to 15.3. So the mean IS sensitive to outliers, even to a few extreme values.
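Both properties, the deviations summing to zero and the sensitivity to one extreme value, can be checked with a few lines of Python. This is a sketch; the variable names are assumptions made for this illustration.

```python
# Nights of stay for the 30 hospitalized patients, copied from the text.
nights = [0, 2, 3, 4, 5, 5, 6, 7, 8, 9, 9, 9, 10, 10, 10, 10, 10, 11, 12, 12,
          12, 13, 14, 16, 18, 18, 19, 22, 27, 49]

mean = sum(nights) / len(nights)                 # 360 / 30 = 12.0

# Centering property: deviations from the mean sum to zero...
deviations_from_mean = sum(x - mean for x in nights)
# ...but deviations from another value (here the median, 10) do not.
deviations_from_median = sum(x - 10 for x in nights)

# Sensitivity to outliers: replace the 49-night stay with 149 nights.
with_outlier = nights[:-1] + [149]
mean_with_outlier = sum(with_outlier) / len(with_outlier)

print(mean, deviations_from_mean, deviations_from_median)   # 12.0 0.0 60
print(round(mean_with_outlier, 1))                           # 15.3
```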

Tell the class: Now I want you to flip over your notes so you can't see the next slide.

When to use the arithmetic mean?

Centered distribution
Approximately symmetrical
Few extreme values (outliers)
OK!

Arithmetic Mean Properties / Uses

So let's summarize what you should know about the arithmetic mean. The mean is:
Probably the best known measure of central location.
It uses all of the data,
So it is affected by extreme values (outliers).
Consequently, the mean is best when you have normally distributed (bell-shaped curve) data.
Not usually equal to one of the original values (but who cares?)
Good statistical properties, so many statistical tests and other techniques are based on the mean.

For each variable, find the: Sum, Mean, Median, Mode, Minimum value, Maximum value

Var A   Var B   Var C
0       0       0
0       4       1
1       4       2
1       4       3
1       5       4
5       5       5
9       5       6
9       6       7
9       6       8
10      6       9
10      10      10

Ask class for answers.

           Var A   Var B     Var C
Sum:       55      55        55
Mean:      5       5         5
Median:    5       5         5
Mode:      1, 9    4, 5, 6   none
Min:       0       0         0
Max:       10      10        10

Comparison of Mode, Median and Mean

Symmetrical (single peak): Mode = Median = Mean

Skewed right: Mode < Median < Mean

Skewed left: Mean < Median < Mode

Measures of Central Location Summary

Measure of Central Location: single measure that represents an entire distribution
Mode: most common value
Median: central value
Arithmetic mean: average value
Mean uses all data, so sensitive to outliers
Mean has best statistical properties
Mean preferred for normally distributed data
Median preferred for skewed data

[Figure: distributions with the same center but different dispersions]

Measures of Spread

Definition: Measures that quantify the variation or dispersion of a set of data from its central location

Also known as:
Measure of dispersion
Measure of variation

Common measures
Range
Interquartile range
Variance / standard deviation
Standard error
95% confidence interval

Range

Definition: difference between largest and smallest values

Properties / Uses
Greatly affected by outliers
Usually used with median

Finding the Range of Length of Stay Data

0, 2, 3, 4, 5, 5, 6, 7, 8, 9, 9, 9, 10, 10, 10, 10, 10, 11, 12, 12, 12, 13, 14, 16, 18, 18, 19, 22, 27, 49

Range Sensitive to Outliers?

Range = 49 - 0 = 49
Range = 149 - 0 = 149
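A minimal Python sketch makes the sensitivity of the range concrete: because the range uses only the two most extreme observations, changing the longest stay from 49 to 149 nights changes it directly. The function name `value_range` is an assumption made for this illustration.

```python
# Nights of stay for the 30 hospitalized patients, copied from the text.
nights = [0, 2, 3, 4, 5, 5, 6, 7, 8, 9, 9, 9, 10, 10, 10, 10, 10, 11, 12, 12,
          12, 13, 14, 16, 18, 18, 19, 22, 27, 49]


def value_range(values):
    """Range = largest value minus smallest value."""
    return max(values) - min(values)


print(value_range(nights))                 # 49
print(value_range(nights[:-1] + [149]))    # 149
```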

[Figure: histogram of nights of stay]

Interquartile Range

Definition: the central 50% of a distribution

Properties / Uses
Used with median
Five-number summary for box-and-whiskers diagram:
Maximum (100%, largest value)
Third quartile (75%)
Median (50%)
First quartile (25%)
Minimum (0%, smallest value)

Interquartile Range: Length of Stay Data

0, 2, 3, 4, 5, 5, 6, 7, 8, 9, 9, 9, 10, 10, 10, [M] 10, 10, 11, 12, 12, 12, 13, 14, 16, 18, 18, 19, 22, 27, 49

Q1 = 25th percentile @ position (30 + 1) / 4 = 7.75; value 6
Median = 50th percentile @ position 15.5; value 10
Q3 = 75th percentile @ position 3 (30 + 1) / 4 = 23.25; value 14

[Figure: Box-and-Whiskers Diagram, Length of Stay Data]

[Figure: Box-and-Whiskers Diagrams, Variables A, B, C]

Variance and Standard Deviation

Definition: measures of variation that quantify how closely clustered the observed values are to the mean

Variance
= average of squared deviations from the mean
= Sum (x - mean)^2 / (n - 1)

Standard deviation
= square root of variance

Equations for Variance and Standard Deviation

$\bar{x}$: mean; $x_i$: value; $n$: number of observations; $s^2$: variance; SD: standard deviation

$$s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1}$$

$$\mathrm{SD} = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n - 1}}$$

Steps to Calculate Variance and Standard Deviation

1. Calculate the arithmetic mean, $\bar{x}$.
2. Subtract the mean from each observation: $x_i - \bar{x}$.
3. Square the difference: $(x_i - \bar{x})^2$.
4. Sum the squared differences: $\sum (x_i - \bar{x})^2$.
5. Divide the sum of the squared differences by n - 1 to obtain the variance, $s^2$.
6. Take the square root of the variance: $\mathrm{SD} = \sqrt{s^2}$.

Length of Stay Data

(0 - 12)^2 = 144     (9 - 12)^2 = 9      (12 - 12)^2 = 0
(2 - 12)^2 = 100     (9 - 12)^2 = 9      (13 - 12)^2 = 1
(3 - 12)^2 = 81      (10 - 12)^2 = 4     (14 - 12)^2 = 4
(4 - 12)^2 = 64      (10 - 12)^2 = 4     (16 - 12)^2 = 16
(5 - 12)^2 = 49      (10 - 12)^2 = 4     (18 - 12)^2 = 36
(5 - 12)^2 = 49      (10 - 12)^2 = 4     (18 - 12)^2 = 36
(6 - 12)^2 = 36      (10 - 12)^2 = 4     (19 - 12)^2 = 49
(7 - 12)^2 = 25      (11 - 12)^2 = 1     (22 - 12)^2 = 100
(8 - 12)^2 = 16      (12 - 12)^2 = 0     (27 - 12)^2 = 225
(9 - 12)^2 = 9       (12 - 12)^2 = 0     (49 - 12)^2 = 1369

Sum = 2448; Var = 2448 / 29 = 84.4; SD = sqrt(84.4) = 9.2
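The calculation steps can be followed one at a time in a short Python sketch, with the result compared against the sample variance from Python's standard `statistics` module. The intermediate variable names are assumptions made for this illustration.

```python
import math
import statistics

# Nights of stay for the 30 hospitalized patients, copied from the text.
nights = [0, 2, 3, 4, 5, 5, 6, 7, 8, 9, 9, 9, 10, 10, 10, 10, 10, 11, 12, 12,
          12, 13, 14, 16, 18, 18, 19, 22, 27, 49]

n = len(nights)
mean = sum(nights) / n                           # 1. arithmetic mean (12.0)
deviations = [x - mean for x in nights]          # 2. subtract the mean
squared = [d ** 2 for d in deviations]           # 3. square each difference
sum_of_squares = sum(squared)                    # 4. sum the squares (2448.0)
variance = sum_of_squares / (n - 1)              # 5. divide by n - 1
sd = math.sqrt(variance)                         # 6. square root of the variance

print(sum_of_squares, round(variance, 1), round(sd, 1))   # 2448.0 84.4 9.2

# statistics.variance also divides by n - 1, so it should agree.
print(math.isclose(variance, statistics.variance(nights)))   # True
```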

Standard Deviation Properties / Uses

The standard deviation is usually calculated only when the data are more or less normally distributed (bell-shaped curve). (The length of stay data are skewed to the right, so perhaps that was not the best example to use.)

For normally distributed data (i.e., bell-shaped curve),
68.3% of the data fall within plus/minus 1 SD
95.5% of the data fall within plus/minus 2 SD
95.0% of the data fall within plus/minus 1.96 SD
99.7% of the data fall within plus/minus 3 SD

Normal Distribution

[Figure: Normal Distribution, showing the mean, the standard deviation, the central 68% and 95% of the area, and 2.5% in each tail]

As illustrated here, the mean is at the center, and 1, 2, and 3 standard deviations are marked on the x axis. For normally distributed data, approximately two-thirds (68.3%, to be exact) of the data fall within one standard deviation on either side of the mean; 95.5% of the data fall within two standard deviations of the mean; and 99.7% of the data fall within three standard deviations. Almost exactly 95.0% of the data fall within 1.96 standard deviations of the mean.
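These percentages can be reproduced from the standard normal distribution. The sketch below uses `NormalDist` from Python's standard `statistics` module (available from Python 3.8); the helper name `coverage` is an assumption made for this illustration.

```python
from statistics import NormalDist

standard_normal = NormalDist()   # mean 0, standard deviation 1


def coverage(k):
    """Proportion of a normal distribution within plus/minus k SD of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)


for k in (1, 1.96, 2, 3):
    print(k, round(100 * coverage(k), 1))
```

The proportion outside plus/minus 1.96 SD is split equally between the two tails, which is where the 2.5% in each tail comes from.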

This is important to us in epidemiology because we can compare any one individual's characteristics (I can compare my height to the class height; I can compare my blood pressure to the mean value of the nation) or any group's characteristic (a rayon's reported cases this week to the nation's) with a population's values and determine how dispersed they are from the mean. The standard deviation can be used to set limits as to what is "normal" (normal birth weight, normal temperature, normal number of reported cases), and then to detect when something is not normal, i.e., rare (more than 2 or 3 standard deviations from the mean).

We will use the mean and standard deviation to help us determine what is the normal number of reported cases and what is abnormal when we cover surveillance analysis.

Match the Measures of Central Location & Spread

Mode
Median
Arithmetic mean
Standard deviation
Range
Interquartile range

Name the Appropriate Measures of Central Location and Spread

Distribution                    Central Location   Spread
Single peak, symmetrical        Mean*              Standard deviation
Skewed or data with outliers    Median             Range or interquartile range

* Median and mode will be similar

The choice of measures of central location and spread will depend in large part on the nature of the distribution of the observations.

For continuous variables with a single-peaked and symmetric distribution, the mean, median, and mode will be similar or identical. The mean is usually preferred, and it is paired with the standard deviation.

For data that are skewed or have a few extreme values, the median is the measure of choice because it is insensitive to extreme values. For descriptive purposes, the median is often paired with the range. For comparison purposes, the interquartile range may be used.

Properties of Measures of Central Location & Spread

For quantitative / continuous variables:
Mode: simple, descriptive, not always useful
Median: best for skewed data
Arithmetic mean: best for normally distributed data
Range: use with median
Standard deviation: use with mean
Standard error: used to construct confidence intervals

[Figure: population distribution by age, marking the minimum, 1st quartile, median, mode, 3rd quartile, maximum, range and interquartile interval]

Measures of Shapes

THE NORMAL DISTRIBUTION

Many variables have a normal distribution. This is a bell-shaped curve with most of the values clustered near the mean and a few values out near the tails.

MEASURES OF VARIATION

Range: the difference in value between the highest (maximum) and the lowest (minimum) observation.

Variance: the sum of the squares of the deviations about the sample mean, divided by one less than the total number of items.

Standard deviation: the square root of the variance.

The normal distribution is symmetrical around the mean. The mean, the median and the mode of a normal distribution have the same value. An important characteristic of a normally distributed variable is that 95% of the measurements have values which are approximately within 2 standard deviations (SD) of the mean.

ESTIMATIONS

The basic problems to which statistics are applied in practice arise when trying to deduce something about a population from the evidence provided by a sample of observations taken from that population.

HOW TO DETERMINE THE EXTENT TO WHICH THE SAMPLE REPRESENTS THE POPULATION AS A WHOLE.

To find out to what extent a particular sample value deviates from the population value, a range or an interval around the sample value can be worked out which will most probably contain the population value.

This range or interval is called the CONFIDENCE INTERVAL.

The calculation of a confidence interval takes into account the STANDARD ERROR. The standard error gives an estimate of the degree to which the sample mean varies from the population mean. It is computed on the basis of the standard deviation. The standard error of the mean is calculated by dividing the standard deviation by the square root of the sample size:

$$\mathrm{SE} = \frac{\mathrm{SD}}{\sqrt{n}}$$

95% CONFIDENCE INTERVAL

When describing variables statistically, you usually present the calculated sample mean plus or minus 1.96 times the SE:

$$\bar{x} \pm 1.96 \times \mathrm{SE}$$

This is then called the 95% CONFIDENCE INTERVAL. It means that if the sampling were repeated many times, about 95% of the intervals calculated this way would contain the population mean.

Note that the larger the sample size, the smaller the standard error and the narrower the confidence interval will be. Thus the advantage of having a large sample size is that the sample mean will be a better estimate of the population mean.
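As a sketch of this calculation, the standard error and 95% confidence interval can be computed for the ages of the 20 students used earlier in this document. Treating the class as a sample and using the 1.96 multiplier follow the text; with a sample this small, a slightly larger multiplier from the t distribution is often used in practice. The helper name `standard_error` and the comparison sample size of 80 are assumptions made for this illustration.

```python
import math
import statistics

# Ages of the 20 students, in the raw order given earlier in the text.
ages = [27, 30, 28, 31, 28, 36, 29, 37, 29, 34,
        30, 30, 27, 30, 28, 31, 32, 30, 29, 29]


def standard_error(sd, n):
    """Standard error of the mean: SD divided by the square root of n."""
    return sd / math.sqrt(n)


n = len(ages)
mean = statistics.mean(ages)      # sample mean
sd = statistics.stdev(ages)       # sample SD (divides by n - 1)
se = standard_error(sd, n)

lower = mean - 1.96 * se          # 95% confidence interval: mean +/- 1.96 x SE
upper = mean + 1.96 * se

print(round(mean, 2), round(sd, 2), round(se, 2))
print(round(lower, 2), round(upper, 2))

# Hypothetical comparison: the same SD with four times the sample size
# gives half the standard error, hence a narrower interval.
print(round(standard_error(sd, 4 * n), 2))
```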

If the sample size is large, small differences can be statistically significant, whereas with a small sample size even a large difference may not achieve statistical significance. This leads us to calculating confidence intervals.

The population parameters do not change and remain constant, whereas the sample estimates can change and take any random value.

                          Population    Sample
                          parameters    estimates
Mean                      μ             x̄
Standard deviation        σ             SD
Proportion                π             p
Correlation coefficient   ρ             r

