STAT 211 – 019 Dan Piett

West Virginia University

Lecture 2

Last Lecture
Population/Sample
Variable Types
  Discrete/Continuous Numeric & Ranked/Unranked Categorical
Displaying Small Sets of Numbers
  Dot Plots, Stem and Leaf, Pie Charts
Histograms
  Frequency/Density and Symmetric vs Right/Left Skewed
Measures of Center
  Mean/Median

Overview
2.3 Measures of Dispersion
2.5 Boxplots
3.1 Scatterplots
3.2 Correlation
3.3 Regression

Section 2.3

Measures of Dispersion

Descriptive Statistics
Describing the Data
How do we describe data?
  Graphs (Last Class)
  Measures
    Center (Last Class): Mean/Median
    Dispersion/Spread (This Class): Variance, Standard Deviation, IQR

Spread of Data
Example: Spread
Data 1: 8, 8, 9, 9, 10, 11, 11, 12, 12
Data 2: -30, -20, -10, 0, 10, 20, 30, 40, 50
Data 1 – Mean = Median = 10
Data 2 – Mean = Median = 10

Both have the same measure of center but how do they differ?

Data 2 is much more spread out.

Sample Standard Deviation
Sample Standard Deviation (S) is a measure of how spread out the data is
S can be any number >= 0
Larger S indicates a larger spread
The unit associated with S is the same unit as the variable
  Example: Mean of 110 lb, Standard Deviation 10 lb
The square of the sample standard deviation is called the sample variance
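The slides do not write out the formula for S. For reference, the usual textbook definition is sketched below; the symbols n (sample size), x_i (the i-th data value), and \bar{x} (the sample mean) are notation introduced here rather than taken from the slides.

```latex
S^2 = \frac{1}{n-1}\sum_{i=1}^{n} \left(x_i - \bar{x}\right)^2,
\qquad
S = \sqrt{S^2}
```

The squared deviations are in squared units, so taking the square root brings S back to the same unit as the variable.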

Standard Deviation Example
Data 1 (8, 8, 9, 9, 10, 11, 11, 12, 12)
  S = 1.58
Data 2 (-30, -20, -10, 0, 10, 20, 30, 40, 50)
  S = 27.39
As you can see, the standard deviation of Data 2 is much larger than that of Data 1.
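The two values of S can be checked with a short Python sketch. It assumes Python's standard `statistics` module, whose `stdev` function computes the sample standard deviation; the variable names are chosen here for illustration.

```python
import statistics

data_1 = [8, 8, 9, 9, 10, 11, 11, 12, 12]
data_2 = [-30, -20, -10, 0, 10, 20, 30, 40, 50]

# statistics.stdev uses the sample formula (divides by n - 1).
s_1 = statistics.stdev(data_1)
s_2 = statistics.stdev(data_2)

for name, data, s in [("Data 1", data_1, s_1), ("Data 2", data_2, s_2)]:
    print(f"{name}: mean={statistics.mean(data)}, "
          f"median={statistics.median(data)}, S={s:.2f}")
```

Both data sets report a mean and median of 10, while S rounds to 1.58 for Data 1 and 27.39 for Data 2.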

Population Variance/Standard Deviation
Much like the sample mean (xbar) estimates the population mean (mu), the sample standard deviation (s) can be used to estimate the true population standard deviation (sigma). Likewise, the sample variance (s^2) estimates the population variance (sigma^2).

Linear Transformations and Changes of Scale
By adding or subtracting a constant to every value in a data set
  The mean is increased/decreased by the same amount
  The median is increased/decreased by the same amount
  The standard deviation is unchanged
By multiplying each value by a constant
  The mean is multiplied by the same amount
  The median is multiplied by the same amount
  The standard deviation is multiplied by the same amount (by its absolute value if the constant is negative, since S cannot be negative)
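These rules can be checked numerically. The sketch below reuses Data 1 from earlier in the lecture; the constants 5 and 3 are arbitrary choices made here for illustration, and `summarize` is a hypothetical helper name.

```python
import statistics

data = [8, 8, 9, 9, 10, 11, 11, 12, 12]  # Data 1 from the slides

def summarize(values):
    """Return (mean, median, sample standard deviation)."""
    return (statistics.mean(values),
            statistics.median(values),
            statistics.stdev(values))

shifted = [x + 5 for x in data]  # add a constant to every value
scaled = [x * 3 for x in data]   # multiply every value by a constant

base = summarize(data)
print("original:", base)
print("shifted: ", summarize(shifted))  # mean and median move by 5, S unchanged
print("scaled:  ", summarize(scaled))   # mean, median, and S all tripled
```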

Section 2.5

Boxplots

Quartiles
Quartiles are numbers which partition the data into 4 subgroups (i.e., 4 quarters in a dollar)
Q1
  The value separating the lowest 25% of the data values
Q2, a.k.a. the Median
  The value separating the lowest 50% of the data values
Q3
  The value separating the lowest 75% of the data values
Q4, a.k.a. the Maximum
  The largest data value

Quartiles Example
You can think of Q1 as the median of the bottom half of the data and Q3 as the median of the top half of the data

Interquartile Range (IQR)
The IQR is another measure of spread, much like S.
A larger IQR indicates more spread-out data
IQR is calculated as Q3 - Q1
Example
Data 1 (8, 8, 9, 9, 10, 11, 11, 12, 12)
  IQR = 11.5 - 8.5 = 3
Data 2 (-30, -20, -10, 0, 10, 20, 30, 40, 50)
  IQR = 35 - (-15) = 50
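The quartiles behind these IQR values can be reproduced with a small sketch of the "median of each half" rule. It assumes that, when the number of values is odd, the middle value is left out of both halves; that convention matches the numbers on the slide, but other textbooks and software use slightly different rules. The function names are chosen here for illustration.

```python
import statistics

def quartiles(values):
    """Return (Q1, Q2, Q3) using the median-of-halves rule.

    Assumption: with an odd number of values, the middle value is
    excluded from both halves. Needs at least two values.
    """
    ordered = sorted(values)
    half = len(ordered) // 2
    lower = ordered[:half]                   # bottom half
    upper = ordered[len(ordered) - half:]    # top half
    return (statistics.median(lower),
            statistics.median(ordered),
            statistics.median(upper))

def iqr(values):
    """Interquartile range: Q3 - Q1."""
    q1, _, q3 = quartiles(values)
    return q3 - q1

data_1 = [8, 8, 9, 9, 10, 11, 11, 12, 12]
data_2 = [-30, -20, -10, 0, 10, 20, 30, 40, 50]

print(quartiles(data_1), iqr(data_1))
print(quartiles(data_2), iqr(data_2))
```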

Boxplots
Boxplots are a graphical representation of the quartiles.

Using IQR to Find Potential Outliers
One method to find potential outliers is as follows:
1. Find the IQR
2. Add 1.5*IQR to Q3
   Anything larger than this value can be flagged as a potential outlier
3. Likewise, subtract 1.5*IQR from Q1
   Anything smaller than this value can be flagged as a potential outlier

Example
Data 1 (8, 8, 9, 9, 10, 11, 11, 12, 12)
Data 2 (-30, -20, -10, 0, 10, 20, 30, 40, 50)
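The three steps can be sketched in Python as follows. The quartile values passed in are the ones from the IQR slide (Q1 = 8.5, Q3 = 11.5 for Data 1; Q1 = -15, Q3 = 35 for Data 2); the function names, and the extra value 20 used in the last line, are hypothetical and only there for illustration.

```python
def outlier_fences(q1, q3, k=1.5):
    """Return (lower fence, upper fence) = (Q1 - k*IQR, Q3 + k*IQR)."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def flag_potential_outliers(values, q1, q3):
    """List the values that fall outside the two fences."""
    low, high = outlier_fences(q1, q3)
    return [v for v in values if v < low or v > high]

data_1 = [8, 8, 9, 9, 10, 11, 11, 12, 12]
data_2 = [-30, -20, -10, 0, 10, 20, 30, 40, 50]

print(outlier_fences(8.5, 11.5), flag_potential_outliers(data_1, 8.5, 11.5))
print(outlier_fences(-15, 35), flag_potential_outliers(data_2, -15, 35))

# Hypothetical: a made-up value of 20 checked against Data 1's fences.
print(flag_potential_outliers(data_1 + [20], 8.5, 11.5))
```

With these quartiles the fences are 4 and 16 for Data 1 and -90 and 110 for Data 2, so neither data set contains a potential outlier under this rule. The made-up value 20 lies above Data 1's upper fence and is flagged.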

Section 3.1

Scatterplots

Bivariate Data
Bivariate data is data consisting of two variables from the same individual
Examples
  Height and Weight
  Classes skipped and GPA
Graphed using a scatterplot

Scatterplot Example

Section 3.2

Correlation

Pearson Correlation Coefficient
We have discussed ways to describe data of one variable. This section will discuss how to describe two variables on the same individual together.

The correlation coefficient, r, is a measure of the strength of a linear (straight line) relationship between bivariate data. (You will not need to know the formula for r)

To say two variables are correlated is to say that an increase/decrease in one corresponds to an increase/decrease in the other.

More on r
r can take on values between -1 and 1
The strength of the correlation depends on how close you are to the extreme values of -1 or 1
  r = -.78 is a stronger correlation than r = .50
There are three types of correlation
  Positive
  Negative
  No Correlation

Positive Correlation
Positive correlation exists when r is between 0 and 1.
The closer r is to 1, the stronger the relationship
This implies that as one of the variables increases, the other one tends to increase as well.
Examples:
  Height and Weight, Temperature and Ice Cream Sales

Negative Correlation
Negative correlation exists when r is between -1 and 0.
The closer r is to -1, the stronger the relationship
This implies that as one of the variables increases, the other one tends to decrease.
Example:
  Temperature and Hot Chocolate Sales

No Correlation
No correlation exists when r is approximately 0
This implies that as one of the variables increases, the other one shows no consistent straight-line increase or decrease
Example:
  Temperature and Cookie Sales
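The three cases can be illustrated with a small Python sketch. The lecture does not require the formula for r, so the function below is included only so the example runs; it uses the standard Pearson formula. All of the numbers are invented for illustration and are not real sales data.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Invented data: daily temperature and three kinds of sales.
temperature = [60, 65, 70, 75, 80]
ice_cream = [20, 24, 31, 33, 42]       # rises with temperature
hot_chocolate = [40, 33, 30, 22, 15]   # falls with temperature
cookies = [30, 28, 32, 28, 30]         # no straight-line trend

r_positive = pearson_r(temperature, ice_cream)
r_negative = pearson_r(temperature, hot_chocolate)
r_none = pearson_r(temperature, cookies)

print(f"ice cream:     r = {r_positive:.2f}")
print(f"hot chocolate: r = {r_negative:.2f}")
print(f"cookies:       r = {r_none:.2f}")
```

For these made-up numbers r is close to 1 for ice cream, close to -1 for hot chocolate, and essentially 0 for cookies.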

Interpretation of r
Although we may find that two variables are correlated, this does not mean that there is necessarily a causal relationship.

Example:
  High school teachers who are paid less tend to have students who do better on the SATs than teachers who are paid more. It has been found that there is a negative correlation between teacher salary and students' SAT scores. A tempting conclusion: "Therefore we should pay our teachers less so students score higher."

Clearly this is not a causal relationship. There is likely a third variable that explains this. One possibility may be the age of the teacher.

Section 3.3

Regression

Regression Intro
So we have decided that two variables are correlated. We are now going to use the value of one of the variables, “x”, to predict the value of the other variable, “y”.
Example:
  Use height (x) to predict weight (y)
  Use temperature (x) to predict ice cream sales (y)

Regression Equation

Calculating a Regression Equation Given the slope and intercept
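The content of this slide did not survive in the text, so the sketch below only illustrates the general idea of predicting y from x once a slope and intercept are known, using the usual straight-line form: predicted y = intercept + slope * x. The intercept of -100 and slope of 3 are hypothetical values invented for a height (inches) to weight (lb) illustration; they are not the lecture's numbers.

```python
def predict(x, intercept, slope):
    """Predicted y-value on the line y = intercept + slope * x."""
    return intercept + slope * x

# Hypothetical coefficients, for illustration only.
intercept = -100.0  # predicted y when x = 0
slope = 3.0         # change in predicted y per one-unit increase in x

for height in (60, 65, 70):
    print(f"height = {height} in -> predicted weight = "
          f"{predict(height, intercept, slope):.1f} lb")
```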

Plotting a Regression Line

Notes on Regression Lines

Residuals
A residual is the signed vertical distance between a point (observed y-value) and the regression line (predicted y-value)
Formula: Residual = Observed Value - Predicted Value
Using the Cholesterol Example:
  For TV Hours = 3, our predicted value was 212.2
  The actual value on the graph is 220.
  The residual for this particular point is 220 - 212.2 = 7.8
A residual may be positive or negative
The interpretation is that the observed y-value is 7.8 units larger than the predicted y-value for TV Hours = 3
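The residual calculation is simple enough to sketch in a few lines of Python. The observed and predicted values are the ones quoted above for TV Hours = 3; the function name is chosen here for illustration.

```python
def residual(observed, predicted):
    """Residual = observed y-value minus predicted y-value."""
    return observed - predicted

r = residual(observed=220, predicted=212.2)
direction = "above" if r > 0 else "below"
print(f"residual = {r:.1f} (the observed point is {direction} the line)")
```

A positive residual means the point sits above the regression line, and a negative residual means it sits below.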