STAT 211 – 019 Dan Piett
West Virginia University
Lecture 2
Last Lecture
- Population/Sample
- Variable Types
  - Discrete/Continuous Numeric & Ranked/Unranked Categorical
- Displaying Small Sets of Numbers
  - Dot Plots, Stem and Leaf, Pie Charts
- Histograms
  - Frequency/Density and Symmetric vs Right/Left Skewed
- Measures of Center
  - Mean/Median
Overview
- 2.3 Measures of Dispersion
- 2.5 Boxplots
- 3.1 Scatterplots
- 3.2 Correlation
- 3.3 Regression
Section 2.3
Measures of Dispersion
Descriptive Statistics
- Describing the Data
- How do we describe data?
  - Graphs (Last Class)
  - Measures
    - Center (Last Class): Mean/Median
    - Dispersion/Spread (This Class): Variance, Standard Deviation, IQR
Spread of Data
Example: Spread
- Data 1: 8, 8, 9, 9, 10, 11, 11, 12, 12
- Data 2: -30, -20, -10, 0, 10, 20, 30, 40, 50
- Data 1: Mean = Median = 10
- Data 2: Mean = Median = 10
Both have the same measure of center, but how do they differ?
Data 2 is much more spread out.
Sample Standard Deviation
- The sample standard deviation (S) is a measure of how spread out the data is
- S can be any number >= 0
- A larger S indicates a larger spread
- The unit associated with S is the same as the unit of the variable
  - Example: Mean of 110 lb, Standard Deviation of 10 lb
- The square of the sample standard deviation is called the sample variance
Standard Deviation Example
- Data 1 (8, 8, 9, 9, 10, 11, 11, 12, 12): S = 1.58
- Data 2 (-30, -20, -10, 0, 10, 20, 30, 40, 50): S = 27.39
As you can see, the standard deviation of Data 2 is much larger than that of Data 1.
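The values above can be checked with a short sketch. It uses Python's standard `statistics` module, whose `stdev` and `variance` functions divide by n - 1 (the sample versions). The helper name `summarize` is made up for this illustration.

```python
import statistics

data_1 = [8, 8, 9, 9, 10, 11, 11, 12, 12]
data_2 = [-30, -20, -10, 0, 10, 20, 30, 40, 50]

def summarize(values):
    """Return (mean, median, sample standard deviation, sample variance)."""
    return (
        statistics.mean(values),
        statistics.median(values),
        statistics.stdev(values),     # sample standard deviation (divides by n - 1)
        statistics.variance(values),  # sample variance, the square of S
    )

for name, values in (("Data 1", data_1), ("Data 2", data_2)):
    mean, median, s, s_squared = summarize(values)
    print(f"{name}: mean={mean}, median={median}, S={s:.2f}, S^2={s_squared:.2f}")
```

Both data sets share the same center, so only S and the variance show the difference in spread.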
Population Variance/Standard Deviation
- Much like the sample mean (xbar) estimates the population mean (mu), the sample standard deviation (s) can be used to estimate the true population standard deviation (sigma)
- Likewise, the sample variance (s^2) estimates the population variance (sigma^2)
Linear Transformations and Changes of Scale
- By adding or subtracting a constant to every value in a data set
  - The mean is increased/decreased by the same amount
  - The median is increased/decreased by the same amount
  - The standard deviation is unchanged
- By multiplying each value by a constant
  - The mean is multiplied by the same amount
  - The median is multiplied by the same amount
  - The standard deviation is multiplied by the absolute value of that amount (a standard deviation is never negative)
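A quick way to see these rules is to transform a data set and recompute the summaries. The sketch below reuses Data 1 from the earlier example. The constants 5, 3, and -2 are arbitrary choices for illustration and do not come from the lecture.

```python
import math
import statistics

data = [8, 8, 9, 9, 10, 11, 11, 12, 12]

def summary(values):
    """Return (mean, median, sample standard deviation)."""
    return statistics.mean(values), statistics.median(values), statistics.stdev(values)

shifted = [x + 5 for x in data]   # add a constant to every value
scaled = [x * 3 for x in data]    # multiply every value by a positive constant
negated = [x * -2 for x in data]  # multiply every value by a negative constant

base_mean, base_median, base_s = summary(data)
for label, values in (("original", data), ("+5", shifted), ("*3", scaled), ("*-2", negated)):
    mean, median, s = summary(values)
    print(f"{label:>8}: mean={mean}, median={median}, S={s:.2f}")

# Adding 5 moves the center but leaves the spread alone.
assert math.isclose(summary(shifted)[2], base_s)
# Multiplying by -2 flips the sign of the center but S is scaled by |-2| = 2.
assert math.isclose(summary(negated)[2], 2 * base_s)
```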
Section 2.5
Boxplots
Quartiles
Quartiles are numbers which partition the data into 4 subgroups (i.e., 4 quarters in a dollar)
- Q1
  - The value separating the lowest 25% of the data values from the rest
- Q2, aka the Median
  - The value separating the lowest 50% of the data values from the rest
- Q3
  - The value separating the lowest 75% of the data values from the rest
- Q4, aka the Maximum
  - The largest data value
Quartiles Example
- You can think of Q1 as the median of the bottom half of the data and Q3 as the median of the top half of the data
Interquartile Range (IQR)
- The IQR is another measure of spread, much like S
- A larger IQR indicates more spread-out data
- The IQR is calculated as Q3 - Q1
Example
- Data 1 (8, 8, 9, 9, 10, 11, 11, 12, 12): IQR = 11.5 - 8.5 = 3
- Data 2 (-30, -20, -10, 0, 10, 20, 30, 40, 50): IQR = 35 - (-15) = 50
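The IQR values above follow from the "median of each half" idea. The sketch below assumes that, when the number of values is odd, the overall median is left out of both halves. That convention reproduces Q1 = 8.5 and Q3 = 11.5 for Data 1, but it is an assumption here, and statistical software often uses other quartile rules that give slightly different numbers. The function names are made up for this illustration.

```python
import statistics

def quartiles(values):
    """Return (Q1, median, Q3) using the median-of-halves method."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower_half = ordered[:half]
    # Assumption: skip the middle value when n is odd so it is in neither half.
    upper_half = ordered[half + 1:] if n % 2 else ordered[half:]
    return (
        statistics.median(lower_half),
        statistics.median(ordered),
        statistics.median(upper_half),
    )

def iqr(values):
    """Interquartile range: Q3 - Q1."""
    q1, _, q3 = quartiles(values)
    return q3 - q1

data_1 = [8, 8, 9, 9, 10, 11, 11, 12, 12]
data_2 = [-30, -20, -10, 0, 10, 20, 30, 40, 50]

print(quartiles(data_1), iqr(data_1))
print(quartiles(data_2), iqr(data_2))
```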
Boxplots
- Boxplots are a graphical representation of the quartiles.
Using IQR to Find Potential Outliers
One method to find potential outliers is as follows:
1. Find the IQR
2. Add 1.5*IQR to Q3
   - Anything larger than this value can be flagged as a potential outlier
3. Likewise, subtract 1.5*IQR from Q1
   - Anything smaller than this value can be flagged as a potential outlier
Example
- Data 1 (8, 8, 9, 9, 10, 11, 11, 12, 12)
- Data 2 (-30, -20, -10, 0, 10, 20, 30, 40, 50)
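The three steps above can be sketched as follows. The quartiles are computed with the median-of-halves convention (the overall median is excluded when the count is odd), which is an assumption that matches the quartiles used earlier for these data. The function names and the extra value 25 are hypothetical, added only to show a flagged point.

```python
import statistics

def quartile_fences(values, k=1.5):
    """Return (lower_fence, upper_fence) = (Q1 - k*IQR, Q3 + k*IQR)."""
    ordered = sorted(values)
    half = len(ordered) // 2
    lower_half = ordered[:half]
    # Assumption: the median itself is left out of both halves when n is odd.
    upper_half = ordered[half + 1:] if len(ordered) % 2 else ordered[half:]
    q1 = statistics.median(lower_half)
    q3 = statistics.median(upper_half)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def potential_outliers(values, k=1.5):
    """Return the values that fall outside the two fences."""
    low, high = quartile_fences(values, k)
    return [v for v in values if v < low or v > high]

data_1 = [8, 8, 9, 9, 10, 11, 11, 12, 12]
data_2 = [-30, -20, -10, 0, 10, 20, 30, 40, 50]
with_extreme = data_1 + [25]  # hypothetical extra observation

for name, values in (("Data 1", data_1), ("Data 2", data_2), ("Data 1 + 25", with_extreme)):
    print(name, quartile_fences(values), potential_outliers(values))
```

Under this convention neither Data 1 nor Data 2 has a potential outlier, while the added value 25 is flagged.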
Section 3.1
Scatterplots
Bivariate Data
- Bivariate data is data consisting of two variables measured on the same individual
- Examples
  - Height and Weight
  - Classes skipped and GPA
- Graphed using a scatterplot
Scatterplot Example
Section 3.2
Correlation
Pearson Correlation Coefficient
- We have discussed ways to describe data of one variable. This section will discuss how to describe two variables on the same individual together.
- The correlation coefficient, r, is a measure of the strength of a linear (straight line) relationship between bivariate data. (You will not need to know the formula for r)
- To say two variables are correlated is to say that an increase/decrease in one corresponds to an increase/decrease in the other.
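Although the formula for r is not required, a small sketch can show what is being measured. The version below uses the standard textbook definition of Pearson's r. The function name `pearson_r` and the height/weight numbers are invented for illustration and are not data from the lecture.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of paired observations."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equally long lists with at least 2 pairs")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sums of products and squares of deviations from the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("r is undefined when one variable never changes")
    return sxy / math.sqrt(sxx * syy)

# Hypothetical heights (in) and weights (lb) for six individuals.
heights = [60, 62, 65, 68, 70, 72]
weights = [115, 120, 140, 155, 160, 180]

print(round(pearson_r(heights, weights), 3))
```

Points that lie exactly on an upward-sloping line give r = 1, and points exactly on a downward-sloping line give r = -1.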
More on r
- r can take on values between -1 and 1
- The strength of the correlation depends on how close r is to the extreme values of -1 or 1
  - r = -.78 is a stronger correlation than r = .50
- There are three types of correlation
  - Positive
  - Negative
  - No Correlation
Positive Correlation
- Positive correlation exists when r is between 0 and 1.
- The closer r is to 1, the stronger the relationship
- This implies that as one of the variables increases, the other one tends to increase as well.
- Examples:
  - Height and Weight, Temperature and Ice Cream Sales
Negative Correlation
- Negative correlation exists when r is between -1 and 0.
- The closer r is to -1, the stronger the relationship
- This implies that as one of the variables increases, the other one tends to decrease.
- Example:
  - Temperature and Hot Chocolate Sales
No Correlation
- No correlation exists when r is approximately 0
- This implies that as one of the variables increases, the other one shows no consistent linear tendency to increase or decrease
- Example:
  - Temperature and Cookie Sales
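The three types of correlation, and the idea that strength depends on distance from 0 rather than on sign, can be expressed in a few lines. The lecture does not say how close to 0 counts as "approximately 0", so the `tolerance` of 0.1 below is purely an assumption, and both function names are made up for this sketch.

```python
def direction(r, tolerance=0.1):
    """Label r as positive, negative, or no correlation.

    The tolerance is an assumed cutoff, not a value given in the lecture.
    """
    if not -1 <= r <= 1:
        raise ValueError("r must be between -1 and 1")
    if abs(r) <= tolerance:
        return "no correlation"
    return "positive" if r > 0 else "negative"

def stronger(r_a, r_b):
    """Return whichever coefficient is farther from 0 (the stronger correlation)."""
    return r_a if abs(r_a) >= abs(r_b) else r_b

print(direction(0.50))         # positive
print(direction(-0.78))        # negative
print(direction(0.03))         # no correlation
print(stronger(-0.78, 0.50))   # -0.78
```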
Interpretation of r
- Although we may find that two variables are correlated, this does not mean that there is necessarily a causal relationship.
- Example:
  - High school teachers who are paid less tend to have students who do better on the SATs than teachers who are paid more. It has been found that there is a negative correlation between teacher salary and students' SAT scores. A tempting but flawed conclusion: "Therefore we should pay our teachers less so students score higher."
  - Clearly this is not a causal relationship. There is likely a third variable that explains it. One possibility may be the age of the teacher.
Section 3.3
Regression
Regression Intro
- Once we have decided that two variables are correlated, we can use the value of one of the variables, "x", to predict the value of the other variable, "y".
- Example:
  - Use height (x) to predict weight (y)
  - Use temperature (x) to predict ice cream sales (y)
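As a rough sketch of what "using x to predict y" looks like in practice, the code below fits a straight line and uses it for a prediction. It assumes the line is the usual least-squares line, which the surviving slide text does not state explicitly. The temperature and sales figures and the function names are hypothetical.

```python
def fit_line(xs, ys):
    """Return (intercept, slope) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

def predict(intercept, slope, x):
    """Predicted y-value on the line for a given x."""
    return intercept + slope * x

# Hypothetical daily temperatures (F) and ice cream sales (units).
temps = [60, 65, 70, 75, 80, 85]
sales = [200, 240, 270, 310, 350, 380]

intercept, slope = fit_line(temps, sales)
print(f"predicted sales at 72 F: {predict(intercept, slope, 72):.1f}")
```

Predictions are most trustworthy for x-values inside the range of the data used to fit the line.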
Regression Equation
Calculating a Regression Equation Given the slope and intercept
Plotting a Regression Line
Notes on Regression Lines
Residuals
- A residual is the vertical distance between a point (observed y-value) and the regression line (predicted y-value)
- Formula: Residual = Observed Value - Predicted Value
- Using the Cholesterol Example:
  - For TV Hours = 3, our predicted value was 212.2
  - The actual value on the graph is 220
  - The residual for this particular point is 220 - 212.2 = 7.8
- A residual may be positive or negative
- The interpretation is that the observed y-value is 7.8 units larger than the predicted y-value for TV Hours = 3
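The residual calculation is a single subtraction, shown here with the numbers from the cholesterol example. The function name `residual` is made up for this sketch, and the second call uses a hypothetical observed value of 200 to show a negative residual.

```python
def residual(observed, predicted):
    """Residual = observed y-value minus predicted y-value."""
    return observed - predicted

# Cholesterol example: TV Hours = 3, predicted 212.2, observed 220.
above_line = residual(220, 212.2)
print(round(above_line, 1))  # rounded to avoid floating-point noise

# Hypothetical point below the line: a negative residual.
below_line = residual(200, 212.2)
print(round(below_line, 1))
```

A positive residual means the point lies above the regression line, and a negative residual means it lies below.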