chapter 3 describing relationships new.notebook
1
October 31, 2016
A response variable measures an outcome of a study. Response variables are also called dependent variables.
An explanatory variable attempts to explain the observed outcomes. Explanatory variables are also called independent variables.
The response variable depends on the explanatory variable.
Example: We think that car weight helps explain accident deaths.
Explanatory variable: car weight
Response variable: accident death rate
A scatterplot is the most effective way to display the relationship between two quantitative variables measured on the same individuals.
• Values of one variable appear on the horizontal axis and values of the other variable appear on the vertical axis.
• Each individual in the data appears as a point in the graph.
• Always plot the explanatory variable (if there is one) on the horizontal axis (x-axis). If there is no explanatory-response distinction, either variable can go on the horizontal axis.
Examining a Scatterplot
In any graph of data, look for the overall pattern and for striking deviations from that pattern.
• You can describe the overall pattern of a scatterplot by the form, direction, and strength of the relationship.
• An important kind of deviation is an outlier, an individual that falls outside of the overall pattern of the relationship.
Form: shape of scatterplot
Interpret the scatterplot to the right.
Direction: Decreases from left to right. The higher the percentage of people taking the SAT, the lower the mean math score. There is a negative association.
Form: The relationship is slightly curved.
Clusters/gaps: In about half the states, less than 25% took the SAT; in the other half, more than 40% took it.
Strength: Moderately strong. States with similar percentages of people taking the SAT tend to have similar mean math scores.
Outliers: There appear to be two outliers: (20, 500) and (88, 460).
Describe what the scatterplot reveals about the relationship between body weight and backpack weight. (Direction, Form, Strength, Outliers)
*Hint: First describe the general pattern. Then identify any deviations from the pattern.
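Positive Association, Negative Association
Two variables are positively associated when above-average values of one tend to accompany above-average values of the other, and below-average values also tend to occur together.
Two variables are negatively associated when above-average values of one tend to accompany below-average values of the other.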
Examples:
Positive Association: Backpack weight generally increases as body weight increases
Negative Association: The mean SAT score goes down as the percent of graduates taking the test increases.
Thursday Oct. 20th
Tuesday October 27th
1. You have data for many years on the average price of a barrel of oil and the average retail price of a gallon of gas. If you want to see how well the price of oil predicts the price of gas, then you should make a scatterplot with _______ as the explanatory variable.
a) the price of oil
b) the price of gas
c) the year
d) either oil price or gas price
e) time
2. A study was designed to determine if smoking influences life expectancy. What will the explanatory and response variables in this study be?
1. Describe the direction of the relationship. Explain why this makes sense.
Positive Association. The longer the duration, the longer the interval.
2. What form does the relationship take? Why are there two clusters of points?
Roughly linear. There are two clusters, around 2 and 4.5, because most eruptions fall into two categories: shorter (around 2 minutes) and longer (around 4.5 minutes).
3. How strong is the relationship? Justify your answer.
Fairly strong. The points don't deviate much from a linear form.
4. Are there any outliers?
A couple of points could be outliers, but for the most part the points follow the overall pattern.
5. What information does the family need to predict when the next eruption will occur?
The duration of the previous eruption.
The two scatterplots above show the same data set using two different scales. Since it's easy to be fooled by different scales or amount of space around points in a scatterplot, we need a numerical measure to supplement the graph.
Correlation
The correlation (r) measures the direction and strength of the linear relationship between two quantitative variables.
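Suppose that we have data on variables x and y for n individuals. The values for the first individual are $x_1$ and $y_1$, the values for the second individual are $x_2$ and $y_2$, and so on. The means and standard deviations of the two variables are $\bar{x}$ and $s_x$ for the x values, and $\bar{y}$ and $s_y$ for the y values. The correlation between x and y is:
$$r = \frac{1}{n-1} \sum \left(\frac{x_i - \bar{x}}{s_x}\right)\left(\frac{y_i - \bar{y}}{s_y}\right)$$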
$n$: sample size
$\sum$: summation, "add these terms for all individuals"
$\bar{x}$: mean of the x values
$\bar{y}$: mean of the y values
$s_x$: standard deviation of the x values
$s_y$: standard deviation of the y values
$x_i$, $y_i$: the x and y values for the i-th term
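The pieces listed above can be put together in a few lines of Python. This is only a sketch: the function name `correlation` and the small data set are made up for illustration and do not come from the notes.

```python
from statistics import mean, stdev


def correlation(xs, ys):
    """Return r = (1 / (n - 1)) * sum of (z-score of x) * (z-score of y)."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)  # sample standard deviations (n - 1 divisor)
    products = [((x - x_bar) / s_x) * ((y - y_bar) / s_y) for x, y in zip(xs, ys)]
    return sum(products) / (n - 1)


# Hypothetical data, invented for illustration only.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

r = correlation(x, y)
print(f"r = {r:.4f}")
```

Each term in the sum multiplies a standardized x value by the matching standardized y value, which is why r carries no units.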
Interpreting Correlation
1. r is always a number between -1 and 1. r > 0 indicates a positive association and r < 0 indicates a negative association. Values of r near 0 indicate a very weak linear relationship. r = -1 and r = 1 occur only in the case of a perfect linear relationship, where all points lie exactly on a line.
2. Since r uses the standardized values of the observations, r does not change when we change the units of measurement of x, y, or both.
3. Correlation makes no distinction between explanatory and response variables. (It doesn't matter which variable you call x and which you call y.)
4. Correlation, r, has no unit of measurement.
5. Correlation does not describe curved relationships between variables, only linear relationships. A correlation of 0 doesn't guarantee that there's no relationship, just that there's no linear relationship.
6. Correlation is not resistant: r is strongly affected by a few outlying observations.
7. Correlation is not a complete summary of two-variable data.
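Several of these facts can be checked numerically. The sketch below assumes a helper named `correlation` and uses hypothetical height and weight values; neither comes from the notes.

```python
from statistics import mean, stdev


def correlation(xs, ys):
    """Correlation r computed from standardized values."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)
    return sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y) for x, y in zip(xs, ys)) / (n - 1)


# Hypothetical data, invented for illustration only.
height_in = [60, 62, 65, 68, 72]       # heights in inches
weight_lb = [110, 125, 140, 150, 180]  # weights in pounds

r_original = correlation(height_in, weight_lb)

# Fact 2: changing units (inches to cm, pounds to kg) leaves r unchanged.
height_cm = [2.54 * h for h in height_in]
weight_kg = [0.4536 * w for w in weight_lb]
r_metric = correlation(height_cm, weight_kg)

# Fact 3: swapping x and y leaves r unchanged.
r_swapped = correlation(weight_lb, height_in)

# Fact 5: a perfect curved relationship (y = x^2) can still have r = 0.
xs = [-2, -1, 0, 1, 2]
r_curved = correlation(xs, [v ** 2 for v in xs])

# Fact 6: one unusual individual (tall but light) changes r a great deal.
r_with_outlier = correlation(height_in + [75], weight_lb + [100])

print(round(r_original, 3), round(r_metric, 3), round(r_swapped, 3))
print(round(r_curved, 3), round(r_with_outlier, 3))
```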
Wednesday October 28th
1. The following scatterplot shows reading test scores against IQ test scores for 14 fifth grade students. There is one outlier in the plot. What are the scores for that child?
2. In a scatterplot of the average price of a barrel of oil and the average retail price of a gallon of gas, you expect to see...
Least Squares Regression is a method for finding a line that summarizes the relationship between two variables.
• A regression line is a straight line that describes how a response variable (y) changes as an explanatory variable (x) changes.
• A regression line is often used to predict the value of y for a given x value. Regression, unlike correlation, requires that you have an explanatory and a response variable.
• A regression line is a model for the data
$\hat{y} = a + bx$
$\hat{y}$ (y-hat): the predicted value of the response variable y for a given value of the explanatory variable x.
$a$: the y-intercept, the predicted value of y when x = 0.
$b$: the slope, the amount by which y is predicted to change when x increases by one unit.
Everyone knows that cars and trucks lose value the more they are driven. Can we predict the price of a used Ford F150 SuperCrew 4x4 if we know how many miles it has on the odometer? A random sample of 16 used F150s was selected from among those listed for sale at autotrader.com. The number of miles driven and the price (in dollars) were recorded for each of the trucks. Here's the data:
Example 1: Identify the slope and y-intercept from the regression line and interpret each value in context.
Back to the Ford F150 problem...
Example 1: How much would a Ford F150 be worth if it has 100,000 miles on it?
Example 2: How much would a Ford F150 be worth if it has 300,000 miles on it?
Monday October 24th
The scores on the Chapter 2 Test are as follows:
89, 88, 79, 89, 58, 84, 95, 79, 93, 92, 91, 94, 70, 93, 92, 87, 91, 73, 50, 91
What measure of center and spread would you choose to describe the data?
Which is higher, the median or the mean?
Graph the data and describe the distribution (SOCS).
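One way to check the mean-versus-median question is to compute both directly. The sketch below uses Python's standard `statistics` module on the scores listed above; the variable names are illustrative.

```python
from statistics import mean, median

# Chapter 2 Test scores from the warm-up above.
scores = [89, 88, 79, 89, 58, 84, 95, 79, 93, 92,
          91, 94, 70, 93, 92, 87, 91, 73, 50, 91]

score_mean = mean(scores)
score_median = median(scores)

print("mean:", score_mean)
print("median:", score_median)
print("median is higher than mean:", score_median > score_mean)
```

A few low scores (50 and 58) pull the mean below the median, which suggests a left-skewed distribution.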
Example 3: Find and interpret the residual for the Ford F150 that had 70,583 miles driven and a price of $21,994.
The least squares regression line of y on x is the line that makes the sum of the squares of the vertical distances of the data points from the line as small as possible.
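In symbols, a residual is the vertical distance described above: the observed value minus the predicted value. The notation below is the usual textbook convention and is assumed here, since the notes do not show it.

```latex
\text{residual} = y - \hat{y}
\qquad\qquad
\text{least squares line: choose the line that minimizes } \sum_{i=1}^{n} \left(y_i - \hat{y}_i\right)^2
```

A positive residual means the point lies above the line (the line underpredicts); a negative residual means the point lies below it.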
Facts about Residual Plots
A residual plot is a scatterplot of the residuals against the explanatory variable. Residual plots help us assess whether a linear model is appropriate.
• The mean of the least squares residuals is always zero.
• A residual plot in effect turns the regression line horizontal. It magnifies the deviations of the points from the line, making it easier to see unusual observations and patterns.
• If the regression line captures the overall pattern of the data, there should be no pattern in the residuals.
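The first fact can be verified on a small example. This sketch fits a least squares line to hypothetical data (not from the notes) with an assumed helper `least_squares`, then lists the (x, residual) pairs that a residual plot would display.

```python
from statistics import mean


def least_squares(xs, ys):
    """Return (a, b) for the least squares line y-hat = a + b*x."""
    x_bar, y_bar = mean(xs), mean(ys)
    b = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
         / sum((x - x_bar) ** 2 for x in xs))
    a = y_bar - b * x_bar
    return a, b


# Hypothetical data, invented for illustration only.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

a, b = least_squares(x, y)

# residual = observed y - predicted y
residuals = [yi - (a + b * xi) for xi, yi in zip(x, y)]

# These pairs are the points of the residual plot.
for xi, res in zip(x, residuals):
    print(xi, round(res, 2))

print("mean of residuals:", round(mean(residuals), 10))
```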
Examining a Residual Plot
1. The residual plot should show no obvious patterns. Ideally it would look like the plot to the right.
2. A curved pattern in a residual plot shows that the relationship is NOT linear.
3. The residuals should be relatively small in size.
4. Increasing (or decreasing) spread about the line as x increases indicates that predictions of y will be less accurate for larger (or smaller) values of x.
5. Individual points with large residuals are outliers because they lie far from the line that describes the overall pattern.
6. Individual points that are extreme in the x direction may not have large residuals, but can be important.
An outlier is an observation that lies outside the overall pattern of the other observations.
An observation is influential for a statistical calculation if removing it would markedly change the result of the calculation. Points that are outliers in the x direction of a scatterplot are often influential for the least squares regression line.
Tuesday October 25th
Some data was collected on the weight of a male lab rat for the first 25 weeks after its birth. A scatterplot of the weight (in grams) and time since birth (in weeks) shows a fairly strong, positive linear relationship. The linear regression equation models the data fairly well:
1. What is the slope of the regression line? Explain what it means in context.
2. What is the y‐intercept? Explain what it means in context.
3. Predict the rat's weight after 16 weeks. Show your work.
4. Should you use the line to predict the rat's weight at age 2 years?
Standard Deviation of the Residuals
The average prediction error (or the mean of the residuals) is 0 whenever we use the least squares regression line. That's because the positive and negative residuals "balance out". But that doesn't tell us how far off the predictions are, on average.
So, we can say that our predictions are "off" by an average of _____.
This value gives the approximate size of a "typical" or "average" prediction error (residual)
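The notes do not show the formula for this value. A common textbook definition, assumed here, is s = sqrt(sum of squared residuals / (n - 2)). The sketch below applies it to hypothetical data; the helper name and the data are illustrative only.

```python
from math import sqrt
from statistics import mean


def least_squares(xs, ys):
    """Return (a, b) for the least squares line y-hat = a + b*x."""
    x_bar, y_bar = mean(xs), mean(ys)
    b = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
         / sum((x - x_bar) ** 2 for x in xs))
    return y_bar - b * x_bar, b


# Hypothetical data, invented for illustration only.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

a, b = least_squares(x, y)
residuals = [yi - (a + b * xi) for xi, yi in zip(x, y)]

# Assumed convention: divide the sum of squared residuals by n - 2.
n = len(x)
s = sqrt(sum(res ** 2 for res in residuals) / (n - 2))

print(f"s = {s:.4f}")
```

Squaring the residuals keeps the positive and negative errors from cancelling, so s reflects the typical size of a prediction error in the units of y.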
The coefficient of determination, $r^2$ (or "r-sq")
The coefficient of determination, $r^2$, is the fraction of the variation in the values of y that is accounted for by the least squares regression line of y on x. It tells us how well the least squares regression line predicts values of the response variable y.
We can calculate $r^2$ using the following formula:
$$r^2 = 1 - \frac{SSE}{SST}$$
$SST = \sum (y_i - \bar{y})^2$ measures the total variation in the y values.
$SSE = \sum (y_i - \hat{y}_i)^2$ is the sum of the squared errors (residuals).
The ratio $SSE/SST$ tells us what proportion of the total variation in y still remains after using the regression line to predict the values of the response variable.
*The least squares regression line accounts for _____ % of the variation in [response variable name].
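As a rough illustration, the calculation of r-squared as 1 minus the ratio of squared prediction errors (SSE) to total variation in y (SST) can be carried out as follows. The data and the helper `least_squares` are hypothetical and not part of the notes.

```python
from statistics import mean


def least_squares(xs, ys):
    """Return (a, b) for the least squares line y-hat = a + b*x."""
    x_bar, y_bar = mean(xs), mean(ys)
    b = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
         / sum((x - x_bar) ** 2 for x in xs))
    return y_bar - b * x_bar, b


# Hypothetical data, invented for illustration only.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

a, b = least_squares(x, y)
y_bar = mean(y)

sst = sum((yi - y_bar) ** 2 for yi in y)                      # total variation in y
sse = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))   # variation left over
r_squared = 1 - sse / sst

print(f"SST = {sst:.2f}, SSE = {sse:.2f}, r^2 = {r_squared:.2f}")
```

For this made-up data the line accounts for about 60% of the variation in y.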
slope = 1.109: For every 1 mpg increase in city mileage, the highway mileage is predicted to increase by 1.109 mpg.
y-intercept = 4.62: When the city mileage is zero, we predict a highway mileage of 4.62 mpg.
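With the slope and intercept above, a prediction is a single line of arithmetic. In this sketch the function name and the city mileage of 20 mpg are hypothetical choices for illustration.

```python
SLOPE = 1.109      # from the notes: predicted change in highway mpg per 1 city mpg
INTERCEPT = 4.62   # from the notes: predicted highway mpg at 0 city mpg


def predict_hwy_mpg(city_mpg):
    """Predicted highway mileage: y-hat = 4.62 + 1.109 * (city mpg)."""
    return INTERCEPT + SLOPE * city_mpg


city = 20  # hypothetical city mileage, chosen for illustration
print(f"predicted highway mpg: {predict_hwy_mpg(city):.2f}")

# Raising city mileage by 1 mpg raises the prediction by the slope.
print(f"change per 1 mpg: {predict_hwy_mpg(city + 1) - predict_hwy_mpg(city):.3f}")
```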
Here's a residual plot for the least squares regression of pack weight on body weight for the 8 hikers.
Tuesday November 3rd
Create a residual plot of the F150 data.
1. Calculate the standard deviation of the residuals for the F150 problem. Interpret what it means in the context.
2. Calculate the coefficient of determination and interpret what it means in the context.
Monday Oct. 31st Refer to pg. 165 for data table
We can give the equation of the least-squares regression line in terms of the means and standard deviations of the two variables and their correlation:
$$\hat{y} = a + bx$$
where $b = r \dfrac{s_y}{s_x}$ and $a = \bar{y} - b\bar{x}$.
We know that every least squares regression line passes through the point $(\bar{x}, \bar{y})$.
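A short sketch of building the line from summary statistics alone, using slope = r times (s_y / s_x) and intercept = (mean of y) minus slope times (mean of x). The data are hypothetical, and the last step checks that the line passes through the point of means.

```python
from statistics import mean, stdev

# Hypothetical data, invented for illustration only.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

n = len(x)
x_bar, y_bar = mean(x), mean(y)
s_x, s_y = stdev(x), stdev(y)

# Correlation r, computed from deviations and sample standard deviations.
r = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / ((n - 1) * s_x * s_y)

b = r * s_y / s_x      # slope
a = y_bar - b * x_bar  # y-intercept

# The line evaluated at the mean of x gives the mean of y.
predicted_at_mean = a + b * x_bar

print(f"y-hat = {a:.2f} + {b:.2f}x")
print(f"prediction at x-bar: {predicted_at_mean:.2f}, y-bar: {y_bar}")
```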
With all data:
Excluding Child 18:
Excluding Child 19:
6. What do you notice?
Removing child 18 has a strong influence on the position of the regression line. However, removing child 19 has little effect on the regression line.
A point that is extreme in the x direction with no other points near it pulls the line toward itself. We call these points influential.
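The same effect can be reproduced with made-up numbers. In this sketch the data are hypothetical (they are not the children's data from the notes): one point is extreme in the x direction, and another is unusual only in the y direction.

```python
from statistics import mean


def slope(points):
    """Least squares slope for a list of (x, y) points."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    x_bar, y_bar = mean(xs), mean(ys)
    return (sum((x - x_bar) * (y - y_bar) for x, y in points)
            / sum((x - x_bar) ** 2 for x in xs))


# Hypothetical data, invented for illustration only.
cluster = [(i, i) for i in range(1, 9)]  # points lying on y = x
y_outlier = (4.5, 12)                    # unusual in y, but in the middle of the x values
extreme_x = (20, 8)                      # far from the other x values

all_points = cluster + [y_outlier, extreme_x]

slope_all = slope(all_points)
slope_without_extreme_x = slope(cluster + [y_outlier])
slope_without_y_outlier = slope(cluster + [extreme_x])

print(f"all points:          {slope_all:.3f}")
print(f"without extreme-x:   {slope_without_extreme_x:.3f}")
print(f"without y-outlier:   {slope_without_y_outlier:.3f}")
```

Removing the point that is extreme in x changes the slope far more than removing the point with the large residual, which is what "influential" means.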
Recall: The coefficient of determination:
How to interpret: "The least squares regression line accounts for _____ % of the variation in [response variable name]."
Standard Deviation of the Residuals: This value gives the approximate size of a "typical" or "average" prediction error (residual)
How to interpret: "Our predictions are "off" by an average of ______ [response variable name]."
Bottom Line:
Association does NOT imply causation!