Date post: | 31-Mar-2015 |
Category: |
Documents |
Upload: | shemar-satterthwaite |
Please start your Daily Portfolio
Introduction to Statistics for the Social Sciences
SBS200, COMM200, GEOG200, PA200, POL200, or SOC200
Lecture Section 001, Summer Session II, 2013
9:00 - 11:20am Monday - Friday
Room 312 Social Sciences (Monday – Thursday)
Room 480 Marshall Building (Fridays)
http://www.youtube.com/watch?v=oSQJP40PcGI
My last name starts with a letter somewhere between
A. A – D
B. E – L
C. M – R
D. S – Z
Please click in
Please double check: all cell phones and other electronic devices are turned off and stowed away
Homework due – Wednesday
On class website: Please print and complete homework worksheet #13
Multiple Regression
Schedule of readings
Before Friday
Please read chapters 10 – 14
Please read Chapters 17 and 18 in Plous
Chapter 17: Social Influences
Chapter 18: Group Judgments and Decisions
Study Guide is online
Next couple of lectures 7/30/13
Use this as your study guide
Simple and Multiple Regression
Using correlation for predictions
r versus r²
Regression uses the predictor variable (independent) to make predictions about the predicted variable (dependent)
Coefficient of correlation is the name for “r”
Coefficient of determination is the name for “r²”
(remember it is always positive – no direction info)
Standard error of the estimate is our measure of the variability of the dots around the regression line
(average deviation of each data point from the regression line – like standard deviation)
Coefficient of regression is “b” for each variable (like slope)
Other Problems
The expected frequency of teeth brushing for having one cavity is
Frequency of teeth brushing = 5.5 + (-.91) Cavities
If “Cavities” = 3, what is the prediction for “Frequency of teeth brushing”?
Frequency of teeth brushing = 5.5 + (-.91) Cavities
Frequency of teeth brushing = 5.5 + (-.91)(3)
Frequency of teeth brushing = 5.5 + (-2.73) = 2.77
Plot the point (3.0, 2.77)
Prediction line: Y’ = a + b1X1 (a = Y-intercept, b1 = slope)
If number of cavities = 3, frequency of teeth brushing will be 2.77
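The plug-in step above can be sketched in a few lines of Python. This is only an illustration: the function name predict_brushing is made up here, and the intercept and slope are the 5.5 and -.91 given on the slide.

```python
def predict_brushing(cavities, intercept=5.5, slope=-0.91):
    """Predicted brushing frequency Y' = a + b1 * X1 (coefficients from the slide)."""
    return intercept + slope * cavities

# Prediction for a person with 3 cavities, rounded as on the slide.
prediction = round(predict_brushing(3), 2)
print(prediction)
```

Changing the argument gives the prediction for any other number of cavities along the same line.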
Review
r = -0.85
b1 = -0.91 (slope)
b0 = 5.5 (intercept)
Draw a regression line and regression equation
Prediction line: Y’ = b1X1 + b0
Y’ = (-.91)X1 + 5.5
Correlation - let’s predict how often they brushed their teeth
[Scatterplot: number of cavities (X axis, 0 to 5) versus number of times per day teeth are brushed (Y axis, 0 to 5)]
Find prediction line: Y’ = b1 X + b0
Y’ = (-0.91) X + 5.5
Plot line - predict Y’ from X
- Pick an X. Let’s try X of 1: Y’ = (-0.91)(1) + 5.5 = 4.59 (plot 1, 4.59)
- Pick another X. Let’s try X of 5: Y’ = (-0.91)(5) + 5.5 = 0.95 (plot 5, 0.95)
Review
r = -0.85, b1 = -0.91, b0 = 5.5
Y’ = b1 X + b0
Y’ = (-0.91) X + 5.5
Y’ = (-0.91)(1) + 5.5 = 4.59
Y’ = (-0.91)(2) + 5.5 = 3.68
Y’ = (-0.91)(3) + 5.5 = 2.77
Y’ = (-0.91)(5) + 5.5 = 0.95
X   Y
1   5
3   4
2   3
3   2
5   1
Review
Correlation - Evaluating the prediction line
Does the prediction line perfectly predict the Ys from the Xs?
No, let’s see
How much “error” is there? Exactly?
Prediction line: Y’ = b1X1 + b0
Y’ = (-.91)X1 + 5.5
[Scatterplot: number of cavities (X axis) versus number of times per day teeth are brushed (Y axis), with green vertical lines from each dot to the prediction line]
Residuals
The green lines show how much “error” there is in our prediction line: how much we are wrong in our predictions
Correlation
Perfect correlation = +1.00 or -1.00
The more closely the dots approximate a straight line (the less spread out they are), the stronger the relationship is.
One variable perfectly predicts the other
No variability in the scatterplot
The dots approximate a straight line
Any residuals?
[Scatterplot: number of cavities (X axis) versus number of times per day teeth are brushed (Y axis)]
• Shorter green lines suggest better prediction – smaller error
• Longer green lines suggest worse prediction – larger error
• Why are green lines vertical? Remember, we are predicting the variable on the Y axis, so error is how far we are wrong about Y (vertical)
How well does the prediction line predict the Ys from the Xs?
Residuals
A note about curvilinear relationships and patterns of the residuals
[Scatterplot: number of cavities (X axis) versus number of times per day teeth are brushed (Y axis)]
• Slope doesn’t give “variability” info
• Intercept doesn’t give “variability” info
• Correlation “r” does give “variability” info
• Residuals do give “variability” info
What if we want to know the “average deviation score”? Finding the standard error of the estimate (line)
Standard error of the estimate:
• a measure of the average amount of predictive error
• the average amount that Y’ scores differ from Y scores
• a mean of the lengths of the green lines
Sound familiar??
r = -0.85, b1 = -0.91, b0 = 5.5
Y’ = b1 X + b0
Y’ = (-0.91) X + 5.5
Y’ = (-0.91)(3) + 5.5 = 2.77
Y’ = (-0.91)(1) + 5.5 = 4.59
Y’ = (-0.91)(2) + 5.5 = 3.68
Y’ = (-0.91)(4) + 5.5 = 1.86
Y’ = (-0.91)(5) + 5.5 = 0.95
X   Y   Y’     Y-Y’
1   5   4.59    0.41
3   4   2.77    1.23
2   3   3.68   -0.68
3   2   2.77   -0.77
5   1   0.95    0.05
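The Y-Y’ column can be reproduced with a short Python sketch. The five (X, Y) pairs and the coefficients are the ones in the table; the variable and function names are only illustrative.

```python
# (X, Y) pairs from the table: cavities, brushing frequency.
data = [(1, 5), (3, 4), (2, 3), (3, 2), (5, 1)]

def predict(x, b0=5.5, b1=-0.91):
    """Predicted value Y' on the line Y' = b1 * X + b0."""
    return b1 * x + b0

# Residual = actual Y minus predicted Y', rounded to two decimals as in the table.
residuals = [round(y - predict(x), 2) for x, y in data]
print(residuals)
```

A positive residual means the dot sits above the line, a negative residual means it sits below.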
These are our “predicted values” for each X score
A note on adding up deviations
Residuals (Y - Y’): 0.41, 1.23, -0.77, -0.68, 0.05
X   Y   Y’     Y-Y’    (Y-Y’)²
1   5   4.59    0.41   0.168
3   4   2.77    1.23   1.513
2   3   3.68   -0.68   0.462
3   2   2.77   -0.77   0.593
5   1   0.95    0.05   0.0025
Sum of (Y-Y’)² = 2.739
$\sqrt{\frac{\sum (Y - Y')^2}{n - 2}} = \sqrt{\frac{2.739}{3}} \approx 0.95$
This is like our average (or standard) size of our residual: the “Standard Error of the Estimate”
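A rough Python sketch of the same calculation, assuming the five data points and the rounded line Y’ = (-0.91)X + 5.5 used above. The n - 2 divisor matches the 3 in the slide's calculation (5 points minus 2).

```python
import math

data = [(1, 5), (3, 4), (2, 3), (3, 2), (5, 1)]  # (X, Y) pairs from the table
b0, b1 = 5.5, -0.91                               # rounded coefficients from the slide

# Square each residual so positive and negative deviations cannot cancel.
squared = [(y - (b1 * x + b0)) ** 2 for x, y in data]
sse = sum(squared)

# Divide by n - 2, then take the square root.
std_error = math.sqrt(sse / (len(data) - 2))
print(round(sse, 3), round(std_error, 3))
```

With these rounded coefficients the result lands close to the 0.95 shown on the slide; small differences come from rounding.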
Is the regression line better than just guessing the mean of the Y variable?
How much does the information about the relationship actually help?
[Two scatterplots for comparison: number of cavities (X axis) versus number of times per day teeth are brushed (Y axis)]
Which minimizes error better?
How much better does the regression line predict the observed results?
r²
Wow!
What is r²?
r² = The proportion of the total variance in one variable that is predictable by its relationship with the other variable
Examples
If mother’s and daughter’s heights are correlated with an r = .8, then what amount (proportion or percentage) of variance of mother’s height is accounted for by daughter’s height?
.64 because (.8)² = .64
If mother’s and daughter’s heights are correlated with an r = .8, then what proportion of variance of mother’s height is not accounted for by daughter’s height?
.36 because (1.0 - .64) = .36, or 36% because 100% - 64% = 36%
If ice cream sales and temperature are correlated with an r = .5, then what amount (proportion or percentage) of variance of ice cream sales is accounted for by temperature?
.25 because (.5)² = .25
If ice cream sales and temperature are correlated with an r = .5, then what amount (proportion or percentage) of variance of ice cream sales is not accounted for by temperature?
.75 because (1.0 - .25) = .75, or 75% because 100% - 25% = 75%
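The four answers follow one pattern: square r, then subtract from 1. A small Python sketch (the function name variance_split is made up for illustration):

```python
def variance_split(r):
    """Return (proportion accounted for, proportion not accounted for) given r."""
    explained = r ** 2
    return explained, 1.0 - explained

# The two correlations used in the examples: heights (r = .8) and ice cream sales (r = .5).
for r in (0.8, 0.5):
    explained, unexplained = variance_split(r)
    print(f"r = {r}: {explained:.0%} accounted for, {unexplained:.0%} not accounted for")
```

Because r is squared, r = -.8 would give the same split as r = .8, which is why r² carries no direction information.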
Regression equations
Questions on homework?
The relationship between the hours worked and weekly pay is a strong positive correlation. This correlation is significant, r(3) = 0.92; p < 0.05
r = +0.92
positive, strong
up, down
Slope: 6.0857
Intercept: 55.286
y' = 6.0857x + 55.286
Predictions: 207.43, 85.71
r² = .846231 or 84%
84% of the total variance of “weekly pay” is accounted for by “hours worked”
For each additional hour worked, weekly pay will increase by $6.09
[Scatterplot: number of operators (X axis, ticks 4 to 8) versus wait time (Y axis, ticks 280 to 400)]
r = -.73
The relationship between wait time and number of operators working is negative and strong. This correlation is not significant, r(3) = -0.73; n.s.
negative, strong
As number of operators increases, wait time decreases
Intercept: 458
Slope: -18.5
y' = -18.5x + 458
Predictions: 365 seconds, 328 seconds
r² = .53695 or 54%
The proportion of total variance of wait time accounted for by number of operators is 54%.
For each additional operator added, wait time will decrease by 18.5 seconds
Critical r = 0.878
No, we do not reject the null
[Scatterplot: median income (X axis, ticks 45 to 66) versus percent of BAs (Y axis, ticks 21 to 39)]
r = 0.8875
The relationship between median income and percent of residents with BA degree is strong and positive. This correlation is significant, r(8) = 0.89; p < 0.05.
positive, strong
As median income goes up, so does percent of residents who have a BA degree
Percent of residents with a BA degree
n = 10, df = 8
Intercept: 3.1819
Slope: 0.0005
y' = 0.0005x + 3.1819
Predictions: 25% of residents, 35% of residents
r² = .78766 or 78%
The proportion of total variance of % of BAs accounted for by median income is 78%.
For each additional $1 in income, percent of BAs increases by .0005
Critical r = 0.632
Yes, we reject the null
[Scatterplot: median income (X axis, ticks 45 to 66) versus crime rate (Y axis, ticks 12 to 30)]
r = -0.6293
The relationship between crime rate and median income is negative and moderate. This correlation is not significant, r(8) = -0.63; n.s. [0.6293 is not bigger than critical of 0.632].
negative, moderate
As median income goes up, crime rate tends to go down
Crime Rate
n = 10, df = 8
Intercept: 4662.5
Slope: -0.0499
y' = -0.0499x + 4662.5
Predictions: 2,417 thefts, 1,418.5 thefts
r² = .396 or 40%
The proportion of total variance of thefts accounted for by median income is 40%.
For each additional $1 in income, thefts go down by .0499
Critical r = 0.632
No, we do not reject the null
Example of Simple Regression
The manager of a copier company wants to determine whether there is a relationship between the number of sales calls made in a month and the number of copiers sold that month. The manager selects a random sample of 10 representatives and determines the number of sales calls each representative made last month and the number of copiers sold.
What are we predicting?
Correlation: Independent and dependent variables
• When used for prediction we refer to the predicted variable as the dependent variable and the predictor variable as the independent variable
[Data table of sales representatives with the dependent variable and independent variable labeled; names shown include Soni, Mark, Tom, Susan, Jeff, Carlos]
Who sold the most copiers?
Who sold the fewest copiers?
Correlation Coefficient – Excel Example
r = 0.759014
Interpret r = 0.759
• Positive relationship between the number of sales calls and the number of copiers sold.
• Strong relationship
• Remember, we have not demonstrated cause and effect here, only that the two variables (sales calls and copiers sold) are related.
• Does this correlation reach significance?
• n = 10, df = 8
• alpha = .05
• Observed r is larger than critical r (0.759 > 0.632), therefore we reject the null hypothesis.
• r(8) = 0.759; p < 0.05
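The comparison against critical r can be sketched in Python. This is an illustration, not the course's procedure: it assumes a two-tailed test at alpha = .05, takes the critical t values for df = 3 and df = 8 from a standard t table, and converts them to critical r with r = t / sqrt(t² + df). The names CRITICAL_T_05, critical_r and is_significant are invented here.

```python
import math

# Two-tailed critical t values at alpha = .05, from a standard t table.
CRITICAL_T_05 = {3: 3.182, 8: 2.306}

def critical_r(df):
    """Convert the critical t for this df into the equivalent critical r."""
    t = CRITICAL_T_05[df]
    return t / math.sqrt(t ** 2 + df)

def is_significant(r, df):
    """Reject the null when the observed |r| exceeds the critical r."""
    return abs(r) > critical_r(df)

print(round(critical_r(8), 3), is_significant(0.759, 8))
```

The absolute value matters: a negative correlation is judged by its size, not its sign.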
Coefficient of Determination – Excel Example
r = 0.759014
Interpret r² = 0.576 (0.759² = 0.576)
• We can say that 57.6 percent of the variation in the number of copiers sold is explained, or accounted for, by the variation in the number of sales calls.
• Remember, we lose the directionality of the relationship with the r²
Find Regression Equation – Excel Example
Regression Equation - Example
State the regression equation
Y’ = a + bx
Y’ = 18.9476 + 1.1842x
What is the expected number of copiers sold by a representative who made 20 calls?
Solve for some value of Y’
Y’ = 18.9476 + 1.1842 (20)
Y’ = 42.63
If you make this many calls (20), you probably sell this much (42.63)
Interpret the slope
Y’ = 18.9476 + 1.1842x
“For each additional sales call made we sell 1.1842 more copiers”
What is the expected number of copiers sold by a representative who made 40 calls?
Solve for some value of Y’
Y’ = 18.9476 + 1.1842 (40)
Y’ = 66.3156
If you make this many calls (40), you probably sell this much (66.3156)
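The two predictions and the slope interpretation can be checked with a short Python sketch. The coefficients come from the regression equation above; the function name expected_copiers is only illustrative.

```python
def expected_copiers(calls, a=18.9476, b=1.1842):
    """Expected copiers sold, Y' = a + b * x, for a given number of sales calls."""
    return a + b * calls

for calls in (20, 40):
    print(calls, round(expected_copiers(calls), 4))

# The slope is the change in Y' for one additional call.
per_extra_call = expected_copiers(21) - expected_copiers(20)
```

The difference between any two predictions one call apart is always the slope, which is what "for each additional sales call" means.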
An example for The Standard Error of Estimate
The standard error of estimate measures the scatter, or dispersion, of the observed values around the line of regression
A formula that can be used to compute the standard error:
Standard error of the estimate (line) = $\sqrt{\frac{\sum (Y - Y')^2}{n - 2}}$
Regression Analysis – Least Squares Principle
When we calculate the regression line we try to:
• minimize distance between predicted Ys and actual (data) Y points (length of green lines)
• remember, because the negative and positive values cancel each other out, we have to square those distances (deviations)
• so we are trying to minimize the “sum of squares of the vertical distances between the actual Y values and the predicted Y values”
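The least squares idea can be illustrated in Python. The copier data are not reproduced in these notes, so this sketch reuses the five cavities and brushing pairs from the earlier example; the closed-form slope and intercept formulas are the standard least squares ones, and the function names are invented for illustration.

```python
def least_squares(points):
    """Slope and intercept that minimize the sum of squared vertical distances."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    b1 = sxy / sxx
    b0 = mean_y - b1 * mean_x
    return b0, b1

def sum_squared_error(points, b0, b1):
    """Sum of squared vertical distances between actual Y and predicted Y'."""
    return sum((y - (b0 + b1 * x)) ** 2 for x, y in points)

cavities = [(1, 5), (3, 4), (2, 3), (3, 2), (5, 1)]  # (X, Y) pairs from the earlier example
b0, b1 = least_squares(cavities)
print(round(b0, 2), round(b1, 2))
```

Any other line through the same dots, for example one with a slightly different slope, gives a larger sum of squares than the fitted one.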
The Standard Error of Estimate
Step 1: List all the Y data points
Step 2: Find all the predicted Y’ data points
Step 3: Find deviations
Step 4: Square and add up deviations
Then simply plug in the numbers and solve for the standard error of the estimate
Remember conceptually, this is like the average of the length of those green lines
$\sqrt{\frac{784.211}{10 - 2}} = 9.901$
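The final plug-in step can be verified with a couple of lines of Python, assuming the sum of squared deviations (784.211) and the sample size (10) shown in the calculation. The function name is illustrative.

```python
import math

def standard_error_of_estimate(sum_sq_dev, n):
    """Square root of the summed squared deviations divided by n - 2."""
    return math.sqrt(sum_sq_dev / (n - 2))

result = round(standard_error_of_estimate(784.211, 10), 3)
print(result)
```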
Writing Assignment - 5 Questions
1. What is regression used for?
• Include an example
2. What is a residual? How would you find it?
3. What is Standard Error of the Estimate? (How is it related to residuals?)
4. Give one fact about r²
5. How is a regression line like a mean?
Writing Assignment - 5 Questions (answers)
1. What is regression used for?
• Include an example
Regressions are used to take advantage of relationships between variables described in correlations. We choose a value on the independent variable (on x axis) to predict values for the dependent variable (on y axis).
2. What is a residual? How would you find it?
Residuals are the difference between our predicted y (y’) and the actual y data points. Once we choose a value on our independent variable and predict a value for our dependent variable, we look to see how close our prediction was. We are measuring how “wrong” we were, or the amount of “error” for that guess.
Y – Y’
3. What is Standard Error of the Estimate? (How is it related to residuals?)
• The average length of the residuals
• The average error of our guess
• The average length of the green lines
• The standard deviation of the regression line
4. Give one fact about r²
5. How is a regression line like a mean?
Correlation - the prediction line
Prediction line - what is it good for?
• makes the relationship easier to see (even if specific observations - dots - are removed)
• identifies the center of the cluster of (paired) observations
• identifies the central tendency of the relationship (kind of like a mean)
• can be used for prediction
• should be drawn to provide a “best fit” for the data
• should be drawn to provide maximum predictive (explanatory) power for the data
• should be drawn to provide minimum predictive error
Some useful terms
• Regression uses the predictor variable (independent) to make predictions about the predicted variable (dependent)
• Coefficient of correlation is the name for “r”
• Coefficient of determination is the name for “r²” (remember it is always positive – no direction info)
• Standard error of the estimate is our measure of the variability of the dots around the regression line (average deviation of each data point from the regression line – like standard deviation)
Correlation: Independent and dependent variables
• When used for prediction we refer to the predicted variable as the dependent variable and the predictor variable as the independent variable
[Two example plots, each with its dependent variable and independent variable labeled]
What are we predicting?
Multiple regression equations
Prediction line Y’ = b1X1 + b0
We can predict amount of crime in a city from
• the number of bathrooms in city
How many dependent variables? 1
How many independent variables? 1
Prediction line Y’ = b1X1 + b2X2 + b0
We can predict amount of crime in a city from
• the number of bathrooms in city
• the amount spent on education in city
Prediction line Y’ = b1X1 + b2X2 + b3X3 + b0
We can predict amount of crime in a city from
• the number of bathrooms in city
• the amount spent on education in city
• the amount spent on after-school programs
How many dependent variables? 1
How many independent variables? 3
Multiple regression
• Used to describe the relationship between several independent variables and a dependent variable.
Prediction line Y’ = b1X1 + b2X2 + b3X3 + b0
Can we predict amount of crime in a city from the number of bathrooms and the amount spent on education and on after-school programs?
• X1, X2 and X3 are the independent variables.
• Y is the dependent variable (amount of crime)
• b0 is the Y-intercept
• b1 is the net change in Y for each unit change in X1, holding X2 and X3 constant. It is called a regression coefficient.
Multiple regression will use multiple independent variables to predict the single dependent variable
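A small Python sketch of what "holding X2 and X3 constant" means. The coefficients and input values below are hypothetical, chosen only to illustrate the arithmetic; they are not estimates for the crime example.

```python
def predict(b0, coefficients, xs):
    """Y' = b0 + b1*X1 + b2*X2 + ... for any number of independent variables."""
    return b0 + sum(b * x for b, x in zip(coefficients, xs))

# Hypothetical intercept and regression coefficients (illustration only).
b0 = 100.0
coefs = [2.0, -0.5, -1.5]

base = predict(b0, coefs, [10, 40, 20])
bumped = predict(b0, coefs, [11, 40, 20])  # X1 up by one unit, X2 and X3 unchanged
print(base, bumped - base)
```

The difference between the two predictions equals b1, the net change in Y’ for a one-unit change in X1.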
[Plot: expenses per year (X axis) versus yearly income (Y axis). If you spend this much, you probably make this much.]
The predicted variable goes on the “Y” axis and is called the dependent variable.
The predictor variable goes on the “X” axis and is called the independent variable.
[3-D plot: dependent variable (predicted) against independent variable 1 (predictor) and independent variable 2 (predictor). If you spend this much and you save this much, you probably make this much.]
Regression Plane for a 2-Independent Variable Linear Regression Equation
Multiple regression equations
Can use variables to predict
• behavior of stock market
• probability of accident
• amount of pollution in a particular well
• quality of a wine for a particular year
• which candidates will make best workers
Can we predict heating cost?
Three variables are thought to relate to the heating costs: (1) the mean daily outside temperature, (2) the number of inches of insulation in the attic, and (3) the age in years of the furnace.
To investigate, Salisbury's research department selected a random sample of 20 recently sold homes. It determined the cost to heat each home last January
Multiple Linear Regression - Example
The Multiple Regression Equation – Interpreting the Regression Coefficients
b1 = The regression coefficient for mean outside temperature (X1) is -4.583.
The coefficient is negative and shows a negative correlation between heating cost and temperature.
As the outside temperature increases, the cost to heat the home decreases. The numeric value of the regression coefficient provides more information. If we increase temperature by 1 degree and hold the other two independent variables constant, we can estimate a decrease of $4.583 in monthly heating cost.
b2 = The regression coefficient for mean attic insulation (X2) is -14.831.
The coefficient is negative and shows a negative correlation between heating cost and insulation.
The more insulation in the attic, the less the cost to heat the home. So the negative sign for this coefficient is logical. For each additional inch of insulation, we expect the cost to heat the home to decline $14.83 per month, regardless of the outside temperature or the age of the furnace.
b3 = The regression coefficient for age of furnace (X3) is 6.101.
The coefficient is positive and shows a positive correlation between heating cost and age of the furnace.
As the age of the furnace goes up, the cost to heat the home increases.
Specifically, for each additional year older the furnace is, we expect the cost to increase $6.10 per month.
Applying the Model for Estimation
What is the estimated heating cost for a home if:
• the mean outside temperature is 30 degrees,
• there are 5 inches of insulation in the attic, and
• the furnace is 10 years old?
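A sketch of how the estimate would be computed in Python. The three regression coefficients are the ones interpreted above; the intercept b0 is not given in these notes (it would come from the Excel regression output), so it is left as a parameter here rather than guessed. The function name heating_cost is illustrative.

```python
def heating_cost(b0, temperature, insulation, furnace_age):
    """Y' = b0 + b1*X1 + b2*X2 + b3*X3 with the coefficients from the example."""
    return b0 - 4.583 * temperature - 14.831 * insulation + 6.101 * furnace_age

# With b0 = 0 this shows only the combined contribution of the three predictors;
# add the intercept from the regression output to get the estimated cost.
slope_part = heating_cost(0.0, temperature=30, insulation=5, furnace_age=10)
print(round(slope_part, 3))
```

The estimated heating cost for this home is the intercept plus that combined contribution.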