Statistics & Data Analysis
Course Number: B01.1305
Course Section: 31
Meeting Time: Wednesday 6-8:50 pm
Regression and Correlation
Professor S. D. Balkin -- March 5, 2003
Class Outline
Review of last class
Analyzing bivariate data
Correlation Analysis
Regression Analysis
Discuss Midterm Exam
Review of Last Class
The Research Hypothesis, or Alternative Hypothesis, is what the researcher is trying to prove
• Denoted: Ha
The Null Hypothesis is the denial of the research hypothesis; it is what we are trying to disprove
• Denoted: H0
Basic Logic:
1. Assume that H0 is true
2. Calculate the value of the test statistic
3. If this value is highly unlikely, reject H0 and support Ha; else, fail to reject H0 due to lack of evidence
4. Calculate the p-value to determine the significance of the result
We can use the sampling distribution to determine which values of the test statistic are sufficiently unlikely given the null hypothesis
Rejection Region
For α = 0.05, reject $H_0$ if the computed z statistic is greater than 1.645; equivalently, if the observed value of $\bar{y}$ is more than $1.645\sigma_{\bar{y}} = 3.948$ above 72, i.e., if $\bar{y} > 72 + 3.948 = 75.948$.
[Figure: sampling distribution of $\bar{y}$ under $H_0$, with the α = 0.05 rejection region shaded above the cutoff]
Example: 1-Sample Proportion Test
In an experiment to test E.S.P., a subject in one room is asked to state the color of a card chosen from a well-shuffled deck of 50 cards by an individual in another room
It is unknown to the subject how many red or blue cards are in the deck
The subject identifies 32 cards correctly
Questions:
• Set up the hypotheses to be tested
• Calculate the test statistic
• Are the results significant at the 0.05 and/or 0.01 level of significance?
• Instead… interpret results from computer output
Example: Computer Output
Test and CI for One Proportion
Test of p = 0.5 vs p > 0.5
                                             Exact
Sample    X    N  Sample p  95.0% Lower Bound  P-Value
     1   32   50  0.640000           0.514231    0.032
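This Minitab output can be reproduced in R. A minimal sketch, assuming the exact binomial test is what produced the 0.032 p-value:

# One-sample proportion test: 32 correct out of 50,
# testing p = 0.5 against p > 0.5 with the exact binomial test
binom.test(x = 32, n = 50, p = 0.5, alternative = "greater")
# reports sample p = 0.64 and a one-sided exact p-value of about 0.032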
Linear Regression and Correlation Methods
Chapter 11
Chapter Goals
Introduction to Bivariate Data Analysis
• Introduction to Simple Linear Regression Analysis
• Introduction to Linear Correlation Analysis
Interpret scatterplots
Understand linear models and parameter estimation
Identify high-leverage observations
Understand the use of transformations
Perform hypothesis tests and confidence intervals for the linear model and correlation coefficient
Motivating Example Before a pharmaceutical sales rep can speak about a product
to physicians, he must pass a written exam
An HR Rep designed such a test with the hopes of hiring the best possible reps to promote a drug in a high potential area
In order to check the validity of the test as a predictor of weekly sales, he chose 5 experienced sales reps and piloted the test with each one
The test scores and weekly sales are given in the following table:
Motivating Example (cont)
SALESPERSON TEST SCORE WEEKLY SALES
JOHN 4 $5,000
BRENDA 7 $12,000
GEORGE 3 $4,000
HARRY 6 $8,000
AMY 10 $11,000
Introduction to Bivariate Data
Up until now, we've focused on univariate data
Analyzing how two (or more) variables "relate" is very important to managers
• Prediction equations
• Estimate uncertainty around a prediction
• Identify unusual points
• Describe relationship between variables
Visualization
• Scatterplot
Scatterplot
Do Test Score and Weekly Sales appear related?
[Scatterplot of weekly sales ($4,000-$12,000, y-axis) against test score (3-10, x-axis)]
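A sketch of how this plot can be reproduced in R, entering the data from the table two slides back:

# Test scores and weekly sales for the five sales reps
score <- c(4, 7, 3, 6, 10)
sales <- c(5000, 12000, 4000, 8000, 11000)
plot(score, sales)  # scatterplot of weekly sales against test score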
Correlation
Boomers' Little Secret Still Smokes Up the Closet (July 14, 2002)
…Parental cigarette smoking, past or current, appeared to have a stronger correlation to children's drug use than parental marijuana smoking, Dr. Kandel said. The researchers concluded that parents influence their children not according to a simple dichotomy — by smoking or not smoking — but by a range of attitudes and behaviors, perhaps including their style of discipline and level of parental involvement. Their own drug use was just one component among many…
A Bit of a Hedge to Balance the Market Seesaw (July 7, 2002)
…Some so-called market-neutral funds have had as many years of negative returns as positive ones. And some have a high correlation with the market's returns…
Correlation Analysis
Statistical techniques used to measure the strength of the relationship between two variables
Correlation Coefficient: describes the strength of the relationship between two sets of variables
• Denoted r
• r assumes a value between -1 and +1
• r = -1 or r = +1 indicates a perfect correlation
• r = 0 indicates no linear relationship between the two sets of variables
• The direction of the relationship is given by the coefficient's sign
• The strength of the relationship does not depend on the direction
• r measures the LINEAR relationship ONLY
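For reference (the slide lists the properties of r but not its formula), the sample correlation coefficient is computed as

$$r = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2 \, \sum_i (y_i - \bar{y})^2}}$$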
Against All Odds: Correlation
Example Correlations
[Six scatterplot panels illustrating correlations of varying strength and direction: r = -0.9, r = -0.73, r = -0.25, r = 0.34, r = 0.7, r = 0.88]
Correlation Demo
Scatterplot
r = 0.88
[The weekly sales vs. test score scatterplot again, now labeled with its correlation]
Does Correlation Imply Causation??
Correlation and Causation
Must be very careful in interpreting correlation coefficients
Just because two variables are highly correlated does not mean that one causes the other
• Ice cream sales and the number of shark attacks on swimmers are correlated
• The miracle of the "Swallows" of Capistrano takes place each year at Mission San Juan Capistrano on March 19th and is accompanied by a large number of human births around the same time
• The number of cavities in elementary school children and vocabulary size have a strong positive correlation
To establish causation, a designed experiment must be run
CORRELATION DOES NOT IMPLY CAUSATION
Against All Odds: Causation
Testing the Significance of the Correlation Coefficient
When calculating a correlation coefficient, an obvious question arises: Is the implied relationship statistically significant, or due to random chance?
We can perform a hypothesis test of whether there is significant evidence that the true correlation coefficient differs from zero
$H_0: \rho = 0 \qquad H_A: \rho \neq 0$
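The slides omit the test statistic; the standard one, consistent with the computer output on the next slide, is $t = r\sqrt{n-2}/\sqrt{1-r^2}$, compared against a t distribution with n - 2 degrees of freedom.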
Example
Correlations: Score, Sales
Pearson's product-moment correlation
data:  score and sales
t = 3.1751, df = 3, p-value = 0.05028
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 -0.01947053  0.99189749
sample estimates:
      cor
0.8778762
What can we determine about this correlation coefficient?
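The output above is from R's cor.test; a minimal sketch that reproduces it from the example data:

# Pearson correlation between test score and weekly sales
score <- c(4, 7, 3, 6, 10)
sales <- c(5000, 12000, 4000, 8000, 11000)
cor(score, sales)       # sample correlation, about 0.878
cor.test(score, sales)  # t test of H0: true correlation is 0, plus a 95% CI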
The Regression Effect
Sir Francis Galton (1822-1911) studied a data set of 1,078 heights of fathers and sons. His data
set is on the next page. Two lines are superimposed: The dashed line is a 45-degree line, shifted up by one inch since on the average the sons were one inch taller than the fathers. The solid line is one that better fits the data set. (It’s the least squares line, to be described soon.)
Galton observed that tall fathers tend to have tall sons but the sons are not, on average, as tall as the fathers. Also, short fathers have short sons who, however, are not as short on average as their fathers. Galton called this effect “regression to the mean”.
In other words, the son’s height tends to be closer to the overall mean height than the father’s height was. Nowadays, the term “regression” is used more generally in statistics to refer to the process of fitting a line to data.
The Regression Effect (cont)
[Galton's scatterplot of the 1,078 father/son heights, with the shifted 45-degree line (dashed) and the least squares line (solid)]
Regression Analysis
Simple Regression Analysis is predicting one variable from another
• Past data on relevant variables are used to create and evaluate a prediction equation
The variable being predicted is called the dependent variable
The variable used to make the prediction is an independent variable
Introduction to Regression
Predicting future values of a variable is a crucial management activity
• Future cash flows
• Needs for raw materials in a supply chain
• Future personnel or real estate needs
Explaining past variation is also an important activity
• Explain past variation in demand for services
• Impact of an advertising campaign or promotion
Against All Odds: Describing Relationships
Introduction to Regression (cont.)
Prediction: Reference to future values
Explanation: Reference to current or past values
Simple Linear Regression: Single independent variable predicting a dependent variable
• The independent variable is typically something we can control
• The dependent variable is typically something that is linearly related to the value of the independent variable

$\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x$
Introduction to Regression (cont.)
Basic Idea: Fit a straight line that relates the dependent variable (y) and the independent variable (x)
Linearity Assumption: The slope of the equation does not change as x changes
Assuming linearity, we can write

$y = \beta_0 + \beta_1 x + \varepsilon$

which says that Y is made up of a predictable part (due to X) and an unpredictable part (the error $\varepsilon$)
The coefficients $\beta_0$ and $\beta_1$ are interpreted as the true, underlying intercept and slope
Regression Assumptions
We start by assuming that for each value of X, the corresponding value of Y is random, and has a normal distribution.
Formal Assumptions of Regression Analysis
1. The relation is in fact linear, so that the errors have mean zero: $E(\varepsilon_i) = 0$;
2. The errors all have the same variance: $\mathrm{Var}(\varepsilon_i) = \sigma_\varepsilon^2$;
3. The errors are independent of each other;
4. The errors are normally distributed: $\varepsilon_i \sim N(0, \sigma_\varepsilon^2)$
Which Line?
There are many good fitting lines through these points
Regression by Eye Applet
[The sales vs. score scatterplot; many candidate lines fit these points reasonably well]
Least Squares Principle
This method gives a best-fitting straight line by minimizing the sum of the squares of the vertical deviations about the line
Regression Coefficient Interpretations:
• $\hat{\beta}_0$: Y-intercept; estimated value of Y when X = 0
• $\hat{\beta}_1$: Slope of the line; average change in the predicted value of Y for each change of one unit in the independent variable X
Least Squares Estimates

$$\hat{\beta}_1 = \frac{S_{xy}}{S_{xx}}; \qquad \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$$

where

$$S_{xx} = \sum_i (x_i - \bar{x})^2, \qquad S_{xy} = \sum_i (x_i - \bar{x})(y_i - \bar{y})$$
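Applying these formulas to the example data, as a hand computation in R (the results match the fitted line on the next slide):

score <- c(4, 7, 3, 6, 10)
sales <- c(5000, 12000, 4000, 8000, 11000)
Sxx <- sum((score - mean(score))^2)                        # 30
Sxy <- sum((score - mean(score)) * (sales - mean(sales)))  # 34000
b1 <- Sxy / Sxx                       # slope: about 1133.33
b0 <- mean(sales) - b1 * mean(score)  # intercept: 1200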
Back to the Example
[Scatterplot of sales (y) vs. score (x) with the fitted least squares line superimposed]
y = 1133.33 x + 1199.99
simple.lm(score, sales)
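simple.lm appears to be one of Verzani's convenience functions (see the homework slide); assuming so, the base-R equivalent is:

fit <- lm(sales ~ score)  # least squares fit
plot(score, sales)        # scatterplot
abline(fit)               # superimpose the fitted line y = 1133.33x + 1200
coef(fit)                 # intercept and slope estimates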
Standard Error of Estimate
Note that not all the points lie on the regression line
If they all lay on the line, there would be no prediction error
We need a measure of how precise the prediction of Y based on X is
Standard Error of Estimate: measures the scatter of the observed values around the regression line

$$s_\varepsilon = \sqrt{\frac{\sum (y - \hat{y})^2}{n - 2}}$$
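A sketch of the same computation in R from the fitted model's residuals; it reproduces the value of 1955 reported in the regression output later:

fit <- lm(sales ~ score)
sqrt(sum(resid(fit)^2) / (length(sales) - 2))  # about 1955
summary(fit)$sigma                             # the value R itself reports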
Back to the Example
Affecting Parameter Estimation
High Leverage Points: points that have very high or very low values of the independent variable
• Carry great weight in the estimation of the slope
High Influence Points: high leverage points that also happen to correspond to y outliers
• Alter the slope and twist the line
Outliers: points that are in the range of the independent variable but vertically distant from the rest of the observations
• Inflate the y-intercept term
Most statistical packages provide automatic diagnostics
http://www.stat.sc.edu/~west/javahtml/Regression.html
Checking Model Assumptions (cont.)
Residuals vs. fitted: Look for trend in spread around y = 0
Normal plot: Check if residuals are normally distributed
Scale-Location: Tallest points are the largest residuals
Cook’s Distance: Identifies influential points
Checking Model Assumptions (cont.)
[Four diagnostic plots for the fitted model: Residuals vs Fitted, Normal Q-Q plot, Scale-Location plot, and Cook's distance plot; observations 2, 4, and 5 are flagged as notable]
Inferences for Estimates The slope, intercept, and residual standard deviation in a
simple regression model are estimates based on limited data
Just like all other statistical quantities, they are affected by random error
We can apply ideas of hypothesis tests and confidence intervals to the regression estimates
t Test of $H_0: \beta_1 = 0$

$H_0: \beta_1 = 0$
$H_A$: 1. $\beta_1 > 0$  2. $\beta_1 < 0$  3. $\beta_1 \neq 0$

Test Statistic:
$$t = \frac{\hat{\beta}_1 - 0}{s_\varepsilon / \sqrt{S_{xx}}}$$

Rejection Region: for n - 2 degrees of freedom and Type I error $\alpha$,
1. Reject $H_0$ if $t > t_\alpha$
2. Reject $H_0$ if $t < -t_\alpha$
3. Reject $H_0$ if $|t| > t_{\alpha/2}$
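Plugging in the example's numbers from the output on the next slide: $t = 1133.3 / (1955/\sqrt{30}) \approx 1133.3 / 356.9 \approx 3.18$, with n - 2 = 3 degrees of freedom.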
Back to the Example
Regression Analysis: Sales versus Score

> summary(fit)

Call:
lm(formula = y ~ x)

Residuals:
          1           2           3           4           5
-6.000e+02  -7.333e+02   2.558e-13   2.867e+03  -1.533e+03

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   1200.0     2313.2   0.519   0.6398
x             1133.3      356.9   3.175   0.0503 .
---
Signif. codes:  0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1

Residual standard error: 1955 on 3 degrees of freedom
Multiple R-Squared: 0.7707, Adjusted R-squared: 0.6942
F-statistic: 10.08 on 1 and 3 DF, p-value: 0.05028
Coefficient of Determination
R²: Percentage of the variation in the dependent variable that is explained
by the variation in the independent variable
R² takes on a value between 0 and 100%, inclusive
Used to assess “how good” the model is
Residual standard error: 1955 on 3 degrees of freedom
Multiple R-Squared: 0.7707, Adjusted R-squared: 0.6942
F-statistic: 10.08 on 1 and 3 DF, p-value: 0.05028
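For simple linear regression, R² is just the square of the correlation coefficient: here $r^2 = 0.8779^2 \approx 0.7707$, matching the output above.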
Example
Example
It is well known that the more beer you drink, the more your blood alcohol level (BAL) rises
However, how much it rises per additional beer is not clear
Calculate the correlation coefficient
Perform a regression analysis

Student  1     2     3     4     5     6     7     8     9     10
Beers    5     2     9     8     3     7     3     5     3     5
BAL      0.100 0.030 0.190 0.120 0.040 0.095 0.070 0.060 0.020 0.050
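A sketch of the requested analysis in R, entering the data from the table above:

beers <- c(5, 2, 9, 8, 3, 7, 3, 5, 3, 5)
bal <- c(0.100, 0.030, 0.190, 0.120, 0.040, 0.095, 0.070, 0.060, 0.020, 0.050)
cor(beers, bal)         # correlation coefficient
fit <- lm(bal ~ beers)  # regress blood alcohol level on number of beers
summary(fit)            # slope estimate, t test, and R-squared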
Homework
Hildebrand/Ott
• 11.10
• 11.11
• 11.12
• 11.26
• 11.44
• 11.45
• 11.46
• 11.47
• Read Chapter 12
Verzani