Page 1: Statistics & Data Analysis

Statistics & Data Analysis

Course Number: B01.1305
Course Section: 31
Meeting Time: Wednesday 6-8:50 pm

Regression and Correlation

Page 2: Statistics & Data Analysis


Class Outline

Review of last class

Analyzing bivariate data

Correlation Analysis

Regression Analysis

Discuss Midterm Exam

Page 3: Statistics & Data Analysis


Review of Last Class

The Research Hypothesis, or Alternative Hypothesis, is what the researcher is trying to prove

• Denoted: Ha

The Null Hypothesis is the denial of the research hypothesis. It is what we are trying to disprove

• Denoted: H0

Basic Logic
1. Assume that H0 is true
2. Calculate the value of the test statistic
3. If this value is highly unlikely, reject H0 and support Ha; else, fail to reject H0 due to lack of evidence
4. Calculate the p-value to determine the significance of the hypothesis test

We can use the sampling distribution to determine what values of the test statistic are sufficiently unlikely given the null hypothesis

Page 4: Statistics & Data Analysis


Rejection Region

For α = 0.05, reject H0: μ = 72 if the observed value of Ȳ is more than 1.645 σȲ above 72, i.e., if ȳ exceeds the cutoff 72 + 3.948 = 75.948

Equivalently, for α = 0.05, reject H0: μ = 72 if the computed z statistic is greater than 1.645

[Sketch of the sampling distribution of Ȳ under H0 with the α = 0.05 rejection region shaded to the right of the cutoff]

Page 5: Statistics & Data Analysis


Example: 1-Sample Proportion Test

In an experiment to test E.S.P., a subject in one room is asked to state the color of a card chosen from a well-shuffled deck of 50 cards by an individual in another room

It is unknown to the subject how many red or blue cards are in the deck

The subject identifies 32 cards correctly

Questions:
• Set up the testing hypotheses
• Calculate the test statistic
• Are the results significant at the 0.05 and/or 0.01 level of significance?
• Instead…interpret results from computer output

Page 6: Statistics & Data Analysis


Example: Computer Output

Test and CI for One Proportion

Test of p = 0.5 vs p > 0.5

                                              Exact
Sample    X    N  Sample p  95.0% Lower Bound  P-Value
1        32   50  0.640000          0.514231    0.032
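This output follows Minitab's one-proportion layout. As a cross-check (not part of the original slides), the same exact test can be run in R with the base binom.test function:

binom.test(x = 32, n = 50, p = 0.5, alternative = "greater")
# Exact one-sided test of H0: p = 0.5 vs Ha: p > 0.5
# Reports sample p = 0.64 and p-value ~ 0.032, matching the output above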

Page 7: Statistics & Data Analysis

Linear Regression and Correlation Methods

Chapter 11

Page 8: Statistics & Data Analysis


Chapter Goals

Introduction to Bivariate Data Analysis
• Introduction to Simple Linear Regression Analysis
• Introduction to Linear Correlation Analysis

Interpret scatterplots
Understand linear models and parameter estimation
Identify high-leverage observations
Understand the use of transformations
Perform hypothesis tests and confidence intervals for the linear model and correlation coefficient

Page 9: Statistics & Data Analysis


Motivating Example

Before a pharmaceutical sales rep can speak about a product to physicians, he must pass a written exam

An HR rep designed such a test in the hope of hiring the best possible reps to promote a drug in a high-potential area

To check the validity of the test as a predictor of weekly sales, he chose 5 experienced sales reps and piloted the test with each one

The test scores and weekly sales are given in the following table:

Page 10: Statistics & Data Analysis


Motivating Example (cont)

SALESPERSON   TEST SCORE   WEEKLY SALES
JOHN                   4         $5,000
BRENDA                 7        $12,000
GEORGE                 3         $4,000
HARRY                  6         $8,000
AMY                   10        $11,000

Page 11: Statistics & Data Analysis


Introduction to Bivariate Data

Up until now, we’ve focused on univariate data

Analyzing how two (or more) variables “relate” is very important to managers
• Prediction equations
• Estimate uncertainty around a prediction
• Identify unusual points
• Describe relationship between variables

Visualization
• Scatterplot

Page 12: Statistics & Data Analysis


Scatterplot

Do Test Score and Weekly Sales appear related?

[Scatterplot of weekly sales (vertical axis, 4000 to 12000) versus test score (horizontal axis, 3 to 10)]

Page 13: Statistics & Data Analysis


Correlation

Boomers' Little Secret Still Smokes Up the Closet (July 14, 2002)

…Parental cigarette smoking, past or current, appeared to have a stronger correlation to children's drug use than parental marijuana smoking, Dr. Kandel said. The researchers concluded that parents influence their children not according to a simple dichotomy — by smoking or not smoking — but by a range of attitudes and behaviors, perhaps including their style of discipline and level of parental involvement. Their own drug use was just one component among many…

A Bit of a Hedge to Balance the Market Seesaw (July 7, 2002)

…Some so-called market-neutral funds have had as many years of negative returns as positive ones. And some have a high correlation with the market's returns…

Page 14: Statistics & Data Analysis


Correlation Analysis

Statistical techniques used to measure the strength of the relationship between two variables

Correlation Coefficient: describes the strength of the relationship between two sets of variables
• Denoted r
• r assumes a value between –1 and +1
• r = -1 or r = +1 indicates a perfect correlation
• r = 0 indicates no relationship between the two sets of variables
• Direction of the relationship is given by the coefficient’s sign
• Strength of relationship does not depend on the direction
• r measures the LINEAR relationship ONLY
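For reference (the slide itself does not show it), r can be computed from the sums of squares defined later in this deck:

r = Sxy / √(Sxx · Syy)

where Sxx = Σ(xi − x̄)², Syy = Σ(yi − ȳ)², and Sxy = Σ(xi − x̄)(yi − ȳ)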

Against All Odds: Correlation

Page 15: Statistics & Data Analysis


Example Correlations

[Six example scatterplots with correlations r = -0.9, -0.73, -0.25, 0.34, 0.7, and 0.88]

Correlation Demo

Page 16: Statistics & Data Analysis


Scatterplot

r = 0.88

[Scatterplot of weekly sales versus test score, as before]

Page 17: Statistics & Data Analysis


Does Correlation Imply Causation??

Page 18: Statistics & Data Analysis


Correlation and Causation

Must be very careful in interpreting correlation coefficients

Just because two variables are highly correlated does not mean that one causes the other
• Ice cream sales and the number of shark attacks on swimmers are correlated
• The miracle of the "Swallows" of Capistrano takes place each year at the Mission San Juan Capistrano on March 19th and is accompanied by a large number of human births around the same time
• The number of cavities in elementary school children and vocabulary size have a strong positive correlation

To establish causation, a designed experiment must be run

CORRELATION DOES NOT IMPLY CAUSATION

Against All Odds: Causation

Page 19: Statistics & Data Analysis


Testing the Significance of the Correlation Coefficient

When calculating a correlation coefficient, an obvious question arises: Is the implied relationship statistically significant, or due to random chance?

We can perform a hypothesis test of whether there is significant evidence against the correlation coefficient being zero

H0: ρ = 0
HA: ρ ≠ 0

Page 20: Statistics & Data Analysis


Example

Correlations: Score, Sales

Pearson's product-moment correlation

data:  score and sales
t = 3.1751, df = 3, p-value = 0.05028
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 -0.01947053  0.99189749
sample estimates:
      cor
0.8778762

What can we determine about this correlation coefficient?
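This is the output of R's cor.test. A minimal sketch that reproduces it, assuming score and sales hold the five test scores and weekly sales from the motivating example:

score <- c(4, 7, 3, 6, 10)
sales <- c(5000, 12000, 4000, 8000, 11000)
cor.test(score, sales)  # Pearson by default; tests H0: true correlation = 0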

Page 21: Statistics & Data Analysis


The Regression Effect

Sir Francis Galton (1822-1911) studied a data set of 1,078 heights of fathers and sons. His data set is on the next page. Two lines are superimposed: The dashed line is a 45-degree line, shifted up by one inch since, on average, the sons were one inch taller than the fathers. The solid line is one that better fits the data set. (It’s the least squares line, to be described soon.)

Galton observed that tall fathers tend to have tall sons but the sons are not, on average, as tall as the fathers. Also, short fathers have short sons who, however, are not as short on average as their fathers. Galton called this effect “regression to the mean”.

In other words, the son’s height tends to be closer to the overall mean height than the father’s height was. Nowadays, the term “regression” is used more generally in statistics to refer to the process of fitting a line to data.

Page 22: Statistics & Data Analysis


The Regression Effect (cont)

Page 23: Statistics & Data Analysis


Regression Analysis

Simple Regression Analysis is predicting one variable from another
• Past data on relevant variables are used to create and evaluate a prediction equation

Variable being predicted is called the dependent variable

Variable used to make prediction is an independent variable

Page 24: Statistics & Data Analysis


Introduction to Regression

Predicting future values of a variable is a crucial management activity
• Future cash flows
• Needs for raw materials in a supply chain
• Future personnel or real estate needs

Explaining past variation is also an important activity
• Explain past variation in demand for services
• Impact of an advertising campaign or promotion

Against All Odds: Describing Relationships

Page 25: Statistics & Data Analysis


Introduction to Regression (cont.)

Prediction: Reference to future values
Explanation: Reference to current or past values

Simple Linear Regression: Single independent variable predicting a dependent variable
• Independent variable is typically something we can control
• Dependent variable is typically something that is linearly related to the value of the independent variable

ŷ = β̂0 + β̂1x

Page 26: Statistics & Data Analysis


Introduction to Regression (cont.)

Basic Idea: Fit a straight line that relates dependent variable (y) and independent variable (x)

Linearity Assumption: Slope of the equation does not change as x changes

Assuming linearity, we can write

y = β0 + β1x + ε

which says that Y is made up of a predictable part (due to X) and an unpredictable part

Coefficients are interpreted as the true, underlying intercept and slope

Page 27: Statistics & Data Analysis


Regression Assumptions

We start by assuming that for each value of X, the corresponding value of Y is random, and has a normal distribution.

Page 28: Statistics & Data Analysis


Formal Assumptions of Regression Analysis

1. The relation is in fact linear, so that E(εi) = 0;
2. The errors all have the same variance: Var(εi) = σε²;
3. The errors are independent of each other;
4. The errors are normally distributed: εi ~ N(0, σε²)

Page 29: Statistics & Data Analysis


Which Line?

There are many good-fitting lines through these points

Regression by Eye Applet

[Scatterplot of weekly sales versus test score]

Page 30: Statistics & Data Analysis


Least Squares Principle

This method gives a best-fitting straight line by minimizing the sum of the squares of the vertical deviations about the line

Regression Coefficient Interpretations:
• β0: Y-intercept; estimated value of Y when X = 0
• β1: Slope of the line; average change in the predicted value of Y for each one-unit change in the independent variable X

Page 31: Statistics & Data Analysis


Least Squares Estimates

β̂1 = Sxy / Sxx;   β̂0 = ȳ − β̂1x̄

where
Sxx = Σ(xi − x̄)²
Sxy = Σ(xi − x̄)(yi − ȳ)
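As a quick check of these formulas (a sketch, not from the slides), the estimates for the motivating example can be computed by hand in R, reusing score and sales as defined earlier:

Sxx <- sum((score - mean(score))^2)                        # 30
Sxy <- sum((score - mean(score)) * (sales - mean(sales)))  # 34000
b1 <- Sxy / Sxx                       # slope estimate: 1133.33
b0 <- mean(sales) - b1 * mean(score)  # intercept estimate: 1200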

Page 32: Statistics & Data Analysis


Back to the Example

[Scatterplot of sales versus score with the fitted least squares line superimposed]

y = 1133.33 x + 1199.99

simple.lm(score, sales)
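simple.lm appears to come from the simpleR/UsingR materials accompanying Verzani; in base R the equivalent fit is (a sketch, reusing score and sales from above):

fit <- lm(sales ~ score)  # least squares fit of sales on score
coef(fit)                 # intercept ~ 1200, slope ~ 1133.33
plot(score, sales)        # scatterplot of the data
abline(fit)               # superimpose the fitted line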

Page 33: Statistics & Data Analysis


Standard Error of Estimate

Note that not all the points lie on the regression line
If they were all on the line, there would be no prediction error
Need a measure to indicate how precise the prediction of Y is based on X

Standard Error of Estimate: Measures the scatter of the observed values around the regression line

sε = √( Σ(y − ŷ)² / (n − 2) )
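Continuing the base R sketch from the previous slide, this quantity can be computed directly from the residuals; it should match the "Residual standard error: 1955 on 3 degrees of freedom" reported in the summary output later:

n <- length(sales)
s_eps <- sqrt(sum(residuals(fit)^2) / (n - 2))  # ~ 1955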

Page 34: Statistics & Data Analysis


Back to the Example

Page 35: Statistics & Data Analysis


Affecting Parameter Estimation

High Leverage Points: Points that have very high or very low values of the independent variable
• Carry great weight in estimation of the slope

High Influence Points: High leverage points that also happen to correspond to y outliers
• Alter the slope and twist the line

Outliers: Points that are in the range of the independent variable but vertically distant from the rest of the observations
• Inflate the y-intercept term

Most statistical packages provide automatic diagnostics

http://www.stat.sc.edu/~west/javahtml/Regression.html

Page 36: Statistics & Data Analysis


Checking Model Assumptions (cont.)

Residuals vs. fitted: Look for trend in spread around y = 0

Normal plot: Check if residuals are normally distributed

Scale-Location: Tallest points are the largest residuals

Cook’s Distance: Identifies influential points
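In R, all four of these diagnostics come from plotting the fitted lm object; a sketch, reusing fit from the earlier slides (which = 1:4 selects exactly the four plots named above):

par(mfrow = c(2, 2))    # arrange a 2x2 grid
plot(fit, which = 1:4)  # residuals vs fitted, normal Q-Q, scale-location, Cook's distance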

Page 37: Statistics & Data Analysis


Checking Model Assumptions (cont.)

[Four diagnostic plots for the fitted model: Residuals vs Fitted, Normal Q-Q plot, Scale-Location plot, and Cook's distance plot, with the most extreme observations labeled in each panel]

Page 38: Statistics & Data Analysis


Inferences for Estimates

The slope, intercept, and residual standard deviation in a simple regression model are estimates based on limited data

Just like all other statistical quantities, they are affected by random error

We can apply ideas of hypothesis tests and confidence intervals to the regression estimates

Page 39: Statistics & Data Analysis


t Test of H0: β1 = 0

Hypotheses:
H0: β1 = 0
HA: 1. β1 > 0   2. β1 < 0   3. β1 ≠ 0

Test statistic:
t = (β̂1 − 0) / (sε / √Sxx)

Rejection region (for n − 2 degrees of freedom and Type I error α):
1. Reject H0 if t > tα
2. Reject H0 if t < −tα
3. Reject H0 if |t| > tα/2
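A sketch verifying this test by hand in R, continuing the earlier computations (b1, s_eps, Sxx, n); the numbers match the regression output on the next slide:

se_b1 <- s_eps / sqrt(Sxx)        # standard error of the slope, ~ 356.9
t_stat <- (b1 - 0) / se_b1        # t statistic, ~ 3.175
2 * pt(-abs(t_stat), df = n - 2)  # two-sided p-value, ~ 0.0503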

Page 40: Statistics & Data Analysis


Back to the Example

Regression Analysis: Sales versus Score

> summary(fit)

Call:
lm(formula = y ~ x)

Residuals:
         1          2          3          4          5
-6.000e+02 -7.333e+02  2.558e-13  2.867e+03 -1.533e+03

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   1200.0     2313.2   0.519   0.6398
x             1133.3      356.9   3.175   0.0503 .
---
Signif. codes:  0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1

Residual standard error: 1955 on 3 degrees of freedom
Multiple R-Squared: 0.7707, Adjusted R-squared: 0.6942
F-statistic: 10.08 on 1 and 3 DF, p-value: 0.05028

Page 41: Statistics & Data Analysis


Coefficient of Determination

R²: Percentage of the variation in the dependent variable that is explained by the variation in the independent variable

R² takes on a value between 0 and 100% inclusive

Used to assess “how good” the model is

Residual standard error: 1955 on 3 degrees of freedom
Multiple R-Squared: 0.7707, Adjusted R-squared: 0.6942
F-statistic: 10.08 on 1 and 3 DF, p-value: 0.05028

Page 42: Statistics & Data Analysis


Example

Page 43: Statistics & Data Analysis


Example

It is well known that the more beer you drink, the more your blood alcohol level (BAL) rises

However, the extent to which it rises per additional beer is not clear

Calculate the correlation coefficient
Perform a regression analysis

Student  1     2     3     4     5     6     7     8     9     10
Beers    5     2     9     8     3     7     3     5     3     5
BAL      0.100 0.030 0.190 0.120 0.040 0.095 0.070 0.060 0.020 0.050
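A minimal sketch of both steps in R (data keyed in from the table above):

beers <- c(5, 2, 9, 8, 3, 7, 3, 5, 3, 5)
bal <- c(0.100, 0.030, 0.190, 0.120, 0.040, 0.095, 0.070, 0.060, 0.020, 0.050)
cor(beers, bal)           # correlation coefficient
summary(lm(bal ~ beers))  # simple linear regression of BAL on number of beers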

Page 44: Statistics & Data Analysis


Homework

Hildebrand/Ott
• 11.10
• 11.11
• 11.12
• 11.26
• 11.44
• 11.45
• 11.46
• 11.47
• Read Chapter 12

Verzani
• Recommended