Cal State Northridge 427 Ainsworth

Correlation and Regression

Chapter 9 Correlation

Major Points - Correlation
- Questions answered by correlation
- Scatterplots
- An example
- The correlation coefficient
- Other kinds of correlations
- Factors affecting correlations
- Testing for significance

The Question
- Are two variables related?
  - Does one increase as the other increases? e.g. skills and income
  - Does one decrease as the other increases? e.g. health problems and nutrition
- How can we get a numerical measure of the degree of relationship?

Scatterplots
- AKA scatter diagram or scattergram.
- Graphically depicts the relationship between two variables in two-dimensional space.

Direct Relationship

[Figure: Scatterplot: Video Games and Alcohol Consumption. X axis: Average Hours of Video Games Per Week. Y axis: Average Number of Alcoholic Drinks Per Week. Correlation for the plotted data: r = 0.74.]

Inverse Relationship

[Figure: Scatterplot: Video Games and Test Score. Y axis: Exam Score. Correlation for the plotted data: r = -0.64.]

An Example
- Does smoking cigarettes increase systolic blood pressure?
- Plotting number of cigarettes smoked per day against systolic blood pressure
  - Fairly moderate relationship
  - Relationship is positive

Trend?

Smoking and BP
- Note relationship is moderate, but real.
- Why do we care about relationship?
  - What would we conclude if there were no relationship?
  - What if the relationship were near perfect?
  - What if the relationship were negative?

Heart Disease and Cigarettes
- Data on heart disease and cigarette smoking in 21 developed countries (Landwehr and Watkins, 1987)
- Data have been rounded for computational convenience.
  - The results were not affected.

The Data
Surprisingly, the U.S. is the first country on the list: the country with the highest consumption and highest mortality.

Country  Cigarettes  CHD  (X - Xbar)  (Y - Ybar)  (X - Xbar)(Y - Ybar)
1        11          26     5.05       11.48        57.97
2         9          21     3.05        6.48        19.76
3         9          24     3.05        9.48        28.91
4         9          21     3.05        6.48        19.76
5         8          19     2.05        4.48         9.18
6         8          13     2.05       -1.52        -3.12
7         8          19     2.05        4.48         9.18
8         6          11     0.05       -3.52        -0.18
9         6          23     0.05        8.48         0.42
10        5          15    -0.95        0.48        -0.46
11        5          13    -0.95       -1.52         1.44
12        5           4    -0.95      -10.52         9.99
13        5          18    -0.95        3.48        -3.31
14        5          12    -0.95       -2.52         2.39
15        5           3    -0.95      -11.52        10.94
16        4          11    -1.95       -3.52         6.86
17        4          15    -1.95        0.48        -0.94
18        4           6    -1.95       -8.52        16.61
19        3          13    -2.95       -1.52         4.48
20        3           4    -2.95      -10.52        31.03
21        3          14    -2.95       -0.52         1.53
Mean      5.95       14.52
SD        2.33        6.69
Sum                                                222.44

Scatterplot of Heart Disease
- CHD Mortality goes on ordinate (Y axis)
  - Why?
- Cigarette consumption on abscissa (X axis)
  - Why?
- What does each dot represent?
- Best fitting line included for clarity

{X = 6, Y = 11}

What Does the Scatterplot Show?
- As smoking increases, so does coronary heart disease mortality.
- Relationship looks strong
- Not all data points on line.
  - This gives us residuals or errors of prediction
  - To be discussed later

Correlation
- Co-relation
- The relationship between two variables
- Measured with a correlation coefficient
- Most popularly seen correlation coefficient: Pearson Product-Moment Correlation

Types of Correlation
- Positive correlation
  - High values of X tend to be associated with high values of Y.
  - As X increases, Y increases
- Negative correlation
  - High values of X tend to be associated with low values of Y.
  - As X increases, Y decreases
- No correlation
  - No consistent tendency for values on Y to increase or decrease as X increases

Correlation Coefficient
- A measure of degree of relationship.
- Between 1 and -1
- Sign refers to direction.
- Based on covariance
  - Measure of degree to which large scores on X go with large scores on Y, and small scores on X go with small scores on Y
  - Think of it as variance, but with 2 variables instead of 1 (What does that mean??)

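Covariance
- Remember that variance is:

$$s_X^2 = \frac{\sum (X - \bar{X})^2}{N - 1}$$

- The formula for co-variance is:

$$cov_{XY} = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{N - 1}$$

- How this works, and why?
- When would covXY be large and positive? Large and negative?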
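As a rough illustration of the covariance formula (this sketch is not part of the original slides), the Python below computes it for the cigarette and CHD data. The names `cigs`, `chd`, and `covariance` are illustrative choices.

```python
# CHD data from the slides: cigarettes per adult per day (X) and
# CHD mortality per 10,000 (Y) for the 21 countries.
cigs = [11, 9, 9, 9, 8, 8, 8, 6, 6, 5, 5, 5, 5, 5, 5, 4, 4, 4, 3, 3, 3]
chd = [26, 21, 24, 21, 19, 13, 19, 11, 23, 15, 13, 4, 18, 12, 3, 11, 15, 6, 13, 4, 14]


def covariance(xs, ys):
    """Sample covariance: sum of cross-products of deviations over N - 1."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cross_products = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return cross_products / (n - 1)


cov_xy = covariance(cigs, chd)  # covariance of X with Y
var_x = covariance(cigs, cigs)  # variance is the covariance of X with itself
print(f"cov_XY = {cov_xy:.2f}, var_X = {var_x:.2f}")
```

With unrounded deviations the covariance comes out near 11.13 rather than the 11.12 on the slides, which work from deviations rounded to two decimals. The covariance is large and positive when above-average X scores tend to pair with above-average Y scores.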


Example

Country  X (Cig.)  Y (CHD)  (X - Xbar)  (Y - Ybar)  (X - Xbar)(Y - Ybar)
1        11        26         5.05       11.48        57.97
2         9        21         3.05        6.48        19.76
3         9        24         3.05        9.48        28.91
4         9        21         3.05        6.48        19.76
5         8        19         2.05        4.48         9.18
6         8        13         2.05       -1.52        -3.12
7         8        19         2.05        4.48         9.18
8         6        11         0.05       -3.52        -0.18
9         6        23         0.05        8.48         0.42
10        5        15        -0.95        0.48        -0.46
11        5        13        -0.95       -1.52         1.44
12        5         4        -0.95      -10.52         9.99
13        5        18        -0.95        3.48        -3.31
14        5        12        -0.95       -2.52         2.39
15        5         3        -0.95      -11.52        10.94
16        4        11        -1.95       -3.52         6.86
17        4        15        -1.95        0.48        -0.94
18        4         6        -1.95       -8.52        16.61
19        3        13        -2.95       -1.52         4.48
20        3         4        -2.95      -10.52        31.03
21        3        14        -2.95       -0.52         1.53
Mean      5.95     14.52
SD        2.33      6.69
Sum                                                  222.44

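Example

$$cov_{XY} = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{N - 1} = \frac{222.44}{21 - 1} = 11.12$$

- What the heck is a covariance?
- I thought we were talking about correlation?

Correlation Coefficient
- Pearson's Product Moment Correlation
- Symbolized by r
- Covariance ÷ (product of the 2 SDs)

$$r = \frac{cov_{XY}}{s_X s_Y}$$

- Correlation is a standardized covariance

Calculation for Example
- CovXY = 11.12
- sX = 2.33
- sY = 6.69

$$r = \frac{cov_{XY}}{s_X s_Y} = \frac{11.12}{(2.33)(6.69)} = \frac{11.12}{15.59} = .713$$

Example
- Correlation = .713
- Sign is positive
  - Why?
- If sign were negative
  - What would it mean?
  - Would not alter the degree of relationship.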
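The idea that correlation is a standardized covariance can be sketched in Python as follows. This is an illustrative addition rather than the slides' own code, and the function name `pearson_r` is an assumption.

```python
from math import sqrt

cigs = [11, 9, 9, 9, 8, 8, 8, 6, 6, 5, 5, 5, 5, 5, 5, 4, 4, 4, 3, 3, 3]
chd = [26, 21, 24, 21, 19, 13, 19, 11, 23, 15, 13, 4, 18, 12, 3, 11, 15, 6, 13, 4, 14]


def pearson_r(xs, ys):
    """Pearson r: covariance divided by the product of the two SDs."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)
    sd_x = sqrt(sum((x - mean_x) ** 2 for x in xs) / (n - 1))
    sd_y = sqrt(sum((y - mean_y) ** 2 for y in ys) / (n - 1))
    return cov / (sd_x * sd_y)


r = pearson_r(cigs, chd)
# Reversing the direction of Y flips the sign but not the size of r.
r_flipped = pearson_r(cigs, [-y for y in chd])
print(f"r = {r:.3f}, r with Y reversed = {r_flipped:.3f}")
```

Dividing by the two standard deviations removes the units of measurement, which is why r always falls between -1 and 1 while a covariance can take any value.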

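Other calculations
- Z-score method

$$r = \frac{\sum z_X z_Y}{N - 1}$$

- Computational (Raw Score) Method

$$r = \frac{N \sum XY - \sum X \sum Y}{\sqrt{[N \sum X^2 - (\sum X)^2][N \sum Y^2 - (\sum Y)^2]}}$$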
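A short Python sketch, assuming the standard textbook forms of the z-score and raw-score formulas, shows that both routes give the same r for the CHD data. The function names are illustrative.

```python
from math import sqrt

cigs = [11, 9, 9, 9, 8, 8, 8, 6, 6, 5, 5, 5, 5, 5, 5, 4, 4, 4, 3, 3, 3]
chd = [26, 21, 24, 21, 19, 13, 19, 11, 23, 15, 13, 4, 18, 12, 3, 11, 15, 6, 13, 4, 14]


def r_zscore(xs, ys):
    """Z-score method: sum of z-score products over N - 1."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sd_x = sqrt(sum((x - mean_x) ** 2 for x in xs) / (n - 1))
    sd_y = sqrt(sum((y - mean_y) ** 2 for y in ys) / (n - 1))
    products = sum(((x - mean_x) / sd_x) * ((y - mean_y) / sd_y)
                   for x, y in zip(xs, ys))
    return products / (n - 1)


def r_raw(xs, ys):
    """Raw-score method: uses only sums of X, Y, XY, X^2 and Y^2."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator


r_from_z = r_zscore(cigs, chd)
r_from_raw = r_raw(cigs, chd)
print(f"z-score method: {r_from_z:.3f}, raw-score method: {r_from_raw:.3f}")
```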

Other Kinds of Correlation
- Spearman Rank-Order Correlation Coefficient (rsp)
  - used with 2 ranked/ordinal variables
  - uses the same Pearson formula

Attractiveness  Symmetry
3               2
4               6
1               1
2               3
5               4
6               5
rsp = 0.77

Other Kinds of Correlation
- Point biserial correlation coefficient (rpb)
  - used with one continuous scale and one dichotomous (two-category nominal or ordinal) scale.
  - uses the same Pearson formula

Attractiveness  Date?
3               0
4               0
1               1
2               1
5               1
6               0
rpb = -0.49

Other Kinds of Correlation
- Phi coefficient (φ)
  - used with two dichotomous scales.
  - uses the same Pearson formula

Attractiveness  Date?
0               0
1               0
1               1
1               1
0               0
1               1
φ = 0.71
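Since all three coefficients use the same Pearson formula, one function reproduces the three values from the small attractiveness tables. The sketch below is illustrative; the variable names are assumptions, and the data are the six cases shown on each slide.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson r from sums of squares; the (N - 1) terms cancel."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sp = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    return sp / sqrt(ss_x * ss_y)


# Spearman: both variables are ranks.
attract_rank = [3, 4, 1, 2, 5, 6]
symmetry_rank = [2, 6, 1, 3, 4, 5]
r_sp = pearson_r(attract_rank, symmetry_rank)

# Point biserial: one continuous variable, one dichotomy (date: 0 = no, 1 = yes).
attract = [3, 4, 1, 2, 5, 6]
date = [0, 0, 1, 1, 1, 0]
r_pb = pearson_r(attract, date)

# Phi: both variables are dichotomies.
attract_01 = [0, 1, 1, 1, 0, 1]
date_01 = [0, 0, 1, 1, 0, 1]
phi = pearson_r(attract_01, date_01)

print(f"r_sp = {r_sp:.2f}, r_pb = {r_pb:.2f}, phi = {phi:.2f}")
```

The different names describe the kind of data going in, not a different computation.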

Factors Affecting r
- Range restrictions
  - Looking at only a small portion of the total scatter plot (looking at a smaller portion of the scores' variability) decreases r.
  - Reducing variability reduces r
- Nonlinearity
  - The Pearson r (and its relatives) measure the degree of linear relationship between two variables
  - If a strong non-linear relationship exists, r will provide a low, or at least inaccurate measure of the true relationship.

Factors Affecting r
- Heterogeneous subsamples
  - Everyday examples (e.g. height and weight using both men and women)
- Outliers
  - Overestimate Correlation
  - Underestimate Correlation

Countries With Low Consumptions

Truncation*
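To make the effect of truncation concrete, the following Python sketch recomputes r using only low-consumption countries. The cutoff of 5 cigarettes per day is an assumption chosen for illustration; the slides do not state which countries the figure includes.

```python
from math import sqrt

cigs = [11, 9, 9, 9, 8, 8, 8, 6, 6, 5, 5, 5, 5, 5, 5, 4, 4, 4, 3, 3, 3]
chd = [26, 21, 24, 21, 19, 13, 19, 11, 23, 15, 13, 4, 18, 12, 3, 11, 15, 6, 13, 4, 14]


def pearson_r(xs, ys):
    """Pearson r from sums of squares; the (N - 1) terms cancel."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sp = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    return sp / sqrt(ss_x * ss_y)


r_full = pearson_r(cigs, chd)

# Hypothetical cutoff: keep only countries averaging 5 or fewer cigarettes.
low = [(x, y) for x, y in zip(cigs, chd) if x <= 5]
r_low = pearson_r([x for x, _ in low], [y for _, y in low])

print(f"all 21 countries: r = {r_full:.2f}")
print(f"{len(low)} low-consumption countries: r = {r_low:.2f}")
```

With the range of X restricted this way, the correlation falls from about .71 to nearly zero, even though the underlying relationship has not changed.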

Non-linearity

Heterogeneous samples

Outliers

Testing Correlations
- So you have a correlation. Now what?
- In terms of magnitude, how big is big?
  - Small correlations in large samples are big.
  - Large correlations in small samples aren't always big.
- Depends upon the magnitude of the correlation coefficient AND the size of your sample.

Testing r
- Population parameter = ρ
- Null hypothesis H0: ρ = 0
  - Test of linear independence
  - What would a true null mean here?
  - What would a false null mean here?
- Alternative hypothesis (H1): ρ ≠ 0
  - Two-tailed

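Tables of Significance
- We can convert r to t and test for significance:

$$t = \frac{r\sqrt{N - 2}}{\sqrt{1 - r^2}}$$

- Where DF = N - 2

Tables of Significance
- In our example r was .71
- N - 2 = 21 - 2 = 19

$$t = \frac{.71\sqrt{19}}{\sqrt{1 - .71^2}} = \frac{3.09}{.704} = 4.39$$

- T-crit (19) = 2.09
- Since 4.39 is larger than 2.09, reject ρ = 0.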
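The conversion of r to t can be sketched in Python as below, assuming the standard formula t = r√(N - 2)/√(1 - r²). The critical value 2.09 is taken from the slides rather than computed, and the names are illustrative.

```python
from math import sqrt


def t_from_r(r, n):
    """Convert a correlation to a t statistic with n - 2 degrees of freedom."""
    return r * sqrt(n - 2) / sqrt(1 - r ** 2)


r = 0.71
n = 21
t = t_from_r(r, n)
df = n - 2

T_CRIT_19 = 2.09  # two-tailed critical t for df = 19, as given on the slide
significant = abs(t) > T_CRIT_19

print(f"t({df}) = {t:.2f}, reject H0: {significant}")
```

For r = .71 and N = 21 the formula evaluates to about 4.39, well beyond the critical value. The same function shows why sample size matters: `t_from_r(0.71, 5)` is only about 1.75.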

Computer Printout
- Printout gives test of significance.

Regression


What is regression?
- How do we predict one variable from another?
- How does one variable change as the other changes?
- Influence

Linear Regression
- A technique we use to predict the most likely score on one variable from those on another variable
- Uses the nature of the relationship (i.e. correlation) between two variables to enhance your prediction

Linear Regression: Parts
- Y - the variable you are predicting
  - i.e. dependent variable
- X - the variable you are using to predict
  - i.e. independent variable
- Ŷ - your predictions (also known as Y')

Why Do We Care?
- We may want to make a prediction.
- More likely, we want to understand the relationship.
  - How fast does CHD mortality rise with a one unit increase in smoking?
  - Note: we speak about predicting, but often don't actually predict.

An Example
- Cigarettes and CHD Mortality again
- Data repeated on next slide
- We want to predict level of CHD mortality in a country averaging 10 cigarettes per day.

The Data
- Based on the data we have, what would we predict the rate of CHD to be in a country that smoked 10 cigarettes on average?
- First, we need to establish a prediction of CHD from smoking

Country  Cigarettes  CHD  (X - Xbar)  (Y - Ybar)  (X - Xbar)(Y - Ybar)
1        11          26     5.05       11.48        57.97
2         9          21     3.05        6.48        19.76
3         9          24     3.05        9.48        28.91
4         9          21     3.05        6.48        19.76
5         8          19     2.05        4.48         9.18
6         8          13     2.05       -1.52        -3.12
7         8          19     2.05        4.48         9.18
8         6          11     0.05       -3.52        -0.18
9         6          23     0.05        8.48         0.42
10        5          15    -0.95        0.48        -0.46
11        5          13    -0.95       -1.52         1.44
12        5           4    -0.95      -10.52         9.99
13        5          18    -0.95        3.48        -3.31
14        5          12    -0.95       -2.52         2.39
15        5           3    -0.95      -11.52        10.94
16        4          11    -1.95       -3.52         6.86
17        4          15    -1.95        0.48        -0.94
18        4           6    -1.95       -8.52        16.61
19        3          13    -2.95       -1.52         4.48
20        3           4    -2.95      -10.52        31.03
21        3          14    -2.95       -0.52         1.53
Mean      5.95       14.52
SD        2.33        6.69
Sum                                                222.44

[Figure: scatterplot with the Regression Line. For a country that smokes 6 C/A/D, we predict a CHD rate of about 14.]

Regression Line
- Formula

$$\hat{Y} = bX + a$$

- Ŷ = the predicted value of Y (e.g. CHD mortality)
- X = the predictor variable (e.g. average cig./adult/country)

Regression Coefficients
- Coefficients are a and b
- b = slope
  - Change in predicted Y for one unit change in X
- a = intercept
  - value of Ŷ when X = 0

Calculation
- Slope

$$b = \frac{cov_{XY}}{s_X^2}$$

- Intercept

$$a = \bar{Y} - b\bar{X}$$

For Our Data
- CovXY = 11.12
- s²X = 2.334² = 5.447
- b = 11.12/5.447 = 2.042
- a = 14.524 - 2.042*5.952 = 2.37
- See SPSS printout on next slide
- Answers are not exact due to rounding error and desire to match SPSS.
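The slope and intercept calculation can be sketched in Python from the raw data. This is an illustrative addition; `fit_line` is a hypothetical helper name, and it uses b = covXY / s²X and a = Ybar - b*Xbar.

```python
cigs = [11, 9, 9, 9, 8, 8, 8, 6, 6, 5, 5, 5, 5, 5, 5, 4, 4, 4, 3, 3, 3]
chd = [26, 21, 24, 21, 19, 13, 19, 11, 23, 15, 13, 4, 18, 12, 3, 11, 15, 6, 13, 4, 14]


def fit_line(xs, ys):
    """Return slope b and intercept a of the line Y-hat = b*X + a."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)
    var_x = sum((x - mean_x) ** 2 for x in xs) / (n - 1)
    b = cov_xy / var_x           # slope: covariance over variance of X
    a = mean_y - b * mean_x      # intercept: line passes through the means
    return b, a


b, a = fit_line(cigs, chd)
print(f"b = {b:.3f}, a = {a:.3f}")
```

Working from unrounded values gives a slope of about 2.042 and an intercept of about 2.367, so each additional cigarette per adult per day goes with roughly two more CHD deaths per 10,000.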

SPSS Printout

Note:
- The values we obtained are shown on printout.
- The intercept is the value in the B column labeled constant
- The slope is the value in the B column labeled by name of predictor variable.

Making a Prediction
- Second, once we know the relationship we can predict

$$\hat{Y} = bX + a = 2.04(10) + 2.37 = 22.77$$

- We predict 22.77 people/10,000 in a country with an average of 10 C/A/D will die of CHD

Accuracy of Prediction
- Finnish smokers smoke 6 C/A/D
- We predict:

$$\hat{Y} = b(6) + a = 14.619$$

- They actually have 23 deaths/10,000
- Our error (residual) = 23 - 14.619 = 8.38
  - a large error
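A minimal Python sketch of the prediction and its residual follows. It refits the line from the raw data, so the results differ slightly in the second decimal from the slides, which use rounded coefficients. The `predict` helper is an illustrative name.

```python
cigs = [11, 9, 9, 9, 8, 8, 8, 6, 6, 5, 5, 5, 5, 5, 5, 4, 4, 4, 3, 3, 3]
chd = [26, 21, 24, 21, 19, 13, 19, 11, 23, 15, 13, 4, 18, 12, 3, 11, 15, 6, 13, 4, 14]

n = len(cigs)
mean_x, mean_y = sum(cigs) / n, sum(chd) / n
b = (sum((x - mean_x) * (y - mean_y) for x, y in zip(cigs, chd))
     / sum((x - mean_x) ** 2 for x in cigs))
a = mean_y - b * mean_x


def predict(x):
    """Predicted CHD mortality per 10,000 for x cigarettes/adult/day."""
    return b * x + a


pred_10 = predict(10)            # the country averaging 10 C/A/D
pred_6 = predict(6)              # the slide's example with 6 C/A/D
residual = 23 - pred_6           # observed minus predicted

print(f"X = 10: predicted {pred_10:.2f}")
print(f"X = 6: predicted {pred_6:.2f}, residual {residual:.2f}")
```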


[Figure: scatterplot with regression line, marking the Prediction and the Residual. X axis: Cigarette Consumption per Adult per Day. Y axis: CHD Mortality per 10,000.]


Residuals
- When we predict Ŷ for a given X, we will sometimes be in error.
- Y - Ŷ for any X is an error of estimate
- Also known as: a residual
- We want Σ(Y - Ŷ) to be as small as possible.
- BUT, there are infinitely many lines that can do this.
- Just draw ANY line that goes through the mean of the X and Y values.
- Minimize Errors of Estimate... How?

Minimizing Residuals
- Again, the problem lies with this definition of the mean:

$$\sum (X - \bar{X}) = 0$$

- So, how do we get rid of the 0's?
- Square them.

Regression Line: A Mathematical Definition
- The regression line is the line which when drawn through your data set produces the smallest value of:

$$\sum (Y - \hat{Y})^2$$

- Called the Sum of Squared Residual or SSresidual
- Regression line is also called a least squares line.
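The argument above can be checked numerically. In this illustrative Python sketch, every line through the point of means has residuals summing to zero, but only the least squares slope minimizes the sum of squared residuals. The alternative slopes 0, 1, and 3 are arbitrary choices for comparison.

```python
cigs = [11, 9, 9, 9, 8, 8, 8, 6, 6, 5, 5, 5, 5, 5, 5, 4, 4, 4, 3, 3, 3]
chd = [26, 21, 24, 21, 19, 13, 19, 11, 23, 15, 13, 4, 18, 12, 3, 11, 15, 6, 13, 4, 14]

n = len(cigs)
mean_x, mean_y = sum(cigs) / n, sum(chd) / n


def residuals(slope):
    """Residuals Y - Y-hat for the line with this slope through the means."""
    intercept = mean_y - slope * mean_x
    return [y - (slope * x + intercept) for x, y in zip(cigs, chd)]


def sum_resid(slope):
    return sum(residuals(slope))


def ss_resid(slope):
    return sum(e ** 2 for e in residuals(slope))


# Least squares slope: sum of cross-products over sum of squares of X.
b = (sum((x - mean_x) * (y - mean_y) for x, y in zip(cigs, chd))
     / sum((x - mean_x) ** 2 for x in cigs))

for slope in (0.0, 1.0, b, 3.0):
    print(f"slope {slope:.3f}: sum = {sum_resid(slope):.4f}, "
          f"SS = {ss_resid(slope):.2f}")
```

The plain sums are all zero apart from floating-point noise, so they cannot distinguish the lines; the squared sums can, and the smallest value (about 440.76) belongs to the least squares slope.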


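Summarizing Errors of Prediction
- Residual variance
  - The variability of the observed values around the predicted values

$$s_{Y - \hat{Y}}^2 = \frac{\sum (Y - \hat{Y})^2}{N - 2} = \frac{SS_{residual}}{N - 2}$$

Standard Error of Estimate
- Standard error of estimate
  - The standard deviation of the observed values around the predicted values

$$s_{Y - \hat{Y}} = \sqrt{\frac{\sum (Y - \hat{Y})^2}{N - 2}} = \sqrt{\frac{SS_{residual}}{N - 2}}$$

- A common measure of the accuracy of our predictions
  - We want it to be as small as possible.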
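As an illustrative sketch, assuming the usual N - 2 denominator for a single predictor, the residual variance and standard error of estimate for the CHD data can be computed as follows. The variable names are assumptions.

```python
from math import sqrt

cigs = [11, 9, 9, 9, 8, 8, 8, 6, 6, 5, 5, 5, 5, 5, 5, 4, 4, 4, 3, 3, 3]
chd = [26, 21, 24, 21, 19, 13, 19, 11, 23, 15, 13, 4, 18, 12, 3, 11, 15, 6, 13, 4, 14]

n = len(cigs)
mean_x, mean_y = sum(cigs) / n, sum(chd) / n
b = (sum((x - mean_x) * (y - mean_y) for x, y in zip(cigs, chd))
     / sum((x - mean_x) ** 2 for x in cigs))
a = mean_y - b * mean_x

# Sum of squared residuals around the regression line.
ss_residual = sum((y - (b * x + a)) ** 2 for x, y in zip(cigs, chd))

residual_variance = ss_residual / (n - 2)     # two estimated coefficients
std_error_estimate = sqrt(residual_variance)

print(f"residual variance = {residual_variance:.3f}")
print(f"standard error of estimate = {std_error_estimate:.3f}")
```

A standard error of estimate of about 4.8 deaths per 10,000 means predictions from this line are typically off by about that much.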

Example

Country  X (Cig.)  Y (CHD)  Y'      (Y - Y')  (Y - Y')^2  (Y' - Ybar)^2  (Y - Ybar)^2
1        11        26       24.829   1.171      1.371      106.193        131.699
2         9        21       20.745   0.255      0.065       38.701         41.939
3         9        24       20.745   3.255     10.595       38.701         89.795
4         9        21       20.745   0.255      0.065       38.701         41.939
5         8        19       18.703   0.297      0.088       17.464         20.035
6         8        13       18.703  -5.703     32.524       17.464          2.323
7         8        19       18.703   0.297      0.088       17.464         20.035
8         6        11       14.619  -3.619     13.097        0.009         12.419
9         6        23       14.619   8.381     70.241        0.009         71.843
10        5        15       12.577   2.423      5.871        3.791          0.227
11        5        13       12.577   0.423      0.179        3.791          2.323
12        5         4       12.577  -8.577     73.565        3.791        110.755
13        5        18       12.577   5.423     29.409        3.791         12.083
14        5        12       12.577  -0.577      0.333        3.791          6.371
15        5         3       12.577  -9.577     91.719        3.791        132.803
16        4        11       10.535   0.465      0.216       15.912         12.419
17        4        15       10.535   4.465     19.936       15.912          0.227
18        4         6       10.535  -4.535     20.566       15.912         72.659
19        3        13        8.493   4.507     20.313       36.373          2.323
20        3         4        8.493  -4.493     20.187       36.373        110.755
21        3        14        8.493   5.507     30.327       36.373          0.275
Mean      5.952    14.524
SD        2.334     6.690
Sum                                  0.04     440.757      454.307        895.247

440.757 + 454.307 = 895.06

Y' = (2.04*X) + 2.37

Regression and Z Scores
- When your data are standardized (linearly transformed to z-scores), the slope of the regression line is called β
- DO NOT confuse this β with the β associated with type II errors. They're different.
- When we have one predictor, r = β
- Ẑy = βZx, since a now equals 0
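The claim that the standardized slope equals r can be verified with a short Python sketch. This is illustrative only; `z_scores` and `fit_line` are hypothetical helper names.

```python
from math import sqrt

cigs = [11, 9, 9, 9, 8, 8, 8, 6, 6, 5, 5, 5, 5, 5, 5, 4, 4, 4, 3, 3, 3]
chd = [26, 21, 24, 21, 19, 13, 19, 11, 23, 15, 13, 4, 18, 12, 3, 11, 15, 6, 13, 4, 14]


def z_scores(values):
    """Standardize values to mean 0 and SD 1 (SD uses N - 1)."""
    n = len(values)
    mean = sum(values) / n
    sd = sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
    return [(v - mean) / sd for v in values]


def fit_line(xs, ys):
    """Least squares slope and intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return slope, mean_y - slope * mean_x


zx, zy = z_scores(cigs), z_scores(chd)
beta, intercept = fit_line(zx, zy)

# Pearson r by the z-score method, for comparison.
r = sum(u * v for u, v in zip(zx, zy)) / (len(zx) - 1)

print(f"beta = {beta:.3f}, intercept = {intercept:.3f}, r = {r:.3f}")
```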

Partitioning Variability
- Sums of square deviations
  - Total

$$SS_{total} = \sum (Y - \bar{Y})^2$$

  - Regression

$$SS_{regression} = \sum (\hat{Y} - \bar{Y})^2$$

  - Residual (we already covered)

$$SS_{residual} = \sum (Y - \hat{Y})^2$$

- SStotal = SSregression + SSresidual

Partitioning Variability
- Degrees of freedom
  - Total: dftotal = N - 1
  - Regression: dfregression = number of predictors
  - Residual: dfresidual = dftotal - dfregression
  - dftotal = dfregression + dfresidual

Partitioning Variability
- Variance (or Mean Square)
  - Total Variance: s²total = SStotal / dftotal
  - Regression Variance: s²regression = SSregression / dfregression
  - Residual Variance: s²residual = SSresidual / dfresidual
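The partition of variability can be sketched in Python for the CHD data. This illustrative block computes the three sums of squares, their degrees of freedom, and the mean squares; the variable names are assumptions.

```python
cigs = [11, 9, 9, 9, 8, 8, 8, 6, 6, 5, 5, 5, 5, 5, 5, 4, 4, 4, 3, 3, 3]
chd = [26, 21, 24, 21, 19, 13, 19, 11, 23, 15, 13, 4, 18, 12, 3, 11, 15, 6, 13, 4, 14]

n = len(cigs)
mean_x, mean_y = sum(cigs) / n, sum(chd) / n
b = (sum((x - mean_x) * (y - mean_y) for x, y in zip(cigs, chd))
     / sum((x - mean_x) ** 2 for x in cigs))
a = mean_y - b * mean_x
predicted = [b * x + a for x in cigs]

# Sums of squared deviations.
ss_total = sum((y - mean_y) ** 2 for y in chd)
ss_regression = sum((p - mean_y) ** 2 for p in predicted)
ss_residual = sum((y - p) ** 2 for y, p in zip(chd, predicted))

# Degrees of freedom (one predictor).
df_total = n - 1
df_regression = 1
df_residual = df_total - df_regression

# Variances (mean squares).
ms_total = ss_total / df_total
ms_regression = ss_regression / df_regression
ms_residual = ss_residual / df_residual

print(f"SS: {ss_total:.2f} = {ss_regression:.2f} + {ss_residual:.2f}")
print(f"df: {df_total} = {df_regression} + {df_residual}")
print(f"MS: total {ms_total:.2f}, regression {ms_regression:.2f}, "
      f"residual {ms_residual:.2f}")
```

The unrounded regression sum of squares is about 454.48; tables built from predicted values rounded to three decimals will show a slightly smaller figure.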

Example
Column sums from the table of predicted values, where Y' = (2.04*X) + 2.37:
- Σ(Y - Y') = 0.04
- SSresidual = Σ(Y - Y')² = 440.757
- SSregression = Σ(Y' - Ybar)² = 454.307
- SStotal = Σ(Y - Ybar)² = 895.247

Coefficient of Determination
- It is a measure of the percent of predictable variability
- The percentage of the total variability in Y explained by X

r² for our example
- r = .713
- r² = .713² = .508
- or

$$r^2 = \frac{SS_{regression}}{SS_{total}}$$

- Approximately 50% in variability of incidence of CHD mortality is associated with variability in smoking.

Coefficient of Alienation
- It is defined as 1 - r² or

$$1 - r^2 = \frac{SS_{residual}}{SS_{total}}$$

- Example: 1 - .508 = .492

r², SS and the standard error of estimate
- r² * SStotal = SSregression
- (1 - r²) * SStotal = SSresidual
- We can also use r² to calculate the standard error of estimate as:

$$s_{Y - \hat{Y}} = \sqrt{\frac{(1 - r^2)\,SS_{total}}{N - 2}}$$
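These relationships between r², the sums of squares, and the standard error of estimate can be sketched in Python. This is an illustrative block with assumed variable names; it starts from r alone and recovers the same quantities the regression produces.

```python
from math import sqrt

cigs = [11, 9, 9, 9, 8, 8, 8, 6, 6, 5, 5, 5, 5, 5, 5, 4, 4, 4, 3, 3, 3]
chd = [26, 21, 24, 21, 19, 13, 19, 11, 23, 15, 13, 4, 18, 12, 3, 11, 15, 6, 13, 4, 14]

n = len(cigs)
mean_x, mean_y = sum(cigs) / n, sum(chd) / n
sp = sum((x - mean_x) * (y - mean_y) for x, y in zip(cigs, chd))
ss_x = sum((x - mean_x) ** 2 for x in cigs)
ss_total = sum((y - mean_y) ** 2 for y in chd)

r = sp / sqrt(ss_x * ss_total)
r_squared = r ** 2                 # coefficient of determination
alienation = 1 - r_squared         # coefficient of alienation

ss_regression = r_squared * ss_total
ss_residual = alienation * ss_total
std_error_estimate = sqrt(ss_residual / (n - 2))

print(f"r^2 = {r_squared:.3f}, 1 - r^2 = {alienation:.3f}")
print(f"SSregression = {ss_regression:.2f}, SSresidual = {ss_residual:.2f}")
print(f"standard error of estimate = {std_error_estimate:.3f}")
```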

Testing Overall Model
- We can test for the overall prediction of the model by forming the ratio:

$$F = \frac{s_{regression}^2}{s_{residual}^2}$$

- If the calculated F value is larger than a tabled value (F-Table) we have a significant prediction

Testing Overall Model
- Example

$$F = \frac{s_{regression}^2}{s_{residual}^2} = 19.594$$

- F-Table: F critical is found using 2 things: dfregression (numerator) and dfresidual (denominator)
- F-Table: our Fcrit (1,19) = 4.38
- 19.594 > 4.38, significant overall
- Should all sound familiar
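The F ratio for the overall model can be sketched in Python as below. The critical value 4.38 is taken from the slides rather than computed, and the variable names are illustrative.

```python
cigs = [11, 9, 9, 9, 8, 8, 8, 6, 6, 5, 5, 5, 5, 5, 5, 4, 4, 4, 3, 3, 3]
chd = [26, 21, 24, 21, 19, 13, 19, 11, 23, 15, 13, 4, 18, 12, 3, 11, 15, 6, 13, 4, 14]

n = len(cigs)
mean_x, mean_y = sum(cigs) / n, sum(chd) / n
b = (sum((x - mean_x) * (y - mean_y) for x, y in zip(cigs, chd))
     / sum((x - mean_x) ** 2 for x in cigs))
a = mean_y - b * mean_x
predicted = [b * x + a for x in cigs]

ss_regression = sum((p - mean_y) ** 2 for p in predicted)
ss_residual = sum((y - p) ** 2 for y, p in zip(chd, predicted))

df_regression = 1              # one predictor
df_residual = n - 1 - df_regression

# F is the regression mean square over the residual mean square.
f_ratio = (ss_regression / df_regression) / (ss_residual / df_residual)

F_CRIT_1_19 = 4.38             # tabled value for (1, 19) df, from the slide
significant = f_ratio > F_CRIT_1_19

print(f"F({df_regression}, {df_residual}) = {f_ratio:.2f}, "
      f"significant: {significant}")
```

Computed from the raw data, F comes out at about 19.59, in line with the value on the slide.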

SPSS output*

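Testing Slope and Intercept
- The regression coefficients can be tested for significance
- Each coefficient divided by its standard error equals a t value that can also be looked up in a t-table
- Each coefficient is tested against 0

Testing the Slope
- With only 1 predictor, the standard error for the slope is:

$$se_b = \frac{s_{Y - \hat{Y}}}{s_X \sqrt{N - 1}}$$

- For our Example, with a standard error of estimate of √(440.757/19) = 4.816:

$$se_b = \frac{4.816}{2.334\sqrt{21 - 1}} = \frac{4.816}{10.438} = .461$$

$$t = \frac{b}{se_b} = \frac{2.042}{.461} = 4.43$$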
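The slope test can be sketched in Python, assuming the standard single-predictor formula for the standard error of the slope (standard error of estimate divided by sX times the square root of N - 1). Variable names are illustrative.

```python
from math import sqrt

cigs = [11, 9, 9, 9, 8, 8, 8, 6, 6, 5, 5, 5, 5, 5, 5, 4, 4, 4, 3, 3, 3]
chd = [26, 21, 24, 21, 19, 13, 19, 11, 23, 15, 13, 4, 18, 12, 3, 11, 15, 6, 13, 4, 14]

n = len(cigs)
mean_x, mean_y = sum(cigs) / n, sum(chd) / n
ss_x = sum((x - mean_x) ** 2 for x in cigs)
b = sum((x - mean_x) * (y - mean_y) for x, y in zip(cigs, chd)) / ss_x
a = mean_y - b * mean_x

ss_residual = sum((y - (b * x + a)) ** 2 for x, y in zip(cigs, chd))
std_error_estimate = sqrt(ss_residual / (n - 2))
sd_x = sqrt(ss_x / (n - 1))

# Standard error of the slope and its t statistic (df = N - 2).
se_b = std_error_estimate / (sd_x * sqrt(n - 1))
t_slope = b / se_b

print(f"se_b = {se_b:.3f}, t({n - 2}) = {t_slope:.2f}")
print(f"t squared = {t_slope ** 2:.2f}")
```

With a single predictor, the squared t for the slope equals the overall F, and the t itself matches the t obtained when testing r with the unrounded correlation.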

Testing Slope and Intercept
- These are given in computer printout as a t test.

Testing
- The t values in the second from right column are tests on slope and intercept.
- The associated p values are next to them.
- The slope is significantly different from zero, but not the intercept.
- Why do we care?

Testing
- What does it mean if slope is not significant?
  - How does that relate to test on r?
- What if the intercept is not significant?
- Does significant slope mean we predict quite well?

With updates on slides 23, 25, 27, 29, 35, 36.

Landwehr, J.M. & Watkins, A.E. (1987). Exploring Data: Teacher's Edition. Palo Alto, CA: Dale Seymour Publications.