Page 1: Statistics for Everyone Workshop  Fall 2010

Statistics for Everyone Workshop Fall 2010

Part 6A

Assessing the Relationship Between 2

Numerical Variables Using Correlation

Workshop presented by Linda Henkel and Laura McSweeney of Fairfield University

Funded by the Core Integration Initiative and the Center for Academic Excellence at Fairfield University

Page 2: Statistics for Everyone Workshop  Fall 2010

What if the research question we want to ask is whether or how strongly two variables are related to each other, but we do not have experimental control over those variables and we rely on already existing conditions?

• TV viewing and obesity

• Age and severity of flu symptoms

• Tire pressure and gas mileage

• Amount of pressure and compression of insulation

Assessing the Relationship Between 2 Numerical Variables With Correlation

Page 3: Statistics for Everyone Workshop  Fall 2010

The nature of the research question is about the association between two variables, not about whether one causes the other

Correlation describes and quantifies the systematic, linear relation between variables

Correlation ≠ Causation

Assessing the Relationship Between 2 Variables With Correlation

Page 4: Statistics for Everyone Workshop  Fall 2010

Statistics as a Tool in Scientific Research

Types of Research Questions
• Descriptive (What does X look like?)
• Correlational (Is there an association between X and Y? As X increases, what does Y do?)
• Experimental (Do changes in X cause changes in Y?)

Different statistical procedures allow us to answer the different kinds of research questions

Page 5: Statistics for Everyone Workshop  Fall 2010

Correlation: type and strength of linear relationship between X and Y

Linear Regression: making predictions of Variable Y based on knowing the value of Variable X

Page 6: Statistics for Everyone Workshop  Fall 2010

Linear Correlation Test

Statistical test for Pearson correlation coefficient (r)

Used for: Analyzing the strength of the linear relationship between two numerical variables

Use when: Both variables are numerical (interval or ratio) and the research question is about the type and strength of relation (not about causality)
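For instructors who want to show the computation outside of Excel or SPSS, a minimal Python sketch (assuming scipy is available; the x and y values are the surface-area-to-volume and drug-release data used later in this workshop):

    # Pearson r and its two-tailed p value for two numerical variables
    from scipy.stats import pearsonr

    x = [1.5, 1.05, 0.9, 0.75, 0.6, 0.65]   # surface area to volume ratio
    y = [60, 48, 39, 33, 30, 29]            # drug release rate (% released)

    r, p = pearsonr(x, y)
    print(f"r = {r:.2f}, p = {p:.4f}")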

Page 7: Statistics for Everyone Workshop  Fall 2010

Other Correlation Coefficients

There are many different types of correlation coefficients – used for different types of variables -- but we focus only on Pearson r

Point biserial (rPB): Use when one variable is nominal and has two levels (e.g., gender [male/female], type of car [gas-powered, hybrid]) and one variable is numerical (e.g., reaction time; miles per gallon)

Spearman rank order (rS): Use for ordinal data or numerical data that are not normally distributed or linear

Calculators for these are available at: http://faculty.vassar.edu/lowry/VassarStats.html
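If you use Python rather than the VassarStats calculators, scipy has routines for both of these; a sketch with made-up values (the car-type coding and mpg numbers are hypothetical):

    # Spearman rank-order and point-biserial correlations
    from scipy.stats import spearmanr, pointbiserialr

    # Spearman: ordinal data, or numerical data that are not normally distributed/linear
    x_ranks = [1, 2, 3, 4, 5, 6]
    y_ranks = [2, 1, 4, 3, 6, 5]
    rho, p_rho = spearmanr(x_ranks, y_ranks)

    # Point biserial: one dichotomous variable (coded 0/1) and one numerical variable
    car_type = [0, 0, 0, 1, 1, 1]         # 0 = gas-powered, 1 = hybrid
    mpg = [24, 27, 25, 45, 50, 48]        # miles per gallon
    r_pb, p_pb = pointbiserialr(car_type, mpg)

    print(f"Spearman rho = {rho:.2f} (p = {p_rho:.3f})")
    print(f"Point-biserial r = {r_pb:.2f} (p = {p_pb:.3f})")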

Page 8: Statistics for Everyone Workshop  Fall 2010

The Essence of the Correlation Coefficient

The correlation coefficient indicates whether or not a relationship exists (association, co-occurrence, covariation)

The value of the correlation coefficient tells you about the type and strength of the relationship
• Positive, negative
• Strong, moderately strong, weak, no relation

Page 9: Statistics for Everyone Workshop  Fall 2010

Different Types of Correlations

Positive: As X increased, Y increased and as X decreased, Y decreased

Negative: As X increased, Y decreased and as X decreased, Y increased

Page 10: Statistics for Everyone Workshop  Fall 2010

Different Types of Correlations

• No systematic relation between X and Y

• High values of X are associated with both high & low values of Y;

• Low values of X are associated with both high & low values of Y

Page 11: Statistics for Everyone Workshop  Fall 2010

Different Types of Correlations

Curvilinear: Not straight

U: As X increased, Y decreased up to a point then increased

Inverted U: As X increased, Y increased up to a point, then decreased

Page 12: Statistics for Everyone Workshop  Fall 2010

Describing Linear Correlation

Pearson correlation coefficient (r)

Direction/type: Positive, Negative, or No correlation

Magnitude/strength:
r = ±1 → perfect correlation
r closer to ±1 → strong correlation
r closer to 0 → weak or no correlation

• r is unitless; r is NOT a proportion or percentage
• It doesn’t matter which variable you call X or Y when calculating r

Page 13: Statistics for Everyone Workshop  Fall 2010

Issues in Interpreting Correlation Coefficient

Magnitude of the Effect:

|r|            Conclusion About Relationship
0 to .20       negligible to weak
.20 to .40     weak to moderate
.40 to .60     moderate to strong
.60 to 1.0     strong
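If students find it helpful, these cutoffs can be wrapped in a small helper function; this is only a sketch of the guidelines in the table above, not a rule built into any statistics package:

    def describe_strength(r):
        """Map |r| onto the qualitative labels in the table above."""
        size = abs(r)
        if size < 0.20:
            return "negligible to weak"
        elif size < 0.40:
            return "weak to moderate"
        elif size < 0.60:
            return "moderate to strong"
        return "strong"

    print(describe_strength(0.99))    # strong
    print(describe_strength(0.002))   # negligible to weak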

Page 14: Statistics for Everyone Workshop  Fall 2010

Example 1: What can we say?

Deposit Time (s)   Oxide Thickness (Angstroms)
18                 1059
35                 1049
52                 1039
52                 1026
18                 1001
23                 1263
etc.               etc.

[Scatterplot: Oxide Thickness (Angstroms) vs. Deposit Time (s)]

Page 15: Statistics for Everyone Workshop  Fall 2010

What can we say?

There is little or no correlation between oxide thickness and deposit time.

In fact, r = .002.

Page 16: Statistics for Everyone Workshop  Fall 2010

Example 2: What can we say?

Surface Area to Volume   Drug Release Rate
1.5                      60
1.05                     48
0.9                      39
0.75                     33
0.6                      30
0.65                     29

[Scatterplot: Drug Release Rate (% released) vs. Surface Area To Volume (mm^2/mm^3)]

Page 17: Statistics for Everyone Workshop  Fall 2010

What can we say?

There was a strong, positive correlation between surface area to volume and the drug release rate.

As the surface area to volume increased, the drug release rate increased

The smaller the surface area to volume, the lower the drug release rate

Page 18: Statistics for Everyone Workshop  Fall 2010

What More Does the Correlation Coefficient Tell Us?

Surface Area to Volume Ratio   Drug Release Rate
1.5                            60
1.05                           48
0.9                            39
0.75                           33
0.6                            30
0.65                           29

[Scatterplot: Drug Release Rate (% released) vs. Surface Area To Volume Ratio (mm^2/mm^3)]

r = .99

Page 19: Statistics for Everyone Workshop  Fall 2010

Interpreting Correlation Coefficient

A strong linear association was found between the surface area to volume and the drug release rate, r = .99

There was a significant positive correlation between the surface area to volume and the drug release rate, r = .99

As the surface area to volume increased the drug release rate increased, and this was a strong linear relationship, r = .99

Page 20: Statistics for Everyone Workshop  Fall 2010

Testing for Linear Correlation

Correlations are descriptive

We can describe the type and strength of the linear relationship between two variables by examining the r value

r values are associated with p values that reflect whether the obtained relationship is likely real or more likely just due to chance

Page 21: Statistics for Everyone Workshop  Fall 2010

Running and Interpreting a Correlation

The key research question in a correlational design is: Is there a real linear relationship between the two variables, or is the obtained pattern no different from what we would expect just by chance?

i.e. H0: The population correlation coefficient is zero

HA: The population correlation coefficient is not zero

Teaching tip: In order to understand what we mean by “a real relationship,” students must understand probability

Page 22: Statistics for Everyone Workshop  Fall 2010

What Do We Mean by “Just Due to Chance”?

p value = probability of results being due to chance

When the p value is high (p > .05), the obtained relationship is probably due to chance

.99 .75 .55 .25 .15 .10 .07

When the p value is low (p < .05), the obtained relationship is probably NOT due to chance and more likely reflects a real relationship

.04 .03 .02 .01 .001

Page 23: Statistics for Everyone Workshop  Fall 2010

What Do We Mean by “Just Due to Chance”?

p value = probability of results being due to chance

[Probability of observing your data (or data more extreme) if H0 were true]

When the p value is high (p > .05), the obtained relationship is probably due to chance

[Data likely if H0 were true]

.99 .75 .55 .25 .15 .10 .07

When the p value is low (p < .05), the obtained relationship is probably NOT due to chance and more likely reflects a real relationship

[Data unlikely if H0 were true, so data support HA]

.04 .03 .02 .01 .001

Page 24: Statistics for Everyone Workshop  Fall 2010

What Do We Mean by “Just Due to Chance”?

In science, a p value of .05 is a conventionally accepted cutoff point for saying when a result is more likely due to chance or more likely due to a real effect

Not significant = the obtained relationship is probably due to chance; the relationship observed does not appear to really differ from what would be expected based on chance; p > .05

Statistically significant = the obtained relationship is probably NOT due to chance and is likely a real linear relationship between the two variables; p < .05

Page 25: Statistics for Everyone Workshop  Fall 2010

Finding the Linear Correlation Using Excel

Step 1: Make a scatterplot of the data

If there is a linear trend, go to Step 2

Step 2: Get the Pearson Correlation Coefficient (r)

Using the =CORREL(X, Y) function

[Refer to handouts for more detail]
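The same two steps can also be sketched in Python for instructors who prefer a scripting tool (matplotlib for the scatterplot, numpy for r; the data are the drug-release values from earlier slides):

    # Step 1: scatterplot; Step 2: Pearson r (the equivalent of Excel's =CORREL)
    import numpy as np
    import matplotlib.pyplot as plt

    x = np.array([1.5, 1.05, 0.9, 0.75, 0.6, 0.65])   # surface area to volume ratio
    y = np.array([60, 48, 39, 33, 30, 29])            # drug release rate (% released)

    plt.scatter(x, y)                                  # check for a linear trend
    plt.xlabel("Surface Area To Volume Ratio (mm^2/mm^3)")
    plt.ylabel("Drug Release Rate (% released)")
    plt.show()

    r = np.corrcoef(x, y)[0, 1]
    print(f"r = {r:.2f}")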

Page 26: Statistics for Everyone Workshop  Fall 2010

Running a Test for Linear Correlation Using Excel

Pearson’s r: Both variables are numerical (interval or ratio) and the research question is about the type and strength of relation (not about causality)

Need:
• Pearson correlation coefficient (r)
• Number of pairs of observations (N)

To run: Open Excel file “SFE Statistical Tests” and go to page called Linear Correlation Test

Enter the correlation coefficient (r) and number of pairs of observations (N)

Output: Computer calculates the p value
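Under the usual assumptions, a p value for r given only r and N can be computed with the standard t test for a correlation, t = r√(N − 2)/√(1 − r²) with N − 2 degrees of freedom. This sketch is not the workshop's Excel file, just an illustration of the same calculation (using the r and N from the drug-release example):

    # Two-tailed p value for a Pearson r, given only r and N
    import math
    from scipy.stats import t

    r = 0.99    # Pearson correlation coefficient
    N = 6       # number of pairs of observations

    t_stat = r * math.sqrt(N - 2) / math.sqrt(1 - r**2)
    p = 2 * t.sf(abs(t_stat), df=N - 2)
    print(f"t({N - 2}) = {t_stat:.2f}, p = {p:.4f}")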

Page 27: Statistics for Everyone Workshop  Fall 2010

Running a Test for Linear Correlation Using SPSS

When to Calculate Pearson’s r: Both variables are numerical (interval or ratio) and the research question is about the type and strength of relation (not about causality)

 

To run:

Analyze → Correlate → Bivariate

Move the X and Y variables over. Check “Pearson Correlation” and “Two-tailed test of significance”. Then click OK.

 

Output: Computer calculates r and the p value

Page 28: Statistics for Everyone Workshop  Fall 2010

Reporting Results of Correlation

If the relationship was significant (p < .05)

(a) Say: A [say something about size: weak, moderate, strong] [say direction: positive/negative] correlation was found between [X] and [Y] that was statistically significant, r = .xx, p = .xx

(b) Describe the relation: Thus as X increased, Y [increased/decreased]

Note: If the r value is positive, then as X increased, Y increased; if r is negative, then as X increased, Y decreased

e.g., A strong positive correlation was found between surface area to volume ratio and the drug release rate that was statistically significant, r = .99, p = .0001. As the surface area to volume ratio increased, the drug release rate increased.

Page 29: Statistics for Everyone Workshop  Fall 2010

Reporting Results of Correlation

If the relationship was not significant (p > .05)

Say: No statistically significant correlation between [X] and [Y] was found, r = .xx, p = .xx. Thus as X increased, Y neither increased nor decreased systematically.

e.g., No statistically significant correlation between oxide thickness and deposit time was found, r = .002, p = .99. Thus as deposit time increased, oxide thickness neither increased nor decreased systematically.

Page 30: Statistics for Everyone Workshop  Fall 2010

Teaching Tip

Impress upon your students that an association does not imply causation

e.g., Average life expectancy and the average number of TVs per household are highly correlated.

But you can’t increase life expectancy by increasing the number of TVs. They are related, but it isn’t a cause-and-effect relationship.

Page 31: Statistics for Everyone Workshop  Fall 2010

More Teaching Tips

You can ask your students to report either:

• the exact p value (p = .03, p = .45)

• the cutoff: say either p < .05 (significant) or p > .05 (not significant)

You should specify which style you expect. Ambiguity confuses them!

Tell students to use the word “significant” only when they mean it (i.e., the probability the results are due to chance is less than 5%) and not to pair it with adjectives (they often mistakenly think one test can be “more significant” or “less significant” than another). Emphasize that “significant” is a cutoff that is either met or not met, just as you are either found guilty or not guilty, pregnant or not pregnant. There are no gradients. Lower p values mean results are less likely to be due to chance, not “more significant.”

Page 32: Statistics for Everyone Workshop  Fall 2010

Correlations are Predictive

You can predict one variable from the other if there is a strong correlation.

If so, linear regression can be used to find a linear model which can be used to make predictions.

But, remember to make a scatterplot of the data to see if a linear model is appropriate and/or the best model!

Page 33: Statistics for Everyone Workshop  Fall 2010

Correlation: type and strength of linear relationship between X and Y

Linear Regression: making predictions of Variable Y based on knowing the value of Variable X

Page 34: Statistics for Everyone Workshop  Fall 2010

Example 1

A linear regression seems appropriate since the data have an overall linear trend.

Page 35: Statistics for Everyone Workshop  Fall 2010

Example 2

Even though r = -.92, a linear regression model is not appropriate here. A different model should be used.

Mass of chemical spill at time t (r = -.92)

[Scatterplot: Mass of spill (lbs) vs. Time (minutes)]

Page 36: Statistics for Everyone Workshop  Fall 2010

Terminology

X variable: predictor (independent variable, explanatory variable, what we know)

Y variable: criterion (dependent variable, response variable, what we want to predict)

Stronger correlation = better prediction

Page 37: Statistics for Everyone Workshop  Fall 2010

Making Predictions

Time (s)   Oxide Thickness (Angstroms)
18         1059
35         1049
52         1039
52         1026
18         1001
23         1263
etc.       etc.

[Scatterplot: Oxide thickness vs. Time]

Page 38: Statistics for Everyone Workshop  Fall 2010

Making Predictions

Surface Area to Volume Ratio   Drug Release Rate
1.5                            60
1.05                           48
0.9                            39
0.75                           33
0.6                            30
0.65                           29

[Scatterplot: Drug Release Rate (% released) vs. Surface Area To Volume Ratio (mm^2/mm^3)]

Page 39: Statistics for Everyone Workshop  Fall 2010

Using Regression to Make Predictions

Low or no correlation (e.g., oxide thickness and deposit time)
• Best prediction of Y is the mean of Y (knowing X doesn’t add anything)

High correlation (e.g., surface area to volume ratio vs. drug release rate)
• Best prediction of Y is based on knowing X

Page 40: Statistics for Everyone Workshop  Fall 2010

Sample Scatterplots & Regression Lines

If r = +1, Y’ = Y.   If r = 0, Y’ = MY.

Page 41: Statistics for Everyone Workshop  Fall 2010

Terminology

Regression line:

• Best fitting straight line that summarizes a linear relation

• Comprised of the predicted values of Y

(denoted by Y’ )

Page 42: Statistics for Everyone Workshop  Fall 2010

What Do We Mean by the “Best Fit” Line?

Comparing Amounts of Snowfall

[Scatterplot: Gauge Measured (inches of snow) vs. Radar Prediction (inches of snow)]

Page 43: Statistics for Everyone Workshop  Fall 2010

What Do We Mean by the “Best Fit” Line?

Comparing Amounts of Snowfall

[Scatterplot: Gauge Measured (inches of snow) vs. Radar Prediction (inches of snow)]

Page 44: Statistics for Everyone Workshop  Fall 2010

What Do We Mean by the “Best Fit” Line?

Comparing Amounts of Snowfall

[Scatterplot with best fit line: Gauge Measured (inches of snow) vs. Radar Prediction (inches of snow)]

It is the line that minimizes the vertical distances from the observed points to the line

Page 45: Statistics for Everyone Workshop  Fall 2010

Least Squares Line

Error or residual = observed Y – predicted Y = Y – Y’

The least-squares line is the “best fit” line that minimizes the sum of the squared errors/residuals

i.e., it minimizes Σ(Y – Y’)²

Page 46: Statistics for Everyone Workshop  Fall 2010

Best Fit Line

Also known as linear regression model or least squares line

Used to predict Y using a single predictor X and a linear model

Y’ = bX + a

Y’ = predicted y-value
X = known x-value
b = slope
a = y-intercept, the point where the line crosses the y-axis

[We can use more advanced techniques for multiple predictors or nonlinear models]

Page 47: Statistics for Everyone Workshop  Fall 2010

Interpreting the Model

Y’ = bX + a

What is b?
• Slope of the regression line

• The ratio of how much Y changes relative to a change in one unit of X

• Same sign as the correlation

What is a?
• Y-intercept, where the line crosses the Y axis (where X = 0)

Page 48: Statistics for Everyone Workshop  Fall 2010

Getting the Least Squares Line

While Excel will get the least squares line, we give the following formulas for completeness.

b = r × (SDY / SDX)

a = MY – b × MX

where SDX and SDY are the standard deviations of X and Y, and MX and MY are their means.
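These formulas can be checked numerically; a sketch using the drug-release data from earlier slides, compared against numpy's own least squares fit:

    # Slope and intercept from r, the standard deviations, and the means
    import numpy as np

    x = np.array([1.5, 1.05, 0.9, 0.75, 0.6, 0.65])   # surface area to volume ratio
    y = np.array([60, 48, 39, 33, 30, 29])            # drug release rate (% released)

    r = np.corrcoef(x, y)[0, 1]
    b = r * y.std(ddof=1) / x.std(ddof=1)   # b = r * (SD_Y / SD_X)
    a = y.mean() - b * x.mean()             # a = M_Y - b * M_X

    b_np, a_np = np.polyfit(x, y, 1)        # numpy's least squares line for comparison
    print(f"b = {b:.3f}, a = {a:.3f}")      # roughly 35.9 and 7.2 for these six pairs
    print(f"polyfit: b = {b_np:.3f}, a = {a_np:.3f}")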

Page 49: Statistics for Everyone Workshop  Fall 2010

Getting the Least Squares Line in Excel

Linear Regression: When you want to predict a numerical variable Y from another numerical variable X

Need:
• Enter X and Y data in Excel
• Be sure that the X column is directly to the left of the Y column

To get linear model: Make a scatterplot of Y vs. X.

Click on the Chart and choose Chart/Add Trendline

Choose the Option tab and select Display equation

Output: Least squares line

Page 50: Statistics for Everyone Workshop  Fall 2010

Getting the Least Squares Line in SPSS

Linear Regression: When you want to predict a numerical variable Y from another numerical variable X

 

Need: Enter X and Y data in SPSS

 

To get linear model:

Analyze → Regression → Linear

Move the Y variable to Dependent and the X variable to Independent; click OK.

 

Output: Least squares line

 

Page 51: Statistics for Everyone Workshop  Fall 2010

Example

X = Surface Area to Volume Ratio (mm^2/mm^3)
Y = Drug Release Rate (% released)

The least squares line is Y’ = 36.92X + 7.21

If surface area to volume ratio is 1, what is predicted drug release rate? ______

If the actual drug release rate was 40%, what is the residual? ________

Page 52: Statistics for Everyone Workshop  Fall 2010

Example

X = Surface Area to Volume Ratio (mm^2/mm^3)
Y = Drug Release Rate (% released)

The least squares line is Y’ = 36.92X + 7.21

If surface area to volume ratio is 1, what is predicted drug release rate?

Y’ = 36.92(1)+7.21 = 44.13% released

If the actual drug release rate was 40%, what is the residual?

Residual = Y – Y’ = 40 – 44.13 = -4.13%
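As a tiny sketch of the same arithmetic (the slope and intercept are the ones given on this slide):

    # Prediction and residual from the least squares line Y' = 36.92X + 7.21
    b, a = 36.92, 7.21

    def predict(x):
        return b * x + a         # predicted drug release rate (Y')

    y_prime = predict(1.0)       # 36.92 * 1 + 7.21 = 44.13 (% released)
    residual = 40 - y_prime      # observed - predicted = -4.13 (%)
    print(f"Y' = {y_prime:.2f}%, residual = {residual:.2f}%")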

Page 53: Statistics for Everyone Workshop  Fall 2010

Example

X = Surface Area to Volume Ratio (mm^2/mm^3)

Y = Drug Release Rate (% released)

The least squares line is

Y’ = 36.92X + 7.21

By how much can you expect the drug release rate to increase or decrease if you increase the surface area to volume ratio by 1 mm^2/mm^3? _________

Page 54: Statistics for Everyone Workshop  Fall 2010

Example

X = Surface Area to Volume Ratio (mm^2/mm^3)
Y = Drug Release Rate (% released)

The least squares line is Y’ = 36.92X + 7.21

By how much can you expect the drug release rate to increase or decrease if you increase the surface area to volume ratio by 1 mm^2/mm^3?

This is just the slope, so you would expect the drug release rate to increase by 36.92%

Page 55: Statistics for Everyone Workshop  Fall 2010

What about R²?

Often in correlational research the value R² is reported.

R² = Proportion of variability of Y accounted for by X and the linear model

For example, if R² = .64 then we can say:
• 64% of the variance in the Y scores can be predicted from Y’s (linear) relation with X
• Prediction error is reduced by 64% when we use the linear regression equation (Y’) to make predictions instead of MY

Page 56: Statistics for Everyone Workshop  Fall 2010

How do you get R²?

R² is literally the correlation coefficient, r, times itself:

R² = r × r = (r)²

0 ≤ R² ≤ 1

Low r → low R² → little improvement over MY

High r → high R² → more accurate predictions using Y’

Page 57: Statistics for Everyone Workshop  Fall 2010

How do you get r from R²?

r = +√(R²) or r = –√(R²)

Remember, the correlation (r) has the same sign as the slope of the least squares line.
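Both conversions are one-liners in Python; a small sketch (the slope value here is just the one from the drug-release example):

    # Converting between r and R^2
    import math

    r = 0.99
    r_squared = r ** 2                    # R^2 = r * r, about 0.98

    slope = 36.92                         # the sign of the slope gives the sign of r
    r_back = math.copysign(math.sqrt(r_squared), slope)
    print(f"R^2 = {r_squared:.2f}, r recovered from R^2 = {r_back:.2f}")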

Page 58: Statistics for Everyone Workshop  Fall 2010

Issues to Consider

1. You can’t extrapolate outside the given domain of the X’s

2. Be careful of the effect of potential outliers

3. Remember to impress upon your students that an association does not imply causation, no matter how seductive it may be to think that the 2 variables are causally related

Page 59: Statistics for Everyone Workshop  Fall 2010

You can’t extrapolate outside the given domain of the X’s

You can only make predictions in the range of the given data.

We don’t know if the pattern we see continues for surface area to volume ratios less than .6 and greater than 1.5 mm2/mm3.

Example: y = 35.916x + 7.2094

[Scatterplot with least squares line: Drug Release Rate (% released) vs. Surface Area To Volume Ratio (mm^2/mm^3)]

Page 60: Statistics for Everyone Workshop  Fall 2010

Be careful of the effect of potential outliers

Outliers can be influential or not.

Exploring Linear Regression Excel Worksheet

Page 61: Statistics for Everyone Workshop  Fall 2010

Hypothesis Testing For Simple Linear Regression

Y = α + βX + ε, where ε is normally distributed with mean 0 and variance σ²

The F test allows a scientist to determine whether their research hypothesis is supported

Null hypothesis H0:

• There is not a linear relationship between X and Y
• There is no correlation between X and Y (β = 0)

Page 62: Statistics for Everyone Workshop  Fall 2010

Hypothesis Testing For Simple Linear Regression

Research hypothesis HA:
• X is linearly related to Y
• The correlation between X and Y is different from 0 (β ≠ 0)

Teaching tip: Very important for students to be able to understand and state the research question so that they see that statistics is a tool to answer that question

Page 63: Statistics for Everyone Workshop  Fall 2010

Hypothesis Testing For Simple Linear Regression

Null hypothesis: Pain level is not linearly related to weight

Research hypothesis: Pain level is linearly related to weight

Null hypothesis: There is no linear relationship between age and heart rate

Research hypothesis: There is a linear relationship between age and heart rate

Page 64: Statistics for Everyone Workshop  Fall 2010

Hypothesis Testing For Simple Linear Regression

p value = probability of results being due to chance

When the p value is high (p > .05), the obtained relationship is probably due to chance

.99 .75 .55 .25 .15 .10 .07

When the p value is low (p < .05), the obtained relationship is probably NOT due to chance and more likely reflects a real linear relationship between X and Y

.04 .03 .02 .01 .001

Page 65: Statistics for Everyone Workshop  Fall 2010

Hypothesis Testing For Simple Linear Regression

In science, a p value of .05 is a conventionally accepted cutoff point for saying when a result is more likely due to chance or more likely due to a real effect

Not significant = the obtained relationship is probably due to chance; X does not appear to be linearly related to Y; p > .05

Statistically significant = the obtained relationship is probably NOT due to chance and X and Y appear to be linearly related; p < .05

Page 66: Statistics for Everyone Workshop  Fall 2010

Sources of Variance in Simple Linear Regression

SSTotal = SSRegression + SSResidual

Σ(Yi – MY)² = Σ(Yi’ – MY)² + Σ(Yi – Yi’)²

Total Sum of Squares: How much do all the individual scores differ from the grand mean

Regression Sum of Squares: How much do all the predicted values differ from the grand mean

Residual Sum of Squares: How much do the individual scores differ from the predicted values

Page 67: Statistics for Everyone Workshop  Fall 2010

Each F test has certain values for degrees of freedom (df), which are based on the sample size (N) and the number of predictors, and the F value will be associated with a particular p value

SPSS calculates these numbers.

Summary Table for Simple Linear Regression

Source             Sum of Squares (SS)   df      Mean Square (MS)              F
Regression         Σ(Yi’ – MY)²          1       SSRegression / dfRegression   MSRegression / MSResidual
Residual (Error)   Σ(Yi – Yi’)²          N – 2   SSResidual / dfResidual
Total              Σ(Yi – MY)²           N – 1

N = # of (X, Y) pairs in the sample

To report, use the format: F(dfRegression, dfResidual) = x.xx, p _____.
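For instructors who want to show where the entries in this table come from, here is a sketch that builds them from raw data (the x and y values are the drug-release example; scipy is assumed):

    # F test for simple linear regression built from its sums of squares
    import numpy as np
    from scipy.stats import f

    x = np.array([1.5, 1.05, 0.9, 0.75, 0.6, 0.65])   # predictor (X)
    y = np.array([60, 48, 39, 33, 30, 29])            # criterion (Y)
    N = len(y)

    b, a = np.polyfit(x, y, 1)          # least squares line
    y_pred = b * x + a                  # predicted values Y'

    ss_reg = np.sum((y_pred - y.mean()) ** 2)   # regression SS, df = 1
    ss_res = np.sum((y - y_pred) ** 2)          # residual SS,   df = N - 2

    F = (ss_reg / 1) / (ss_res / (N - 2))
    p = f.sf(F, 1, N - 2)
    print(f"F(1, {N - 2}) = {F:.2f}, p = {p:.4f}")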

Page 68: Statistics for Everyone Workshop  Fall 2010

The F Ratio

A test for simple linear regression gives you an F ratio

The bigger the F value, the less likely the relationship between X and Y is just due to chance and the more likely it reflects a real relationship

So big values of F will be associated with small p values that indicate the linear relationship is significant (p < .05)

Little values of F (i.e., close to 1) will be associated with larger p values that indicate the linear relationship is not significant (p > .05)

Page 69: Statistics for Everyone Workshop  Fall 2010

Interpreting the F test for Simple Linear Regression

Cardinal rule: Scientists do not say “prove”! Conclusions are based on probability (likely due to chance, likely a real effect…). Be explicit about this to your students.

Based on p value, determine whether you have evidence to conclude the relationship was probably real or was probably due to chance: Is the research hypothesis supported?

p < .05: Significant
• Reject null hypothesis and support research hypothesis (the relationship was probably real; X is linearly related to Y)

p > .05: Not significant
• Retain null hypothesis and reject research hypothesis (any relationship was probably due to chance; X is not linearly related to Y)

Page 70: Statistics for Everyone Workshop  Fall 2010

Teaching Tips

Students have trouble understanding what is less than .05 and what is greater, so a little redundancy will go a long way!

Whenever you say “p is less than point oh-five” also say, “so the probability that relationship is due to chance is less than 5%, so the 2 variables are likely to be linearly related.”

Whenever you say “p is greater than point oh-five” also say, “so the probability that this is due to chance is greater than 5%, so there’s just not enough evidence to conclude that there is a linear relationship – X and Y are not really linearly related”

In other words, read the p value as a percentage, as odds, “the odds that this relationship is due to chance are 1%, so X and Y probably are linearly related…”

Page 71: Statistics for Everyone Workshop  Fall 2010

Running the F Test for Simple Linear Regression in SPSS

Linear Regression: When you want to test whether there is a linear relationship between two numerical variables (X and Y)

 

Need: Enter X and Y data in SPSS

 

To get linear model:

Analyze → Regression → Linear

Move the Y variable to Dependent and the X variable to Independent; click OK.

 

Output: F and p-value

 

Page 72: Statistics for Everyone Workshop  Fall 2010

Reporting the Results of the F Test for Simple Linear Regression

Step 1: Write a sentence that clearly indicates what statistical analysis you used

An F test was used to determine if there was a linear relationship between [X] and [Y].

-Or - An F test was used to determine if [X] is linearly related to [Y].

An F test was used to determine if pain level is linearly related to weight.

An F test was used to determine if there is a linear relationship between age and heart rate.

Page 73: Statistics for Everyone Workshop  Fall 2010

Reporting the Results of the F Test for Simple Linear Regression

Step 2: Report whether the linear relationship was significant or not

The linear relationship between [X] and [Y] was significant [or not significant], F(dfRegression, dfResidual) = X.XX [fill in F], p = xxxx.

The linear relationship between pain level and weight was significant, F(1, 40) = 7.31, p = .01.

The linear relationship between age and heart rate was not significant, F(1, 120) = 2.35, p = .13.

Page 74: Statistics for Everyone Workshop  Fall 2010

Assumptions Involved in Simple Linear Regression

• Linearity

• Randomness and Normality

• Homoscedasticity

Page 75: Statistics for Everyone Workshop  Fall 2010

Assessing Linearity

Is a linear model appropriate?

Is a linear model the best model?

To check this make a scatterplot of the data and see if there is an overall linear trend.

Page 76: Statistics for Everyone Workshop  Fall 2010

Assessing Linearity

Linear model seems appropriate.

Page 77: Statistics for Everyone Workshop  Fall 2010

Assessing Linearity

Linear regression is not appropriate here. Should use a different model.

Mass of chemical spill at time t

[Scatterplot: Mass of spill (lbs) vs. Time (minutes)]

Page 78: Statistics for Everyone Workshop  Fall 2010

Assessing Randomness and Normality of Residuals

Are the residuals random and normally distributed?

The data should be randomly scattered about the least squares line. There should be no patterns about the least squares line.
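One way to check this visually is a residual plot: plot the residuals (Y – Y’) against X and look for patterns. A sketch, again with the drug-release values standing in for your own data:

    # Residual plot for checking that the residuals look random
    import numpy as np
    import matplotlib.pyplot as plt

    x = np.array([1.5, 1.05, 0.9, 0.75, 0.6, 0.65])
    y = np.array([60, 48, 39, 33, 30, 29])

    b, a = np.polyfit(x, y, 1)
    residuals = y - (b * x + a)

    plt.scatter(x, residuals)
    plt.axhline(0, linestyle="--")    # residuals should scatter randomly around zero
    plt.xlabel("X")
    plt.ylabel("Residual (Y - Y')")
    plt.show()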

Page 79: Statistics for Everyone Workshop  Fall 2010

Assessing Normality of Residuals

Points are randomly scattered about least squares line.

Page 80: Statistics for Everyone Workshop  Fall 2010

Assessing Normality of Residuals

Mass of chemical spill at time t

[Scatterplot with least squares line: Mass of spill (lbs) vs. Time (minutes)]

Points are not randomly scattered about least squares line, so the residuals are not random.

Page 81: Statistics for Everyone Workshop  Fall 2010

Assessing Homoscedasticity

Homoscedasticity = the variability or scatter of points about the best fit line is fairly constant

Check if line fits consistently well for all X’s.

Page 82: Statistics for Everyone Workshop  Fall 2010

Assessing Homoscedasticity

Line fits consistently well throughout.

Since all three assumptions are met for this data set, a linear model is appropriate.

Page 83: Statistics for Everyone Workshop  Fall 2010

Assessing Homoscedasticity

The linear model fits better for weights more than 140. So, there may be a problem with homoscedasticity.

Height vs. Weight

[Scatterplot: Height (in) vs. Weight (lbs)]

Page 84: Statistics for Everyone Workshop  Fall 2010

Time to Practice

Correlation and Linear Regression

