
Post on 21-Dec-2015


Chapter 7 -Part 1

Correlation

Correlation Topics

Correlational research – what is it and how do you do “co-relational” research?

The three questions: Is it a linear or curvilinear correlation? Is it a positive or negative relationship? How strong is the relationship?

Solving these questions with t scores and r, the estimated correlation coefficient derived from the tx and ty scores of individuals in a random sample.

Correlational research – how to start.

To begin a correlational study, we select a population or, far more frequently, select a random sample from a population.

(Since we use samples most of the time, for the most part, we will use the formulae and symbols for computing a correlation from a sample.)

We then obtain two scores from each individual, one score on each of two variables. (These are usually variables that we think might be related to each other for interesting reasons.) We call one variable X and the other Y.

Correlational research: comparing tX & tY scores

We translate the raw scores on the X variable to t scores (called tX scores) and raw scores on the Y variable to tY scores. So each individual has a pair of scores, a tX score and a tY score.

You determine how similar or different the tX and tY scores in the pairs are, on the average, by subtracting tY from tX, then squaring, summing, and averaging the tX and tY differences.
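A minimal sketch of that conversion, assuming a t score is the raw score's deviation from the sample mean divided by the sample standard deviation (the definition the worked example later in this chapter uses). The function name `t_scores` and both data lists are made up for illustration.

```python
from statistics import mean, stdev

def t_scores(raw):
    """Convert raw scores to t scores: (score - sample mean) / sample SD."""
    m = mean(raw)
    s = stdev(raw)  # sample standard deviation (divides by n - 1)
    return [(x - m) / s for x in raw]

# Hypothetical sample: hours studied (X) and quiz score (Y) for four people.
x = [1, 2, 3, 6]
y = [4, 6, 5, 9]

tx = t_scores(x)
ty = t_scores(y)

# Pair up each person's tX and tY and square the difference.
squared_diffs = [(a - b) ** 2 for a, b in zip(tx, ty)]
for a, b, d in zip(tx, ty, squared_diffs):
    print(f"tX={a:+.2f}  tY={b:+.2f}  (tX - tY)^2={d:.2f}")
```

After conversion, both lists have a mean of 0 and a standard deviation of 1, so the pairs can be compared directly even though X and Y were measured in different units.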

The estimated correlation coefficient, Pearson’s r

With a simple formula, you transform the average squared differences between the t scores to Pearson’s correlation coefficient, r.

Pearson’s r indicates, with a single number, both the direction and strength of the relationship between the two variables in your sample.

r also estimates the correlation in the population from which the sample was drawn. In Ch. 8, you will learn when you can use r that way.

Going from pairs of raw scores to r: Linearity - A preliminary question.

Once you have scores on two variables, you

ask, “Is this a linear or curvilinear relationship?”

Psychology is a relatively new science and this is an intro stat course. For both reasons, you will only learn how to deal with linear relationships between two variables and save correlation with three or more variables and curvilinear relationships for grad school.

BUT YOU MUST KNOW WHAT A LINEAR RELATIONSHIP IS, AND HOW TO RECOGNIZE A NONLINEAR (CURVILINEAR) CORRELATION.

Linearity vs. Curvilinearity

In a linear relationship, as scores on one variable go from low to high, scores on the other variable either generally increase or generally decrease.

In a curvilinear relationship, as scores on one variable go from low to high, scores on the other variable change directions. They can go 1.) down and then up, 2.) up and then down, 3.) up and down and then up again, 4.) up or down then flat, etc.

Examples of linear relationships.

For example, think of the relationship of the size of a pleasure boat (X) and its cost (Y). As one variable (boat size) increases, scores on the other variable (cost) also increase.

Another example of a linear relationship: the relationship between the size of a car and the number of miles per gallon it gets.

In general, as cars get gradually larger (X), they tend to get fewer miles per gallon (Y).

A curvilinear relationship

In a curvilinear relationship, as scores on the X variable go gradually from low to high, the Y variable changes direction.

For example, think of the relationship between age (X) and height (Y).

As age increases from 0-14 or so, height increases also.

But then people stop growing. As age increases, height stays the same.

Thus the Y variable, height, changes direction. It goes from gradually rising to flat.

If you graph age and height, the best fitting line is a curved line.

Correlation Characteristics: Which line best shows the relationship between age (X) and height (Y)?

Linear vs Curvilinear

Another non-linear relationship: shortstops and linemen. Great shortstops may be too small to be great football linemen.

Player    Baseball skill    Football potential
David     Terrible          Terrible
Ben       Very Poor         Average
Ed        Poor              Average
Frank     Average           Very Good
Chuck     Good              Excellent
Al        Very Good         Good
George    Excellent         Poor

Is this a linear relationship?

Plot the dots!

To check whether a relationship is linear, make a graph and place the scores on it.

That’s what I mean by “Plot the dots.”

If you really want to know what is going on with data, Plot the dots!

Here is a graph for the baseball skills and football potential data.

When you plot the dots, is this linear?

[Figure: scatterplot with axes labeled Football Skill and Baseball Skill, each scaled from Terrible to Excellent, with one dot per player: Ben, Ed, Frank, Chuck, Al, David, George.]

NO! It is best described by a curved line. It is a curvilinear relationship!
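One rough way to check the plot numerically, sketched under two loud assumptions: the seven ratings are coded 1 (Terrible) through 7 (Excellent), and each player is paired with the ratings in the order the three lists appear on the slide. The helper `direction_changes` is hypothetical. It counts how often the Y values switch between rising and falling as X goes from low to high. At least one switch signals a curvilinear pattern; zero switches does not by itself prove linearity (the age and height example rises and then goes flat).

```python
RATING = {"Terrible": 1, "Very Poor": 2, "Poor": 3, "Average": 4,
          "Good": 5, "Very Good": 6, "Excellent": 7}

# (player, baseball skill, football potential); the pairing order is an assumption.
players = [
    ("David", "Terrible", "Terrible"),
    ("Ben", "Very Poor", "Average"),
    ("Ed", "Poor", "Average"),
    ("Frank", "Average", "Very Good"),
    ("Chuck", "Good", "Excellent"),
    ("Al", "Very Good", "Good"),
    ("George", "Excellent", "Poor"),
]

def direction_changes(pairs):
    """Count rise/fall switches in Y as X goes from low to high."""
    ys = [y for _, y in sorted(pairs)]
    steps = [b - a for a, b in zip(ys, ys[1:]) if b != a]  # drop flat steps
    return sum(1 for s, t in zip(steps, steps[1:]) if (s > 0) != (t > 0))

pairs = [(RATING[b], RATING[f]) for _, b, f in players]
changes = direction_changes(pairs)
print(changes)  # 1: football potential rises, then falls
```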

After you know a correlation is linear, there are two other questions: the direction and the strength of the correlation. But first, a definition of high and low scores.

Definition of high and low scores:

High scores are scores above the mean of each variable. They are represented by positive t scores.

Low scores are scores below the mean of each variable. They are represented by negative t scores.

Positive relationships

In a positive relationship, as X scores gradually increase, Y scores tend to increase as well.

Example: The longer a sailboat is, the more it tends to cost. As length goes up, price tends to go up.

In a positive correlation, X and Y scores tend to be on the same side of their respective means.

As a result, the tX and tY scores tend to be similar and the difference between them (tX – tY) tends to be small.

Since $(t_X - t_Y)$ is small, the squared difference between them, $(t_X - t_Y)^2$, also tends to be small.

Graphing a positive relationship.

In a positive correlation, high scores on X tend to go with high scores on Y. On a graph, as the line runs from left to right, scores increase on the X axis. At the same time, Y scores also generally get higher. So, the line will tend to rise as it runs.

Remember from math, slope equals how far a line rises on the Y axis for each unit it moves from left to right or “runs” along the X axis.

If a line rises from left to right, “rise” is positive. Run is always positive. So a positive rise divided by an (always) positive run results in a positive slope. (That’s why we call it a “positive” correlation.)

Positive vs Negative scatterplot

[Figure: scatterplot on axes running from -3 to 3, with one pattern labeled “Negative relationship” and the other labeled “Positive relationship”.]

Graphic display of a strong POSITIVE correlation.

[Figure: scatterplot on axes running from -3 to 3.]

Negative relationships

In a negative relationship, as X scores gradually increase, Y scores tend to decrease.

Example: The more years a sailboat is used, the less it tends to cost. As use goes up, price tends to go down.

In a negative correlation, X and Y scores tend to be on opposite sides of their respective means.

As a result, the tX and tY scores tend to be dissimilar and the difference between them (tX – tY) tends to be large.

Since $(t_X - t_Y)$ is large, the squared difference between them, $(t_X - t_Y)^2$, also tends to be large.

Graphing a negative relationship

In a negative correlation, high scores on X tend to go with low scores on Y. On a graph, as the line runs from left to right, scores increase on the X axis. At the same time, Y scores get lower. So, the line will tend to fall as it runs.

Remember from math, slope equals how far a line rises on the Y axis for each unit it moves from left to right or “runs” along the X axis.

If a line falls from left to right, “rise” is negative. Run is always positive. So a negative rise divided by an (always) positive run results in a negative slope. (That’s why we call it a “negative” correlation.)

Positive vs Negative scatterplot

[Figure: scatterplot on axes running from -3 to 3, with one pattern labeled “Negative relationship” and the other labeled “Positive relationship”.]

Summary:

When t scores are consistently more similar than different, we have a positive correlation. On a graph the dots will rise from your left to your right.

When t scores are consistently more different than similar, we have a negative correlation. On a graph the dots will fall from your left to your right.
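To see this summary in numbers, here is a small sketch. The values are made up to behave roughly like t scores and are not data from the chapter, and the helper name `mean_squared_difference` is hypothetical. The divisor (number of pairs minus one) is the one the chapter's formula for r uses.

```python
def mean_squared_difference(tx, ty):
    """Sum of squared (tX - tY) differences divided by (number of pairs - 1)."""
    total = sum((a - b) ** 2 for a, b in zip(tx, ty))
    return total / (len(tx) - 1)

tx = [-1.2, -0.5, 0.0, 0.5, 1.2]

similar = [-1.0, -0.6, 0.1, 0.4, 1.1]   # mostly on the same side of the mean as tX
opposite = [-v for v in similar]        # same sizes, signs flipped

small = mean_squared_difference(tx, similar)
large = mean_squared_difference(tx, opposite)
print(f"similar pairs:  {small:.3f}")
print(f"opposite pairs: {large:.3f}")
```

The similar pairs give a value near zero, while the sign-flipped pairs give a much larger one, which is the contrast between a positive and a negative correlation.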

Positive vs Negative scatterplot

[Figure: scatterplot on axes running from -3 to 3, with one pattern labeled “Negative relationship” and the other labeled “Positive relationship”.]

How strong is the relationship between the tX and tY scores?

Here the question is about the consistency with which tX and tY scores are either similar or dissimilar.

t scores: sign and size

There are two aspects to the consistency of the relationship between tX and tY scores. First, are the t scores consistently of the same sign (positive correlation) or of opposite signs (negative correlation)?

If they are almost always one way or the other, you have at least a moderately strong relationship.

On the other hand, if you sometimes see t scores on the same side of the mean and sometimes on opposite sides, you have a relatively weak correlation.

t scores: sign and size

If there is a consistent pattern of same signed t scores (positive correlation) or a consistent pattern of opposite signed t scores (negative correlation), then whether the tX and tY scores are about the same distance from the mean comes into play.

The large majority of t scores (usually well over 95%) range from –2.50 to +2.50.

Given a consistent positive or negative correlation, the more similar in size the t scores, the stronger the correlation.
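The first aspect, sign consistency, can be sketched as a simple proportion. This is only an illustration: `same_sign_fraction` is a hypothetical helper, the t-score lists are invented, and skipping pairs that sit exactly at the mean is a choice made here, not a rule from the chapter.

```python
def same_sign_fraction(tx, ty):
    """Fraction of pairs whose tX and tY fall on the same side of the mean."""
    # Pairs where either score sits exactly at the mean (t = 0) are skipped.
    signed = [(a, b) for a, b in zip(tx, ty) if a != 0 and b != 0]
    same = sum(1 for a, b in signed if (a > 0) == (b > 0))
    return same / len(signed)

tx = [-1.5, -0.8, -0.2, 0.3, 0.9, 1.3]
consistent = [-1.1, -0.9, -0.4, 0.2, 1.0, 1.2]   # always the same sign as tX
mixed = [0.7, -1.2, 0.4, 0.3, 1.1, -0.7]         # same sign only half the time

print(same_sign_fraction(tx, consistent))  # 1.0
print(same_sign_fraction(tx, mixed))       # 0.5
```

A fraction near 1 points toward a positive correlation, a fraction near 0 toward a negative one, and a fraction near one half toward a weak or independent relationship. Size similarity still has to be checked separately.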

Positive correlations:

Perfect: tX and tY scores are all the same sign and are identical in size.

Strong: tX and tY scores are almost all the same sign and are fairly similar in size.

Moderate: tX and tY scores are predominantly the same sign. This is especially true for pairs in which one of the values is one or more standard deviations from the mean. Size may be fairly dissimilar.

Weak: tX and tY scores are a little more often the same sign than opposite in sign. Nothing can be said about size.

Negative correlations:

Perfect: tX and tY scores are all of the opposite sign and are identical in size.

Strong: tX and tY scores are almost all of opposite sign and are fairly similar in size.

Moderate: tX and tY scores are predominantly opposite in sign. This is especially true for pairs in which one of the values is one or more standard deviations from the mean. Size may be fairly dissimilar.

Weak: tX and tY scores are a little more often of opposite signs than the same in sign. Nothing can be said about size.

Unrelated (independent) variables

When the size and sign of the tX scores bear no relationship to the size and sign of the tY scores, the variables are unrelated.

We also can call the variables “independent of” or “orthogonal to” each other. The three terms, unrelated, independent and orthogonal are synonymous in this context.

Graphing it on t axes: The strength of a relationship tells us approximately how the dots representing pairs of t scores will fall around a best fitting line.

Perfect - scores fall exactly on a straight line whose slope will be +1.00 or –1.00.

Strong - most scores fall near the line whose slope will be close to +.750 or -.750.

Moderate - some are near the line, some not. The slope of the line will be close to +.500 or -.500.

Graphing it on t axes: The strength of a relationship tells us approximately how the dots representing pairs of t scores will fall around a best fitting line.

Weak – some scores fall fairly close to the line, but others fall quite far from it. The slope of the line will be close to +.250 or -.250

Independent - the scores are not close to the line and form a circular or square pattern. The best fitting line will be the X axis, a line with a slope of 0.000.
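The slopes listed above line up with benchmark values of r because, when both variables are plotted on t axes, the least-squares slope equals r. That is a standard result rather than something these slides state, so treat the sketch below as a check under that assumption. The helper names are hypothetical; the X and Y values are the ones used in the worked example later in this chapter.

```python
from statistics import mean, stdev

def t_scores(raw):
    """Convert raw scores to t scores using the sample mean and sample SD."""
    m, s = mean(raw), stdev(raw)
    return [(v - m) / s for v in raw]

def least_squares_slope(xs, ys):
    """Slope of the least-squares line predicting ys from xs."""
    mx, my = mean(xs), mean(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sum((x - mx) ** 2 for x in xs)
    return num / den

x = [2, 4, 6, 8, 10]
y = [9, 11, 10, 12, 13]

slope = least_squares_slope(t_scores(x), t_scores(y))
print(round(slope, 3))  # 0.9
```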

Strength of a relationship

[Figures: four scatterplots labeled Perfect (axes from -1.5 to 1.5), Very Strong, Moderate, and Independent (axes from -3 to 3).]

What is this relationship?

[Figures: four unlabeled scatterplots on axes running from -3 to 3, each presented as a practice question.]

Computing the correlation coefficient.

Comparing apples to oranges? Use Z or t scores!

You can use correlation to look for the relationship between ANY two values that you can measure on a single subject.

However, there may not be any relationship; the variables may be independent.

A correlation tells us if scores are consistently similar on two measures, consistently different from each other, or have no real pattern.

Comparing apples to oranges? Use t scores!

To compare scores on two different variables, you transform them into ZX and ZY scores if you are studying a population or tX and tY scores if you have a sample.

ZX and ZY scores (or tX and tY scores) can be directly compared to each other to see whether they are consistently similar, consistently quite different, or show no consistent pattern of similarity or difference.
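The only computational difference between the two transformations is which standard deviation goes in the denominator. The sketch below assumes Python's `statistics` module, where `pstdev` divides by N (population) and `stdev` divides by n - 1 (sample). The function names and both data lists are invented for illustration.

```python
from statistics import mean, pstdev, stdev

def z_scores(population):
    """Z scores: deviations from mu divided by sigma (population SD)."""
    mu, sigma = mean(population), pstdev(population)
    return [(v - mu) / sigma for v in population]

def t_scores(sample):
    """t scores: deviations from the sample mean divided by s (sample SD)."""
    m, s = mean(sample), stdev(sample)
    return [(v - m) / s for v in sample]

# Two hypothetical variables on very different scales.
heart_rate = [62, 70, 75, 81, 92]          # beats per minute
hours_driving = [0.5, 1.0, 1.0, 2.0, 3.5]  # hours

# Once standardized, each person's pair of scores is directly comparable.
for hr, hd in zip(t_scores(heart_rate), t_scores(hours_driving)):
    print(f"t(heart rate)={hr:+.2f}  t(hours driving)={hd:+.2f}")
```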

Comparing variables

Anxiety symptoms, e.g., heartbeat, with number of hours driving to class.

Hat size with drawing ability.

Math ability with verbal ability.

Number of children with IQ.

Turn them all into Z or t scores.

Pearson’s Correlation Coefficient

coefficient - noun, a number that serves as a measure of some property.

The correlation coefficient indexes BOTH the consistency and direction of a correlation with a single number.

Pearson’s rho

Pearson’s rho (ρ) is the parameter that characterizes the strength and direction of a linear relationship (and only a linear relationship) between two variables. To compute rho, you must have the entire population. Then you can compute sigma, mu, Z scores and rho.

The formula: $\rho = 1 - \frac{1}{2} \cdot \frac{\sum (Z_X - Z_Y)^2}{N_P}$, where $N_P$ is the number of pairs of Z scores in the population.

In English: The correlation coefficient equals 1 minus half the average squared distance between the Z scores.
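A minimal sketch of that sentence as code, treating five invented pairs as if they were an entire population. The function name `pearson_rho` and the data are assumptions for illustration; `pstdev` is the population standard deviation (sigma).

```python
from statistics import mean, pstdev

def pearson_rho(xs, ys):
    """rho = 1 - (1/2) * average squared difference between the Z scores."""
    mu_x, sigma_x = mean(xs), pstdev(xs)
    mu_y, sigma_y = mean(ys), pstdev(ys)
    zx = [(x - mu_x) / sigma_x for x in xs]
    zy = [(y - mu_y) / sigma_y for y in ys]
    avg_sq_dist = sum((a - b) ** 2 for a, b in zip(zx, zy)) / len(xs)
    return 1 - 0.5 * avg_sq_dist

# Hypothetical population of five (X, Y) pairs.
x = [1, 2, 3, 4, 5]
y = [2, 1, 4, 3, 5]

rho = pearson_rho(x, y)
print(round(rho, 3))  # 0.8
```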

Pearson’s rho

When you have a perfect positive correlation, the Z scores will be identical in size and sign. So the average squared distance will be zero and rho = 1.000 - 1/2(0.000) = 1.000.

When you have a perfect negative correlation, the Z scores will be identical in size and opposite in sign. It can be proven algebraically that the average squared distance in that case will be 4.000: rho = 1.000 - 1/2(4.000) = -1.000.

When you have two totally independent variables, the average squared distance will be 2.000 (halfway between 0.000 and 4.000). Thus, rho = 1.000 - 1/2(2.000) = 0.000.
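The values 0, 2, and 4 can be checked with a short derivation. It assumes the conventional product-moment definition $\rho = \frac{1}{N_P}\sum Z_X Z_Y$, which these slides do not state, and uses the fact that the mean of squared Z scores is 1.

```latex
\begin{aligned}
\frac{1}{N_P}\sum (Z_X - Z_Y)^2
  &= \frac{1}{N_P}\sum Z_X^2 + \frac{1}{N_P}\sum Z_Y^2
     - \frac{2}{N_P}\sum Z_X Z_Y \\
  &= 1 + 1 - 2\rho \;=\; 2(1 - \rho) \\[4pt]
\rho = +1 &\;\Rightarrow\; 2(1 - 1) = 0, \qquad
\rho = 0 \;\Rightarrow\; 2(1 - 0) = 2, \qquad
\rho = -1 \;\Rightarrow\; 2(1 + 1) = 4
\end{aligned}
```

Solving the second line for rho gives back the chapter's formula: rho equals 1 minus half the average squared distance.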

Pearson’s Correlation Coefficient

Thus, rho varies from -1.000 (perfect negative correlation) to 0.000 (independent variables) to +1.000 (perfect positive correlation).

A negative value indicates a negative relationship; a positive value indicates a positive relationship.

Values of rho close to 1.000 or -1.000 indicate a strong (consistent) relationship; values close to 0.000 indicate a weak (inconsistent) or independent relationship.

Estimating rho with r

Computing rho involves finding the actual average squared distance between the ZX and ZY scores in the whole population.

In computing r, we are estimating rho.

The formula for r

Pearson’s r is a least squares, unbiased estimate of rho, based on the relationships found between tX and tY scores in a random sample.

$r = 1 - \frac{1}{2} \cdot \frac{\sum (t_X - t_Y)^2}{n_P - 1}$, where $n_P - 1$ equals one less than the number of pairs of t scores in the sample.

In English: Pearson’s r equals 1.000 minus half the estimated average squared difference between the Z scores in the population, based on squared differences between the t scores in the sample.

Look at those formulae again.

$\rho = 1 - \frac{1}{2} \cdot \frac{\sum (Z_X - Z_Y)^2}{N_P}$, where $N_P$ is the number of pairs of Z scores in the population.

$\frac{\sum (Z_X - Z_Y)^2}{N_P}$ is the average squared distance between the Z scores.

The rest of the formula simply transforms the average squared distance between the Z scores into a variable that goes from +1.000 to –1.000.

Look at those formulae again.

$r = 1 - \frac{1}{2} \cdot \frac{\sum (t_X - t_Y)^2}{n_P - 1}$, where $n_P - 1$ equals one less than the number of pairs of t scores in the sample.

REMEMBER, t scores are estimated Z scores.

$\frac{\sum (t_X - t_Y)^2}{n_P - 1}$ is a least squares, unbiased estimate of the average squared difference between the Z scores in the population, based on the differences between the tX and tY scores in a random sample.

The rest of the formula simply transforms the estimated average squared distance between the Z scores into a variable that goes from +1.000 to –1.000.

Thus, r, the least squares, unbiased estimate of rho, is basically an estimate of the average squared difference between the ZX and ZY scores in the population, transformed into a variable that goes from -1.00 to +1.00.

Similarities of r and rho

r and rho vary from -1.000 to +1.000.

For both r and rho, a negative value indicates a negative relationship; a positive value indicates a positive relationship.

Values of r or rho close to 1.000 or -1.000 indicate a strong (consistent) relationship; values close to 0.000 indicate a weak (inconsistent) or independent relationship.

Since we almost always are studying random samples, not populations, we almost always compute Pearson’s r, not Pearson’s rho.

r, strength and direction

Strength and direction    r
Perfect, positive         +1.00
Strong, positive          + .75
Moderate, positive        + .50
Weak, positive            + .25
Independent                 .00
Weak, negative            - .25
Moderate, negative        - .50
Strong, negative          - .75
Perfect, negative         -1.00
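One way to turn these benchmarks into a lookup, sketched with an assumed rule: "Perfect" is reserved for exactly +1 or -1, and any other r gets the label of the nearest benchmark. The chapter gives only the benchmark values, not a rule for values in between, so the function `describe` and its nearest-value rule are illustrative only.

```python
# Benchmark sizes and labels from the table; "Perfect" is handled separately.
BENCHMARKS = [(0.75, "Strong"), (0.50, "Moderate"),
              (0.25, "Weak"), (0.00, "Independent")]

def describe(r):
    """Label r by strength and direction using the nearest benchmark (assumed rule)."""
    if not -1.0 <= r <= 1.0:
        raise ValueError("r must lie between -1 and +1")
    if abs(r) == 1.0:
        label = "Perfect"
    else:
        _, label = min(BENCHMARKS, key=lambda b: abs(abs(r) - b[0]))
    if label == "Independent":
        return label
    return f"{label}, {'positive' if r > 0 else 'negative'}"

print(describe(0.90))   # Strong, positive
print(describe(-0.45))  # Moderate, negative
print(describe(0.05))   # Independent
```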

Calculating Pearson’s r

Select a random sample from a population; obtain scores on two variables, which we will call X and Y.

Convert all the scores into t scores.

Calculating Pearson’s r

First, subtract the tY score from the tX score in each pair.

Then square all of the differences and add them up, that is, compute $\sum (t_X - t_Y)^2$.

Calculating Pearson’s r

Estimate the average squared distance between ZX and ZY by dividing the sum of squared differences between the t scores by (nP - 1):

$\frac{\sum (t_X - t_Y)^2}{n_P - 1}$

To turn this estimate into Pearson’s r, use the formula

$r = 1 - \frac{1}{2} \cdot \frac{\sum (t_X - t_Y)^2}{n_P - 1}$
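The calculation steps can be collected into one function. This is a sketch under the chapter's definitions, with a hypothetical function name (`pearson_r`) and input checks that are additions made here. It uses `statistics.stdev`, which divides by n - 1.

```python
from statistics import mean, stdev

def pearson_r(xs, ys):
    """Pearson's r via r = 1 - (1/2) * sum((tX - tY)^2) / (nP - 1)."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need at least two (X, Y) pairs of equal length")
    mx, sx = mean(xs), stdev(xs)
    my, sy = mean(ys), stdev(ys)
    tx = [(x - mx) / sx for x in xs]   # convert X scores to t scores
    ty = [(y - my) / sy for y in ys]   # convert Y scores to t scores
    # Subtract tY from tX in each pair, square, and add up.
    sum_sq = sum((a - b) ** 2 for a, b in zip(tx, ty))
    # Divide by (nP - 1) to estimate the average squared distance.
    avg_sq = sum_sq / (len(xs) - 1)
    # Transform the estimate into r.
    return 1 - 0.5 * avg_sq

# Two made-up extremes: a perfectly rising and a perfectly falling line.
print(round(pearson_r([1, 2, 3, 4], [10, 20, 30, 40]), 3))  # 1.0
print(round(pearson_r([1, 2, 3, 4], [40, 30, 20, 10]), 3))  # -1.0
```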

Example: Calculate t scores for X

X     $X - \bar{X}$    $(X - \bar{X})^2$    $t_X = (X - \bar{X}) / s_X$
2     -4               16                   -1.26
4     -2                4                   -0.63
6      0                0                    0.00
8      2                4                    0.63
10     4               16                    1.26

$\sum X = 30$, N = 5, $\bar{X} = 6.00$

SSW = $\sum (X - \bar{X})^2$ = 40.00

MSW = 40.00 / (5 - 1) = 10.00

sX = 3.16

Calculate t scores for Y

Y     $Y - \bar{Y}$    $(Y - \bar{Y})^2$    $t_Y = (Y - \bar{Y}) / s_Y$
9     -2                4                   -1.26
11     0                0                    0.00
10    -1                1                   -0.63
12    +1                1                    0.63
13    +2                4                    1.26

$\sum Y = 55$, N = 5, $\bar{Y} = 11.00$

SSW = $\sum (Y - \bar{Y})^2$ = 10.00

MSW = 10.00 / (5 - 1) = 2.50

sY = 1.58

Calculate r

tX       tY       $t_X - t_Y$    $(t_X - t_Y)^2$
-1.26    -1.26     0.00          0.00
-0.63     0.00    -0.63          0.40
 0.00    -0.63     0.63          0.40
 0.63     0.63     0.00          0.00
 1.26     1.26     0.00          0.00

$\sum (t_X - t_Y)^2 = 0.80$

$\sum (t_X - t_Y)^2 / (n_P - 1) = 0.80 / 4 = 0.200$

$r = 1.000 - \frac{1}{2} \cdot \frac{\sum (t_X - t_Y)^2}{n_P - 1}$

r = 1.000 - (1/2 * .200) = 1 - .100 = .900

This is a very strong, positive relationship.
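As a cross-check on this result, the sketch below recomputes r for the same five pairs with the deviation-score form of Pearson's r (sum of cross-products divided by the square root of the product of the two sums of squares). That form is a standard equivalent that these slides do not use, and the function name `r_from_deviations` is hypothetical.

```python
from math import sqrt
from statistics import mean

x = [2, 4, 6, 8, 10]
y = [9, 11, 10, 12, 13]

def r_from_deviations(xs, ys):
    """Pearson's r from raw deviations, without converting to t scores."""
    mx, my = mean(xs), mean(ys)
    sp = sum((a - mx) * (b - my) for a, b in zip(xs, ys))  # sum of cross-products
    ss_x = sum((a - mx) ** 2 for a in xs)                  # 40 for these X scores
    ss_y = sum((b - my) ** 2 for b in ys)                  # 10 for these Y scores
    return sp / sqrt(ss_x * ss_y)

r = r_from_deviations(x, y)
print(round(r, 3))  # 0.9
```

Here the cross-products sum to 18 and the square root of 40 times 10 is 20, so both routes give r = .900.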

By the way - True graphs.

Ch. 7 has true graphs, displays in which each dot stands for a score on two (in this case) or more (in more advanced cases) variables.

In Ch. 1 through Ch. 6, most of the figures have represented the frequency of scores on a single variable.

Formally, displays of frequencies are figures, but they are not graphs.