
Chapter 14 - stat.wvu.eduamnatsak/ch14-15.pdf · Chapter 14 1 Chapter 14 Describing Relationships:...

Date post: 24-Aug-2020
Transcript
Page 1: Chapter 14 - stat.wvu.eduamnatsak/ch14-15.pdf · Chapter 14 1 Chapter 14 Describing Relationships: Scatterplots and Correlation Statistical versus Deterministic Relationships •Distance


Chapter 14

Describing Relationships: Scatterplots and Correlation

Statistical versus Deterministic Relationships

• Distance versus Speed (when travel time is constant).

• Income (in millions of dollars) versus total assets of banks (in billions of dollars).

Distance versus Speed

• Distance = Speed × Time
• Suppose time = 1.5 hours
• Each subject drives a fixed speed for the 1.5 hrs
  – speed chosen for each subject varies from 10 mph to 50 mph
• Distance does not vary for those who drive the same fixed speed
• Deterministic relationship
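
A minimal Python sketch of the deterministic case: once speed and time are fixed, distance is fully determined. The function name and the particular speeds are illustrative choices, not part of the original slides.

```python
# Deterministic relationship: with time fixed, speed alone determines distance.
TIME_HOURS = 1.5  # fixed travel time from the slide


def distance_miles(speed_mph: float, time_hours: float = TIME_HOURS) -> float:
    """Distance = Speed × Time."""
    return speed_mph * time_hours


# Illustrative speeds spanning the 10 to 50 mph range mentioned on the slide.
speeds = [10, 20, 30, 40, 50]
distances = [distance_miles(s) for s in speeds]

for s, d in zip(speeds, distances):
    print(f"{s} mph for {TIME_HOURS} h -> {d} miles")
```

Two subjects who drive at the same speed always get the same distance; there is no scatter around the line.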


Income versus Assets

• Income = a + b × Assets
• Assets vary from 3.4 billion to 49 billion
• Income varies from bank to bank, even among those with similar assets
• Statistical relationship
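
For contrast with the deterministic case, here is a small simulation of a statistical relationship. The coefficients `A` and `B` and the noise level are hypothetical values chosen only for illustration; the slides do not give the fitted values for the bank data.

```python
import random

# Hypothetical values, NOT taken from the bank data in the slides.
A = 5.0          # assumed intercept (millions of dollars)
B = 10.0         # assumed slope (millions of income per billion of assets)
NOISE_SD = 20.0  # assumed bank-to-bank variation (millions of dollars)


def simulated_income(assets_billions: float, rng: random.Random) -> float:
    """Income = a + b × Assets, plus random bank-to-bank variation."""
    return A + B * assets_billions + rng.gauss(0.0, NOISE_SD)


rng = random.Random(0)  # fixed seed so the run is repeatable
same_assets = 20.0      # three banks with identical assets
incomes = [simulated_income(same_assets, rng) for _ in range(3)]

for income in incomes:
    print(f"assets = {same_assets} billion -> income = {income:.1f} million")
```

Banks with the same assets end up with different incomes, which is what separates a statistical relationship from a deterministic one.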

• A scatter plot shows a linear relationship if the points follow, more or less, along a straight line

• Example - heights and weights of 165 students in a college statistics course:

Positive association: High values of one variable tend to occur together with high values of the other variable.

Negative association: High values of one variable tend to occur together with low values of the other variable.


No relationship: x and y vary independently. Knowing x tells you nothing about y.

One way to remember this: the equation for this line is y = 5; x is not involved.

The strength of the relationship between the two variables can be seen by how much variation, or scatter, there is around the main form.

With a strong relationship, you can get a pretty good estimate of y if you know x.

With a weak relationship, for any x you might get a wide range of y values.

Correlation

• measures the strength and direction of a linear relationship between two quantitative variables


• Negative correlation
  – X ↑ Y ↓
  – X ↓ Y ↑
• X, Y behave “oppositely”

• Positive correlation
  – X ↑ Y ↑
  – X ↓ Y ↓
• X, Y behave “similarly”

r

• Pearson correlation coefficient (r) describes the direction and strength of a linear relationship between two variables.

  −1 ≤ r ≤ −0.8     strong negative correlation
  −0.8 < r < −0.2   weak to moderate negative correlation
  −0.2 ≤ r ≤ 0.2    negligible correlation
  0.2 < r < 0.8     weak to moderate positive correlation
  0.8 ≤ r ≤ 1       strong positive correlation
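
The slides do not show how r is computed. The sketch below uses the standard Pearson formula (sum of cross-products divided by the square root of the product of the sums of squares) and then applies the cut-offs listed above. The function names and the sample data are illustrative.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length samples."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length samples with at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("r is undefined when one variable is constant")
    return sxy / math.sqrt(sxx * syy)


def describe(r):
    """Label r using the cut-offs from the slide."""
    if r <= -0.8:
        return "strong negative correlation"
    if r < -0.2:
        return "weak to moderate negative correlation"
    if r <= 0.2:
        return "negligible correlation"
    if r < 0.8:
        return "weak to moderate positive correlation"
    return "strong positive correlation"


# Illustrative data: y falls as x rises.
xs = [1, 2, 3, 4, 5]
ys = [10, 8, 7, 4, 2]
r = pearson_r(xs, ys)
print(f"r = {r:.3f}: {describe(r)}")
```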


Problems with Correlations

• Outliers can inflate or deflate correlations

• Groups combined inappropriately may mask relationships (a third variable)
  – groups may have different relationships when separated
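
To see how a single outlier can inflate a correlation, the sketch below computes r for five made-up points with almost no linear pattern, then again after adding one extreme point. The data are invented for illustration and `pearson_r` is the standard formula, not something taken from the slides.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length samples."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Made-up data with almost no linear pattern.
xs = [1, 2, 3, 4, 5]
ys = [3, 1, 4, 2, 3]
r_without = pearson_r(xs, ys)

# The same data plus one point far from the rest.
r_with = pearson_r(xs + [20], ys + [20])

print(f"without the extreme point: r = {r_without:.2f}")
print(f"with the extreme point:    r = {r_with:.2f}")
```

The extreme point dominates the sums in the formula, so the correlation jumps from negligible to strong even though the other five points have not changed.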

Outliers

Not an outlier: The upper right-hand point here is not an outlier of the relationship; it is what you would expect for this many beers given the linear relationship between beers/weight and blood alcohol.

This point is not in line with the others, so it is an outlier of the relationship.

What does “statistical significance” mean?

• 5. Statistics. Of or relating to observations or occurrences that are too closely correlated to be attributed to chance and therefore indicate a systematic relationship


Strength and Statistical Significance

• A strong relationship seen in the sample may indicate a strong relationship in the population.

• The sample may exhibit a strong relationship simply by chance and the relationship in the population is not strong or is zero.

• The observed relationship is considered to be statistically significant if it is stronger than a large proportion of the relationships we could expect to see just by chance.
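
One way to make "stronger than a large proportion of the relationships we could expect to see just by chance" concrete is a permutation test: shuffle the y values many times, recompute r each time, and count how often chance alone produces a correlation at least as strong as the observed one. This is a sketch of that idea, not necessarily the procedure the course uses; the data, the number of shuffles, and the seed are assumptions.

```python
import math
import random


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length samples."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def permutation_p_value(xs, ys, n_shuffles=2000, seed=0):
    """Estimate how often shuffled data show |r| at least as large as observed."""
    rng = random.Random(seed)
    observed = abs(pearson_r(xs, ys))
    shuffled = list(ys)
    at_least_as_strong = 0
    for _ in range(n_shuffles):
        rng.shuffle(shuffled)  # breaks any real link between x and y
        if abs(pearson_r(xs, shuffled)) >= observed:
            at_least_as_strong += 1
    # The +1 terms count the observed arrangement itself, so the estimate is never 0.
    return (at_least_as_strong + 1) / (n_shuffles + 1)


# Made-up sample with a clear upward trend.
xs = list(range(1, 13))
ys = [2.1, 2.9, 4.2, 4.8, 6.1, 7.2, 7.8, 9.1, 9.9, 11.2, 12.1, 12.8]
p = permutation_p_value(xs, ys)
print(f"observed r = {pearson_r(xs, ys):.3f}, permutation p-value = {p:.4f}")
```

A small p-value means very few shuffled data sets matched the observed strength, so the relationship would be labeled statistically significant.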

Warnings about Statistical Significance

• “Statistical significance” does not imply the relationship is strong enough to be considered “practically important.”

• Even weak relationships may be labeled statistically significant if the sample size is very large.

• Even very strong relationships may not be labeled statistically significant if the sample size is very small.


Chapter 15

Describing Relationships: Regression, Prediction, and Causation


Straight lines (lines: a quick review)

• y = a + bx
• a = y intercept
• b = slope

• Slope = ∆y/∆x = rise/run
• e.g. slope is −2: y decreases 2 units for every one unit increase in x
• y = 3 − 2x

A regression line is a straight line that describes how a response variable y changes as an explanatory variable x changes. We often use a regression line to predict the value of y for a given value of x.


The least-squares regression line is the unique line such that the sum of the squared vertical (y) distances between the data points and the line is the smallest possible.

Distances between the points and line are squared so all are positive values. This is done so that distances can be properly added.

The least-squares regression line can be shown to have this equation:

ŷ = a + bx

• ŷ (y hat) is the predicted y value
• b is the slope
• a is the y-intercept
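
The slides state the form of the least-squares line but not how a and b are found. The sketch below uses the standard closed-form solution (slope = sum of cross-products over sum of squared x deviations, intercept = mean of y minus slope times mean of x) and checks that nudging the line away from the fit increases the sum of squared vertical distances. Function names and data are illustrative.

```python
def least_squares_line(xs, ys):
    """Return (a, b) for the least-squares line y-hat = a + b*x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx            # slope
    a = mean_y - b * mean_x  # y-intercept
    return a, b


def sum_squared_distances(xs, ys, a, b):
    """Sum of squared vertical distances between the points and the line."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))


# Made-up data scattered around an upward-sloping line.
xs = [1, 2, 3, 4, 5]
ys = [2.2, 3.9, 6.1, 8.0, 9.8]

a, b = least_squares_line(xs, ys)
sse_fit = sum_squared_distances(xs, ys, a, b)
sse_nudged = sum_squared_distances(xs, ys, a + 0.5, b - 0.1)

print(f"y-hat = {a:.3f} + {b:.3f}x")
print(f"sum of squares, fitted line: {sse_fit:.4f}")
print(f"sum of squares, nudged line: {sse_nudged:.4f}")
```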

Making predictions

The equation of the least-squares regression line allows you to predict y for any x within the range studied. This is called interpolating.

ŷ = 0.0144x + 0.0008

Nobody in the study drank 6.5 beers, but by finding the value of ŷ from the regression line for x = 6.5, we would expect a blood alcohol content of 0.094 mg/ml.

ŷ = 0.0144 × 6.5 + 0.0008
ŷ = 0.0936 + 0.0008 = 0.0944 mg/ml
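
The interpolation for 6.5 beers can be checked with a few lines of Python. The slope and intercept are the ones given in the slides; the function and constant names are illustrative.

```python
# Coefficients of the regression line from the slides.
SLOPE = 0.0144      # change in blood alcohol (mg/ml) per beer
INTERCEPT = 0.0008  # predicted blood alcohol (mg/ml) at zero beers


def predicted_bac(beers: float) -> float:
    """Predicted blood alcohol content: y-hat = 0.0144x + 0.0008."""
    return SLOPE * beers + INTERCEPT


y_hat = predicted_bac(6.5)
print(f"predicted blood alcohol for 6.5 beers: {y_hat:.4f} mg/ml")
```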


Coefficient of Determination (R²)

• Measures usefulness of regression prediction
• R² (or r², the square of the correlation): measures the percentage of the variation in the values of the response variable (y) that is explained by the regression line
  – r = 1: R² = 1: regression line explains all (100%) of the variation in y
  – r = 0.7: R² = 0.49: regression line explains almost half (50%) of the variation in y
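
As a check on the claim that R² is both the square of the correlation and the share of variation explained by the line, the sketch below computes it both ways for a small made-up data set. It assumes a simple least-squares line with one explanatory variable; the function names and data are illustrative.

```python
import math


def fit_line(xs, ys):
    """Least-squares intercept a and slope b for y-hat = a + b*x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx
    return mean_y - b * mean_x, b


def r_squared_from_variation(xs, ys):
    """1 - (unexplained variation) / (total variation in y)."""
    a, b = fit_line(xs, ys)
    mean_y = sum(ys) / len(ys)
    total = sum((y - mean_y) ** 2 for y in ys)
    unexplained = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    return 1 - unexplained / total


def pearson_r(xs, ys):
    """Pearson correlation coefficient."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Made-up data with a moderate upward trend.
xs = [1, 2, 3, 4, 5]
ys = [2, 1, 4, 3, 6]

r = pearson_r(xs, ys)
print(f"r = {r:.3f}, r squared = {r ** 2:.3f}")
print(f"share of variation explained by the line = {r_squared_from_variation(xs, ys):.3f}")
```

The two printed values agree, which is why the slides use R² and r² interchangeably for a straight-line fit.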

r = −1, r² = 1: Changes in x explain 100% of the variations in y. y can be entirely predicted for any given value of x.

r = 0, r² = 0: Changes in x explain 0% of the variations in y. The value(s) y takes is (are) entirely independent of what value x takes.

r = 0.87, r² = 0.76: Here the change in x only explains 76% of the change in y. The rest of the change in y (the vertical scatter, shown as red arrows) must be explained by something other than x.

Extrapolation is the use of a regression line for predictions outside the range of x values used to obtain the line.

This can be a very stupid thing to do, as seen here.

[Figure: two plots, each with vertical axis "Height in Inches"]


Correlation Does Not Imply Causation

Even very strong correlations may not correspond to a real causal relationship.

Evidence of Causation

• A properly conducted experiment establishes the connection

• Other considerations:
  – A reasonable explanation for a cause and effect exists
  – The connection happens in repeated trials
  – The connection happens under varying conditions
  – Potential confounding factors are ruled out
  – Alleged cause precedes the effect in time

Reasons for relationships between variables

1. Explanatory variable is the direct cause of the response variable
2. The response variable is causing a change in the explanatory variable
3. The explanatory variable is contributing to but not the sole cause of change in the response variable
4. Confounders may exist
5. Both variables result from a common cause
6. Both variables are changing over time
7. The association is coincidence


Association and causation

It appears that lung cancer is associated with smoking.

How do we know that both of these variables are not being affected by an unobserved third (lurking) variable?

For instance, what if there is a genetic predisposition that causes people to both get lung cancer and become addicted to smoking, but the smoking itself doesn’t CAUSE lung cancer?

We can evaluate the association using the following criteria:

1) The association is strong.
2) The association is consistent.
3) Higher doses are associated with stronger responses.
4) The alleged cause precedes the effect.
5) The alleged cause is plausible.

Ch 14 & 15 concepts

• Statistical vs. Deterministic Relationships
• Statistical Significance
• Correlation Coefficient
• Problems with Correlations
• LS Regression Equation
• R²
• Correlation does not imply causation!

