Date post: | 27-Dec-2015 |
Upload: | rhoda-lang |
DESCRIBING RELATIONSHIPS …
RELATIONSHIPS BETWEEN ...
Talk to the person next to you. Think of two things that you believe may be related. For example, height and weight are generally related... The taller the person, generally, the more they weigh.
Write your two numerical categories that you believe are related on the board.
DO YOU BELIEVE THERE IS A RELATIONSHIP BETWEEN...
•TIME SPENT STUDYING AND GPA?
•# OF CIGARETTES SMOKED DAILY & LIFE EXPECTANCY
•SALARY AND EDUCATION LEVEL?
•AGE AND HEIGHT?
•AGE OF AUTOMOBILE AND VALUE OF AUTOMOBILE?
RELATIONSHIPS
When we consider (possible) relationships between 2 (numeric) variables, the data is referred to as bi-variate data.
There may or may not exist a relationship/an association between the 2 variables.
Does one variable ‘cause’ the other? Caution!
Does one variable influence the other? Or is the relationship influenced by another variable(s) that we are unaware of?
BIVARIATE DATA
Proceed similarly as uni-variate distributions …
Still graph (use model to describe data; scatter plot; LSRL)
Still look at overall patterns and deviations from those patterns (DOFS; Direction, Outlier(s), Form, Strength; or Trends, Strength, Shape)
Still analyze numerical summary (descriptive statistics)
BIVARIATE DISTRIBUTIONS
Explanatory variable, x, ‘factor,’ may help predict or explain changes in response variable; usually on horizontal axis
Response variable, y, measures an outcome of a study, usually on vertical axis
BIVARIATE DATA DISTRIBUTIONS
For example ... Alcohol (explanatory) and body temperature (response). Generally, the more alcohol consumed, the higher the body temperature. Still use caution with ‘cause.’
Sometimes we don’t have variables that are clearly explanatory and response.
Sometimes there could be two ‘explanatory’ variables.
Examples: Discuss with a partner for 1 minute
EXPLANATORY & RESPONSE OR TWO EXPLANATORY VARIABLES?
ACT Score and SAT Score
Activity level and physical fitness
SAT Math and SAT Verbal Scores
GRAPHICAL MODELS…
Many graphing models display uni-variate data exclusively (review). Discuss for 30 seconds and share out.
Main graphical representation used to display bivariate data (two quantitative variables) is scatterplot.
SCATTERPLOTS
Scatterplots show relationship between two quantitative variables measured on the same individuals
Each individual in data appears as a point (x, y) on the scatterplot.
Plot explanatory variable (if there is one) on horizontal axis. If no distinction between explanatory and response, either can be plotted on horizontal axis.
Label both axes. Scale both axes with uniform intervals (but scales don’t have to match)
LABEL & SCALE SCATTERPLOT
VARIABLES: CLEARLY EXPLANATORY AND RESPONSE?
CREATING & INTERPRETING SCATTERPLOTS
Let’s collect some data
On board, write your height (in inches) and your weight (in pounds)
Input into Minitab (graph, scatterplot)
INTERPRETING SCATTERPLOTS
Look for overall patterns (DOFS) including:
•direction: up or down, + or – association?
•outliers/deviations: individual value(s) that fall outside the overall pattern; there is no outlier rule for bi-variate data, unlike uni-variate data
•form: linear? curved? clusters? gaps?
•strength: how closely do the points follow a clear form? Strong, weak, moderate?
SCATTERPLOTS: NOTE
Might be asked to graph a scatterplot from data
Might need to sketch what’s on Minitab
Doesn’t have to be 100% exactly accurate; do your best
Scaling, labeling: a must!
MEASURING LINEAR ASSOCIATION
Scatterplots (bi-variate data) show direction, outliers/deviation(s), form, strength of relationship between two quantitative variables
Linear relationships are important; common, simple pattern
Linear relationship is strong if points are close to a straight line; weak if scattered about
Other relationships (quadratic, logarithmic, etc.)
HOW STRONG ARE THESE RELATIONSHIPS? WHICH ONE IS STRONGER?
MEASURING LINEAR ASSOCIATION: CORRELATION OR “R”
Eyes are not a good judge
Need to specify just how strong or weak a linear relationship is
Need a numeric measure
Correlation or ‘r’
MEASURING LINEAR ASSOCIATION: CORRELATION OR “R”
• Correlation (r) is a numeric measure of direction and strength of a linear relationship between two quantitative variables
• Correlation (r) is always between -1 and 1: $-1 \le r \le 1$
• Correlation (r) is not resistant (look at formula; based on mean)
• r doesn’t tell us about individual data points, but rather trends in the data
• Never calculate by formula; use Minitab (dependent on having raw data)
MEASURING LINEAR ASSOCIATION: CORRELATION OR “R”
r ≈ 0: weak or no linear relationship
r close to 1: strong positive linear relationship
r close to -1: strong negative linear relationship
GUESS THE CORRELATION
WWW.ROSSMANCHANCE.COM/APPLETS
‘March Madness’ bracket-style Guess the Correlation tournament
Number off; randomly choose numbers to match up head-to-head competition/rounds
Look at a scatterplot, each write down your guess on notecards and reveal at same time
Student who is closest survives until the next round
CAUTION… INTERPRETING CORRELATION
Note: be careful when addressing form in scatterplots
Strong positive linear relationship ► correlation ≈ 1
But
Correlation ≈ 1 does not necessarily mean relationship is linear; always plot data!
R ≈ 0.816 FOR EACH OF THESE
CALCULATING CORRELATION “R”
n, $x_1$, $x_2$, etc., $\bar{x}$, $y_1$, $y_2$, etc., $\bar{y}$, $s_x$, $s_y$, …
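The symbols listed on this slide are the ingredients of the usual sample correlation formula. The formula itself is not preserved in this text, so the following is the standard textbook definition, offered on the assumption that it is what the slide displayed:

```latex
r = \frac{1}{n-1} \sum_{i=1}^{n}
    \left( \frac{x_i - \bar{x}}{s_x} \right)
    \left( \frac{y_i - \bar{y}}{s_y} \right)
```

Each factor is a standardized value (a z-score), which is why r carries no units.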
CALCULATING CORRELATION “R”
Let’s calculate r for our height & weight data and determine how weak or strong the linear relationship is with our data
Stat, regression, fitted line
FACTS ABOUT CORRELATION
Correlation doesn’t care which variable is considered explanatory and which is considered response
Can switch x & y
Still same correlation (r) value
CAUTION! Switching x & y WILL change your scatterplot… just not ‘r’
FACTS ABOUT CORRELATION
r is in standard units, so r doesn’t change if units are changed
If we change from yards to feet, r is not affected
+ r, positive association
- r, negative association
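Two of the facts above (switching x and y, and changing units) can be checked numerically. This is a minimal sketch using only Python's standard library; the `correlation` helper and the yards/seconds data are hypothetical and are not taken from the class data.

```python
from statistics import mean, stdev

def correlation(xs, ys):
    """Sample correlation r: sum of products of z-scores, divided by n - 1."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)
    return sum((x - x_bar) / s_x * (y - y_bar) / s_y
               for x, y in zip(xs, ys)) / (n - 1)

# Hypothetical data: distances in yards and times in seconds.
yards = [10, 20, 30, 40, 50]
seconds = [2.1, 3.9, 6.3, 7.8, 10.4]

r_original = correlation(yards, seconds)
r_swapped = correlation(seconds, yards)   # switch x and y
feet = [3 * y for y in yards]             # change units: yards to feet
r_feet = correlation(feet, seconds)

print(round(r_original, 4), round(r_swapped, 4), round(r_feet, 4))
```

All three printed values agree, because r is built from standardized values and treats the two variables symmetrically.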
FACTS ABOUT CORRELATION
Correlation is always between -1 & 1
Makes no sense for r = 13 or r = -5
r = 0 means no linear relationship
r = 1 or -1 means perfect linear association (all points fall exactly on a line)
FACTS ABOUT CORRELATION
Both variables must be quantitative, numerical. Doesn’t make any sense to discuss r for qualitative or categorical data
Correlation is not resistant (like mean and SD). Be careful using r when outliers are present
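To see what "not resistant" means in practice, here is a small sketch with hypothetical data: six points that lie close to a line, then the same points with one outlier added. The `correlation` helper is written out for illustration and is not Minitab's routine.

```python
from statistics import mean, stdev

def correlation(xs, ys):
    """Sample correlation r: sum of products of z-scores, divided by n - 1."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)
    return sum((x - x_bar) / s_x * (y - y_bar) / s_y
               for x, y in zip(xs, ys)) / (n - 1)

# Hypothetical data: nearly linear.
xs = [1, 2, 3, 4, 5, 6]
ys = [2.1, 3.9, 6.2, 7.8, 10.1, 12.0]
r_clean = correlation(xs, ys)

# Same data plus a single outlier at (7, 1.0).
r_with_outlier = correlation(xs + [7], ys + [1.0])

print(f"without outlier: r = {r_clean:.3f}")
print(f"with outlier:    r = {r_with_outlier:.3f}")
```

One added point is enough to pull r from nearly 1 down to a weak value, which is why the scatterplot should always be checked alongside r.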
FACTS ABOUT CORRELATION
r isn’t enough! … mean, standard deviation, graphical representation
Correlation does not imply causation; e.g., # students who own cell phones and # students passing AP exams
ABSURD EXAMPLES… CORRELATION DOES NOT IMPLY CAUSATION…
Did you know that eating chocolate makes winning a Nobel Prize more likely? The correlation between per capita chocolate consumption and the number of Nobel laureates per 10 million people for 23 selected countries is r = 0.791
Did you know that statistics is causing global warming? As the number of statistics courses offered has grown over the years, so has the average global temperature!
LEAST SQUARES REGRESSION
Last section… scatterplots of two quantitative variables
r measures strength and direction of linear relationship of scatterplot
LEAST SQUARES REGRESSION
BETTER model to summarize overall pattern by drawing a line on scatterplot
Not any line; we want a best-fit line over scatterplot
Least Squares Regression Line (LSRL)
LEAST-SQUARES REGRESSION LINE
LEAST SQUARES REGRESSION (PREDICTS VALUES)
LSRL Model: $\hat{y} = a + bx$
$\hat{y}$ is predicted value of response variable
a is y-intercept of LSRL
b is slope of LSRL; slope is predicted (expected) rate of change
x is explanatory variable
LEAST SQUARES REGRESSION (PREDICTS VALUES)
Often will be asked to interpret slope of LSRL & y-intercept, in context
Caution: Interpret slope of LSRL as the predicted or average change or expected change in the response variable given a unit change in the explanatory variable
NOT change in y for a unit change in x; LSRL is a model; models are not perfect
LSRL: OUR DATA
Go back to whole-class data on height and weight
Now let’s put our LSRL on our scatterplot & determine the equation of the LSRL
Minitab: stat, regression, fitted line plot
LSRL: OUR DATA
Look at graph of our LSRL for our data
Look at our LSRL equation for our data
Our line fits scatterplot well (best fit) but not perfectly
Make some predictions… what if our height was … what if our weight was …
Interpret our y-intercept; does it make sense? Interpretation of our slope?
ANOTHER EXAMPLE… VALUE OF A TRUCK
TRUCK EXAMPLE…
Suppose we were given the LSRL equation for our truck data as
We want to find a more precise estimation of the value if we have driven 100,000 miles. Use the LSRL equation.
Using graph, estimate price if we have driven 40,000 miles. Then use the above LSRL equation to calculate the predicted value of the truck.
AGES & HEIGHTS…
Age (years)    Height (inches)
0              18
1              28
4              40
5              42
8              49
LET’S REVIEW FOR A MOMENT, SHALL WE …
Input into Minitab
Create scatterplot and describe scatterplot (what do we include in a description?)
Calculate r (btw, different from slope; why?), equation of LSRL; interpret equation of LSRL in context; does y-intercept make sense?
Based on this data, make a prediction as to the height of a person at age 25.
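As a cross-check on the Minitab output, the LSRL for the ages and heights table can be computed directly. This is a sketch using the usual least-squares formulas; the `lsrl` helper is a hypothetical name used for illustration.

```python
from statistics import mean

ages = [0, 1, 4, 5, 8]          # years (from the ages & heights table)
heights = [18, 28, 40, 42, 49]  # inches

def lsrl(xs, ys):
    """Return (a, b) for the least-squares line y-hat = a + b*x."""
    x_bar, y_bar = mean(xs), mean(ys)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    b = s_xy / s_xx
    a = y_bar - b * x_bar   # the line passes through (x-bar, y-bar)
    return a, b

a, b = lsrl(ages, heights)
predicted_at_25 = a + b * 25

print(f"y-hat = {a:.2f} + {b:.2f} * age")
print(f"predicted height at age 25: {predicted_at_25:.1f} inches")
```

The fitted line is roughly y-hat = 22.05 + 3.71(age), so the prediction at age 25 is about 115 inches (over 9 feet). Age 25 lies far outside the 0 to 8 year range used to fit the line, which is exactly why the prediction is not believable.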
LSRL: OUR DATA
Extrapolation: Use of a regression line for prediction outside the range of values of the explanatory variable x used to obtain the line. Such predictions are often not accurate.
Friends don’t let friends extrapolate!
CALCULATING THE EQUATION OF THE LSRL: WHAT IF WE DON’T HAVE THE RAW DATA?
We still can calculate the equation for the LSRL, but a little more time consuming
Note: Every LSRL goes through the point $(\bar{x}, \bar{y})$
Formula for slope of LSRL: $b = r \frac{s_y}{s_x}$
LSRL: $\hat{y} = a + bx$
CALCULATING THE EQUATION FOR THE LSRL: WHAT IF WE DON’T HAVE THE RAW DATA?
Equation of LSRL: $\hat{y} = a + bx$
If you do not have raw data, but still need to calculate a LSRL, you will be given:
$\bar{x}$, $\bar{y}$, $s_x$, $s_y$, and $r$
Remember, $(\bar{x}, \bar{y})$ is an ordered pair that is on the graph of the LSRL
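The step from summary statistics to the full equation can be written out as follows. This is the standard derivation, assuming the usual notation ($\bar{x}$, $\bar{y}$ for the means, $s_x$, $s_y$ for the standard deviations, $r$ for the correlation):

```latex
b = r \, \frac{s_y}{s_x},
\qquad
\bar{y} = a + b\,\bar{x}
\;\Longrightarrow\;
a = \bar{y} - b\,\bar{x}
```

So the slope comes from r and the two standard deviations, and the intercept follows from the fact that the line passes through the point of means.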
EXAMPLE: CREATING EQUATION OF LSRL (WITHOUT RAW DATA)
• $\hat{y}$ = a + b (# of beers consumed)
(equation of LSRL in context – better than x & y)
Remember, slope formula of LSRL: $b = r \frac{s_y}{s_x}$
Givens:
Calculate slope for equation of LSRL
EXAMPLE: CREATING EQUATION OF LSRL (WITHOUT RAW DATA)
$\hat{y}$ = a + b (# of beers consumed)
Givens: $r$, $s_x$, $s_y$
So, slope = $b = r \frac{s_y}{s_x}$ = 0.0179
Remember, equations of all LSRLs go through $(\bar{x}, \bar{y})$ … so what’s next?
EXAMPLE: CREATING EQUATION OF LSRL (WITHOUT RAW DATA)
$\hat{y}$ = a + b (# of beers consumed)
Givens:
$\bar{x} = 4.8125$, $\bar{y} = 0.07375$
Substitute $(\bar{x}, \bar{y})$ into equation
EXAMPLE: CREATING EQUATION OF LSRL (WITHOUT RAW DATA)
0.07375 = a + (0.0179)(4.8125) and solve for ‘a’
$\hat{y}$ = a + b (# of beers consumed)
$\hat{y}$ = -0.0123 + 0.0179 (# of beers consumed)
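The two steps of this example (slope from summary statistics, then intercept from the point of means) can be sketched in a few lines. The function names are hypothetical. The slope and the point (4.8125, 0.07375) come from the slides; the values passed to `slope_from_summary` are made up purely to show the call, since the example's r, s_x and s_y are not reproduced here.

```python
def slope_from_summary(r, s_x, s_y):
    """Slope of the LSRL from summary statistics: b = r * s_y / s_x."""
    return r * s_y / s_x

def intercept_from_point(b, x_bar, y_bar):
    """Intercept of the LSRL, using that the line passes through (x-bar, y-bar)."""
    return y_bar - b * x_bar

# Values from the slides: rounded slope and the point of means.
b = 0.0179
x_bar, y_bar = 4.8125, 0.07375
a = intercept_from_point(b, x_bar, y_bar)
print(f"a = {a:.4f}")

# Hypothetical summary statistics, only to illustrate the slope step.
b_demo = slope_from_summary(r=0.5, s_x=2.0, s_y=4.0)
print(b_demo)
```

With the rounded slope 0.0179 the intercept comes out near -0.0124, which agrees with the -0.0123 on the slide up to rounding of the slope.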
INTERPRETING SOFTWARE OUTPUT…
Age vs. Gesell Score
DETOUR… MEMORY MONDAY (OR WAY-BACK WEDNESDAY)…
What is r? What is r’s range?
r tells us how linear (and direction) scatterplot is. ‘r’ ranges from -1 to 1. ‘r’ describes the scatterplot only (not LSRL)
NOW…
We need a numerical measurement that tells us how well the LSRL fits
Coefficient of Determination, or $r^2$
COEFFICIENT OF DETERMINATION …
Do all the points on the scatterplot fall exactly on the LSRL?
Sometimes too high and sometimes too low
Is LSRL a good model to use for a particular data set?
How well does our model fit our data?
COEFFICIENT OF DETERMINATION OR $r^2$
“R-sq” software output
Always $0 \le r^2 \le 1$
Never calculate by hand; always use Minitab
No need to memorize formula; trust me … it’s ugly
COEFFICIENT OF DETERMINATION OR $r^2$
Remember r: correlation, direction and strength of linear relationship of scatterplot
$r^2$: coefficient of determination, fraction of the variation in the values of y that is explained by LSRL; describes the LSRL
COEFFICIENT OF DETERMINATION OR $r^2$
Interpretation of $r^2$:
We say, “x% of variation in (y variable) is explained by LSRL relating (y variable) to (x variable).”
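As an illustration of the interpretation sentence, the sketch below computes the coefficient of determination for the ages and heights data from earlier in the deck. It uses the standard definition (1 minus the ratio of residual variation to total variation), not Minitab's output.

```python
from statistics import mean

ages = [0, 1, 4, 5, 8]          # years
heights = [18, 28, 40, 42, 49]  # inches

# Fit the LSRL y-hat = a + b*x.
x_bar, y_bar = mean(ages), mean(heights)
b = (sum((x - x_bar) * (y - y_bar) for x, y in zip(ages, heights))
     / sum((x - x_bar) ** 2 for x in ages))
a = y_bar - b * x_bar

# Variation left over after the fit versus total variation in y.
sse = sum((y - (a + b * x)) ** 2 for x, y in zip(ages, heights))
sst = sum((y - y_bar) ** 2 for y in heights)
r_squared = 1 - sse / sst

print(f"{100 * r_squared:.1f}% of the variation in height "
      f"is explained by the LSRL relating height to age")
```

For these five points the value is about 93.3%.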
FACTS TO REMEMBER ABOUT LSRL
Distinction between explanatory and response variables is essential.
If switched, scatterplot changes and LSRL changes (but what doesn’t change?)
LSRL minimizes the sum of squared distances from data points to line, measured only vertically
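In symbols, the "least squares" criterion is usually stated as follows (standard definition, with $a$ the intercept and $b$ the slope):

```latex
\min_{a,\,b} \; \sum_{i=1}^{n} \bigl( y_i - (a + b\,x_i) \bigr)^2
```

Each term is a squared vertical gap between an observed $y_i$ and the line's prediction at $x_i$. Because only vertical gaps count, switching x and y gives a different line.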
FACTS TO REMEMBER ABOUT LSRL
Close relationship between correlation (r) and slope of LSRL; but r and b are (often) not the same; when would r and b have the same value?
LSRL always passes through $(\bar{x}, \bar{y})$
Don’t have to have raw data to identify the equation of LSRL
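One way to reason about when r and b coincide is to start from the standard slope formula:

```latex
b = r \, \frac{s_y}{s_x}
\quad\Longrightarrow\quad
b = r \;\text{ when }\; s_x = s_y \;\;(\text{or when } r = 0)
```

For example, if both variables are first converted to z-scores, each has standard deviation 1 and the slope of the LSRL equals r.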
FACTS TO REMEMBER ABOUT LSRL
Correlation (r) describes direction and strength of straight-line relationships in scatterplots
Coefficient of determination ($r^2$) is the fraction of variation in values of y explained by LSRL
CORRELATION & REGRESSION WISDOM
Which of the following scatterplots has the highest correlation?
CORRELATION & REGRESSION WISDOM
All r = 0.816; all have same exact LSRL equation
Lesson: Always graph your data! … because correlation and regression describe only linear relationships
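The value r = 0.816 for four different-looking scatterplots matches Anscombe's quartet, a well-known set of four small datasets. Assuming that is what the slide pictured (the text does not name it), the sketch below computes r for each set using a hand-written `correlation` helper.

```python
from statistics import mean, stdev

def correlation(xs, ys):
    """Sample correlation r: sum of products of z-scores, divided by n - 1."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)
    return sum((x - x_bar) / s_x * (y - y_bar) / s_y
               for x, y in zip(xs, ys)) / (n - 1)

# Anscombe's quartet (assumed to be the data behind the slide).
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
x4 = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]
quartet = {
    "I":   (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II":  (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV":  (x4,   [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

rs = {name: correlation(xs, ys) for name, (xs, ys) in quartet.items()}
for name, r in rs.items():
    print(f"set {name}: r = {r:.3f}")
```

All four sets give r of about 0.816, yet only the first looks like a linear cloud when plotted. The number alone cannot tell the four apart.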
CORRELATION & REGRESSION WISDOM
Correlation and regression describe only linear relationships
CORRELATION & REGRESSION WISDOM
Correlation is not causation! Association does not imply causation… want a Nobel Prize? Eat some chocolate! How about Methodist ministers & rum imports?
Year    Methodist Ministers in New England    Cuban Rum Imported to Boston (barrels)
1860    63                                    8,376
1865    48                                    6,506
1870    53                                    7,005
1875    64                                    8,486
1890    85                                    11,265
1900    80                                    10,547
1915    140                                   18,559
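For the ministers and rum table specifically, the correlation can be computed directly. This sketch uses a hand-written `correlation` helper and only the seven rows shown in the table.

```python
from statistics import mean, stdev

def correlation(xs, ys):
    """Sample correlation r: sum of products of z-scores, divided by n - 1."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)
    return sum((x - x_bar) / s_x * (y - y_bar) / s_y
               for x, y in zip(xs, ys)) / (n - 1)

# Rows of the table: 1860, 1865, 1870, 1875, 1890, 1900, 1915.
ministers = [63, 48, 53, 64, 85, 80, 140]
rum_barrels = [8376, 6506, 7005, 8486, 11265, 10547, 18559]

r_ministers_rum = correlation(ministers, rum_barrels)
print(f"r = {r_ministers_rum:.4f}")
```

The result is very close to 1, even though nobody would argue that ministers drive rum imports or the reverse.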
BEWARE OF NONSENSE ASSOCIATIONS…
r = 0.9749, but no economic relationship between these variables
Strong association is due entirely to the fact that both imports & health spending grew rapidly in these years.
Year is the common, underlying variable.
Any two variables that both increase over time will show a strong association.
Doesn’t mean one explains the other or influences the other
CORRELATION & REGRESSION WISDOM
Correlation is not resistant; always plot data and look for unusual trends.
… what if Bill Gates walked into a bar?
CORRELATION & REGRESSION WISDOM
Extrapolation! Don’t do it… ever.
Example: Growth data from children from age 1 month to age 12 years … LSRL
What is the predicted height of a 40-year old?
OUTLIERS & INFLUENTIAL POINTS
All influential points are outliers, but not all outliers are influential points.
OUTLIERS & INFLUENTIAL POINTS
Outlier: observation lies outside overall pattern
Points that are outliers in the ‘y’ direction of scatterplot have large residuals.
Points that are outliers in the ‘x’ direction of scatterplot may not necessarily have large residuals.
OUTLIERS & INFLUENTIAL POINTS
Influential points/observations: If removed would significantly change LSRL (slope and/or y-intercept)
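A quick way to test whether a point is influential is to fit the LSRL with and without it and compare. The sketch below does this with hypothetical data: five points with an upward trend, plus one extra point far out in the x direction. The `lsrl` helper is an illustrative name.

```python
from statistics import mean

def lsrl(xs, ys):
    """Return (a, b) for the least-squares line y-hat = a + b*x."""
    x_bar, y_bar = mean(xs), mean(ys)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    b = s_xy / s_xx
    return y_bar - b * x_bar, b

# Hypothetical data with an upward trend.
xs = [1, 2, 3, 4, 5]
ys = [2, 3, 5, 4, 6]
a_without, b_without = lsrl(xs, ys)

# Add one point far to the right, at (15, 2).
a_with, b_with = lsrl(xs + [15], ys + [2])

print(f"without the point: y-hat = {a_without:.2f} + {b_without:.2f}x")
print(f"with the point:    y-hat = {a_with:.2f} + {b_with:.2f}x")
```

Here the single added point changes the slope from positive to slightly negative, so it would count as influential under the definition above.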
CLASS ACTIVITY…
Groups of 2 or 3; measure each other’s head circumferences & arm spans (both in inches, rounded to the nearest ½ inch). Write data on board
Create scatterplot and describe the association between head circumference and arm span.
Is a regression line appropriate for our data? Why or why not? If so, create LSRL graph & equation, calculate the correlation and the coefficient of determination
Interpret the slope and the y-intercept of the LSRL
What does it mean if a point falls above the LSRL? Below the LSRL?