Lecture Notes

Date post: 11-Jan-2023
Upload: uas
STAT II-LAVC-DENARO 1

Lecture Notes

Part I

Introduction to terminology

Data: Who, What, When, Where, and Why?

• Sample Size
How many people are in your sample?

• Data
The values you collect (this could be height, color of walls, count of cars, etc.)

• Experimental Units or Subjects
The individuals or objects under study. These are the people or objects we gather information about.

What is the discipline of Statistics?

Statistics is a way of reasoning, along with a collection of tools and methods, designed to help us better understand the world.

Descriptive statistics: Methods for organizing and summarizing data. You are describing your data.

Inferential statistics: Drawing conclusions about populations based on sample data. Based on your calculations (statistics) you will infer things about the population.

Statistics (the calculations): Particular calculations made from sample data.

VARIABLES
Characteristics of the individuals that you measure. The values can vary from individual to individual.

CATEGORICAL: Places an individual into a group or category.
  NOMINAL: Categories are unordered. Example: ________
  ORDINAL: Categories are ordered. Example: ________

QUANTITATIVE: Takes numerical values.
  DISCRETE: Collection of isolated points on the number line. Example: ________
  CONTINUOUS: Any value in an interval of numbers on the number line. Example: ________

POPULATION: The entire collection of persons, things or objects you wish to study.

SAMPLE: A subset of the population. The sample should be representative of the entire population.

POPULATION PARAMETER: A number or calculation that describes or summarizes a population.

SAMPLE STATISTIC: A number or calculation that describes or summarizes a sample.

[Figure: Population and Sample]

Lecture Notes

Part II

Displaying and describing Categorical Data

EXPLORATORY DATA ANALYSIS (EDA)

Before you start doing a bunch of calculations it is wise to explore your data first.

EDA helps us describe the DISTRIBUTION of a variable:
A DISTRIBUTION TELLS US WHAT VALUES A VARIABLE CAN TAKE AND HOW OFTEN IT TAKES THESE VALUES.

Categorical Variables
  One Variable: Frequency Table, Bar Chart, Pie Chart
  Two Variables: Contingency Table

Quantitative Variables
  One Variable: Histogram, Boxplot, Stemplot
  Two Variables: Scatterplot

DISPLAYING THE DISTRIBUTION OF A CATEGORICAL VARIABLE

BAR CHART
  X AXIS: Category names/labels
  Y AXIS: Frequency or relative frequency. Prefer RF!

PIE CHART
  CIRCLE: Represents the entire data set
  SECTIONS: Represent the category percentages

TABLES:
  FREQUENCY TABLES (one variable)
  CONTINGENCY TABLES (two variables)
  Contents: CATEGORY NAMES with FREQUENCY or RELATIVE FREQUENCY

Ex:

An article in the Winter 2003 issue of Chance magazine reported on the Houston Independent School District’s magnet schools programs. Of the 1755 qualified applicants, 931 were accepted, 298 were waitlisted, and 526 were turned away for lack of space.

1. What is the variable of interest?

2. Find the relative frequency distribution of the decisions made.

3. Make an appropriate display of these data.

4. Interpret your graph.
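
As a check on question 2, here is a minimal Python sketch that turns the three counts from the exercise into relative frequencies. The variable names are illustrative and not part of the notes.

```python
# Counts taken from the magnet-school exercise above.
decisions = {"accepted": 931, "waitlisted": 298, "turned away": 526}

# The sample size is the sum of all category counts.
total = sum(decisions.values())

# Relative frequency = category count / sample size.
relative_frequency = {d: c / total for d, c in decisions.items()}

for decision, rf in relative_frequency.items():
    print(f"{decision:12s} {rf:.3f}")
```

The relative frequencies add up to 1, which is a quick way to confirm no category was left out.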

A CONTINGENCY TABLE shows the JOINT DISTRIBUTION of two variables.

The rows contain the groups of one variable.
The columns contain the categories of the other variable.
O is the observed frequency in each of the cells.
T is the Total.

          Category 1  Category 2  Category 3  Category 4  Total
Group 1   O11         O12         O13         O14         TG1
Group 2   O21         O22         O23         O24         TG2
Total     TC1         TC2         TC3         TC4         n

• JOINT FREQUENCY
The number of units that are in a particular group AND category.

• JOINT PROPORTION
The proportion of units that are in a particular group AND category. Divide the observed frequency by your sample size.

• MARGINAL DISTRIBUTION
Just look at the margins of the contingency table: the distribution of the group variable while ignoring the category variable, or the distribution of the category variable while ignoring the group variable.

The marginal distribution for the category variable is:

Category   Category 1  Category 2  Category 3  Category 4
Frequency  TC1         TC2         TC3         TC4

The marginal distribution for the group variable is:

Group      Group 1  Group 2
Frequency  TG1      TG2

EX: Below is a contingency table of frequencies (counts)

         Freshman  Sophomore  Junior  Senior  Total
Male     20        25         39      12      96
Female   20        42         23      14      99
Total    40        67         62      26      195

What is the joint frequency of being a Sophomore and Female? _______________

What is the joint frequency of being Male and a Junior? _______________

What is the joint frequency of being a Freshman and Female? _______________

What is the joint frequency of being Female and a Freshman? _______________

What is the probability of being a Sophomore and Female? _______________

What is the probability of being a Sophomore and Male? _______________

What is the probability of being Male and a Junior? _______________

What is the probability of being Female and a Junior? _______________

What is the probability of being a Freshman and Female? _______________

What is the probability of being Female and a Freshman? _______________

The marginal distribution for class standing is:

Class Standing  Freshman  Sophomore  Junior  Senior
Rel Freq

What is the probability of being a Sophomore? ____________

What is the probability of being a Freshman? ____________

The marginal distribution for gender:

Gender    Male  Female
Rel Freq

What is the probability of being Male? ____________
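
The joint and marginal calculations in this exercise can be sketched in a few lines of Python. This is one possible way to organize the table, not the method used in class; the dictionary layout and function name are assumptions made for illustration.

```python
# Counts copied from the class-standing contingency table above.
counts = {
    "Male":   {"Freshman": 20, "Sophomore": 25, "Junior": 39, "Senior": 12},
    "Female": {"Freshman": 20, "Sophomore": 42, "Junior": 23, "Senior": 14},
}

# Sample size n is the grand total of all cells.
n = sum(sum(row.values()) for row in counts.values())

def joint_proportion(gender, standing):
    """Joint proportion: observed cell frequency divided by the sample size."""
    return counts[gender][standing] / n

# Marginal distribution for gender: row totals divided by n.
gender_marginal = {g: sum(row.values()) / n for g, row in counts.items()}

# Marginal distribution for class standing: column totals divided by n.
standings = list(counts["Male"])
standing_marginal = {s: sum(counts[g][s] for g in counts) / n for s in standings}

print(n)
print(joint_proportion("Female", "Sophomore"))
print(gender_marginal)
print(standing_marginal)
```

Note that "Sophomore and Female" and "Female and Sophomore" refer to the same cell, so the order of the two labels does not change the answer.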

Lecture Notes

Part III

Displaying and describing Quantitative Data

Displaying the distribution of a quantitative variable

Large Datasets: Histograms, Boxplots
Small Datasets: Stemplots

Histograms
  X axis: Variable of Interest
  Y axis: Frequency or Relative Frequency

Boxplots
  5 Number Summary: Minimum, Q1, Median, Q3, Maximum

Stemplots
  Similar to a histogram, but instead of bars we have the actual values

When we consider two variables at once we will look at a SCATTERPLOT:
  The X axis plots the explanatory variable (independent variable)
  The Y axis plots the response variable (dependent variable)

Describing the Distribution of a QUANTITATIVE VARIABLE: SHAPE, CENTER, SPREAD

SHAPE
  SYMMETRIC: UNIFORM or NORMAL
  SKEWED or has OUTLIERS: POSITIVELY SKEWED (RIGHT SKEWED) or NEGATIVELY SKEWED (LEFT SKEWED)

Other features to consider:
o Gaps: Spaces separating data
o Outliers: Data values that are set far apart from the rest of the body of the distribution

APPROPRIATE MEASURES OF CENTER
  SYMMETRIC DISTRIBUTION: MEAN (average). Not Resistant.
  SKEWED DISTRIBUTION OR DISTN WITH OUTLIERS: MEDIAN (middle number). Resistant.

APPROPRIATE MEASURES OF SPREAD
  SYMMETRIC DISTRIBUTION: Standard Deviation (typical deviation from the mean). Not Resistant.
  SKEWED DISTRIBUTION OR DISTN WITH OUTLIERS: Inter Quartile Range (the middle 50% of your data). Resistant.


Ex: Regular Stem Plot

The Modern Language Association provides listening tests that measure understanding of

spoken French. The range of scores is 0 to 46. Here are the scores of 21 high school

French teachers at the beginning of an intensive summer course in French (in order).

9 15 20 20 22 23 23 30 30 31 31 32 34 34 35 39 40 42 42 45 46

Create a histogram:

Create a stemplot:

Create a boxplot:

What is the shape of the distribution?

What is the best measure of center for this distribution? WHY?

What is the best measure of spread for this distribution? WHY?
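
For the "Create a stemplot" step, the following Python sketch builds a regular stemplot from the 21 French scores, using the tens digit as the stem and the ones digit as the leaf. The function name `stemplot` is hypothetical, and the layout is only one reasonable choice.

```python
from collections import defaultdict

# The 21 listening-test scores from the exercise above.
scores = [9, 15, 20, 20, 22, 23, 23, 30, 30, 31, 31, 32, 34, 34,
          35, 39, 40, 42, 42, 45, 46]

def stemplot(values):
    """Return the lines of a stemplot: tens digit | ones digits in order."""
    leaves = defaultdict(list)
    for v in sorted(values):
        leaves[v // 10].append(v % 10)
    lines = []
    # Include every stem between the smallest and largest, even if empty.
    for stem in range(min(leaves), max(leaves) + 1):
        lines.append(f"{stem} | {''.join(str(d) for d in leaves[stem])}")
    return lines

for line in stemplot(scores):
    print(line)
```

Turning the printout on its side gives the same picture as a histogram with bins of width 10, which helps with the question about shape.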

MEDIAN

The median is not influenced by outliers. It is a RESISTANT measure of center.
Appropriate measure of CENTER for SKEWED DISTRIBUTIONS.

Find the MEDIAN by hand
1st: Put all the data in order from smallest to largest.
2nd: Find the middle position: $\frac{n+1}{2}$

Note: For an odd number of values, the single middle value is the median.
Ex: 10, 11, 15, 19, 20

Note: For an even number of values, we take the average of the two middle values.
Ex: 10, 11, 15, 19, 20, 80
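
The two by-hand steps can be written out as a short Python function. This is a sketch for checking answers, with a hypothetical function name; it follows the (n + 1)/2 middle-position rule from the slide.

```python
def median_by_hand(values):
    """Median via the two steps on the slide: sort, then find the middle position."""
    ordered = sorted(values)          # 1st: smallest to largest
    n = len(ordered)
    middle = (n + 1) / 2              # 2nd: 1-based middle position
    if n % 2 == 1:
        # Odd n: the middle position is a whole number, so take that value.
        return ordered[int(middle) - 1]
    # Even n: the middle position falls between two values, so average them.
    lower = ordered[n // 2 - 1]
    upper = ordered[n // 2]
    return (lower + upper) / 2

print(median_by_hand([10, 11, 15, 19, 20]))
print(median_by_hand([10, 11, 15, 19, 20, 80]))
```

For the two example data sets on the slide this gives 15 and 17: adding the extreme value 80 moves the median only slightly.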

QUARTILES
Split up the distribution into quarters.

First Quartile Q1 (25th percentile): 25% of data below this observation, ___% of data above this observation

Second Quartile Q2 (50th percentile): 50% of data below this observation, ___% of data above this observation

Third Quartile Q3 (75th percentile): 75% of data below this observation, ___% of data above this observation

IQR (Inter Quartile Range) = Q3 – Q1
Measures spread for skewed distributions.

The range is quick & easy, but not very informative.


Ex: Below are the numbers of deaths from tornadoes in the U.S. from 1990 through 2000.

53 39 39 33 69 30 25 67 130 94 40

a) Calculate the median.

b) Calculate the Interquartile range (IQR)
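
The tornado exercise can be checked with the sketch below. It assumes the common classroom convention that Q1 and Q3 are the medians of the lower and upper halves of the sorted data, leaving out the overall median when n is odd. Calculators and software sometimes use other quartile conventions, so small differences are possible.

```python
from statistics import median

# Tornado deaths, 1990 through 2000, from the exercise above.
deaths = [53, 39, 39, 33, 69, 30, 25, 67, 130, 94, 40]

def quartiles(values):
    """Q1, median, Q3 using the median-of-halves convention (an assumption)."""
    ordered = sorted(values)
    half = len(ordered) // 2
    lower = ordered[:half]      # values below the middle position
    upper = ordered[-half:]     # values above the middle position
    return median(lower), median(ordered), median(upper)

q1, q2, q3 = quartiles(deaths)
iqr = q3 - q1
print(q1, q2, q3, iqr)
```

Under this convention the median is 40 and the IQR is 69 - 33 = 36.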


Outliers:

Step 1: Determine Quartile 1 and Quartile 3 from the data set.

Step 2: Compute the Interquartile Range (IQR) IQR= Q3-Q1

Step 3: Determine the Fences. Fences serve as cutoff points for determining outliers

Lower Fence = Q1- 1.5(IQR)

Upper Fence= Q3 + 1.5(IQR)

Step 4:

If observation < lower fence then it is an OUTLIER

If observation > upper fence then it is an OUTLIER

If lower fence < observation < upper fence it is NOT an OUTLIER

Outliers

What should be done with outliers? First try to understand them in the context of the data. A

histogram can show how the outlier fits with the rest of the data:

• Is there a large gap between the outlier and the rest of the data?

• Is the outlier a value at the end of a stretched out tail?

• Could the outlier be an error?

What you should NOT do:

• Leave an outlier in place without comment, and proceed as if nothing were unusual.

• Drop an outlier without comment just because it is unusual.


Ex: The data set below is the case prices (in dollars) of wines produced by a vineyard in

Napa Valley.

150 135 90 122 128 67 142 140 128 132 127 140 129
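
Assuming the wine-price example is meant as practice for the four outlier steps, here is a sketch of the fence calculation in Python. The quartiles use the median-of-halves convention, which is an assumption; other conventions can move the fences slightly.

```python
from statistics import median

# Case prices (in dollars) from the vineyard example above.
prices = [150, 135, 90, 122, 128, 67, 142, 140, 128, 132, 127, 140, 129]

def fences(values):
    """Lower and upper fences: Q1 - 1.5(IQR) and Q3 + 1.5(IQR)."""
    ordered = sorted(values)
    half = len(ordered) // 2
    q1 = median(ordered[:half])     # Step 1: quartiles
    q3 = median(ordered[-half:])
    iqr = q3 - q1                   # Step 2: IQR = Q3 - Q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr   # Step 3: fences

lower, upper = fences(prices)

# Step 4: anything outside the fences is flagged as an outlier.
outliers = [p for p in prices if p < lower or p > upper]
print(lower, upper, outliers)
```

Flagging a value is only the start. As the notes say, each flagged price should still be examined in context before deciding what to do with it.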

MEAN

• The true mean of a population uses the notation µ.

The mean is influenced by outliers. It is a NON-RESISTANT measure of center.
Measures CENTER for SYMMETRIC Distributions.

Find the SAMPLE MEAN $\bar{X}$ by hand
1st: Add up all of the data values.
2nd: Divide the sum by the sample size.

$\bar{X} = \frac{X_1 + \dots + X_n}{n}$

Find the SAMPLE MEAN $\bar{X}$ by calculator
1st: Put all the data into a list in your calculator.
2nd: STATVAR


Comparing the Mean and the Median:

When we look at distributions, where is the mean and where is the median?

NORMAL “SYMMETRIC” CURVE SKEWED DISTRIBUTIONS

• The mean will be pulled towards the tail of the skewed data.

• For Normal Bell Shaped Distributions, the mean is the more appropriate measure of the

center.

• For Skewed Distributions, the median is the more appropriate measure of the center.

Why use the mean then?

It is easy to do calculations on the mean.

The mean has nice properties that we will use later in this course.
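
To see the mean being pulled toward the tail, the sketch below reuses the two small data sets from the median slide (10, 11, 15, 19, 20, with and without the extra value 80). It relies only on Python's standard `statistics` module.

```python
from statistics import mean, median

symmetric = [10, 11, 15, 19, 20]
with_outlier = symmetric + [80]   # one large value stretches the right tail

# Without the outlier, mean and median agree.
print(mean(symmetric), median(symmetric))

# With the outlier, the mean jumps while the median barely moves.
print(mean(with_outlier), median(with_outlier))
```

The mean rises from 15 to about 25.8, while the median only moves from 15 to 17, which is what "resistant" means in practice.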

STANDARD DEVIATION

The standard deviation is influenced by outliers. It is a NON-RESISTANT measure of spread.
Measures SPREAD for SYMMETRIC Distributions.

Find the SAMPLE STANDARD DEVIATION s
  By hand: Will not be asked to find by hand. Formula in text.
  By calculator:
    1st: Put all the data into a list in your calculator
    2nd: Calculate STATVAR

Find the POPULATION STANDARD DEVIATION σ
  By hand: Not possible. If you need σ, it should be given to you.
  By calculator: Not possible. Do not let the calculator fool you.

Variance is the standard deviation squared.

Standard Deviation is the positive square root of the variance.

It is a distance measure, so it cannot be negative.

It is the “typical” distance of the datapoints to the mean.

A standard deviation of zero means ________________________.

Values that are very close together have a small standard deviation.

Values that are very far apart have a large standard deviation.

Ex: 1, 2, 3, 4, 5 Calculate the standard deviation

Ex: 1, 1, 1, 1, 1 Calculate the standard deviation
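
The two examples above can be checked with Python's standard `statistics` module, whose `stdev` and `variance` functions compute the sample versions (dividing by n - 1). This is offered as a check on calculator work, not as a replacement for it.

```python
from statistics import stdev, variance

spread_out = [1, 2, 3, 4, 5]
identical = [1, 1, 1, 1, 1]

# Sample variance s^2 and sample standard deviation s.
print(variance(spread_out), stdev(spread_out))

# When every value is the same, there is no spread at all.
print(variance(identical), stdev(identical))
```

For 1, 2, 3, 4, 5 the sample variance is 2.5, so s is the square root of 2.5 (about 1.58). For 1, 1, 1, 1, 1 both are zero.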

NOTATION:

Sample Mean: $\bar{X}$

Population Mean: µ

Sample Standard deviation: s

Sample Variance: $s^2$

Population Standard deviation: σ

Population Variance: $\sigma^2$

Suppose we have that a single observation (X) comes from a normal distribution with µ (population mean) and σ (population standard deviation) both given.

We write that the distribution of X is:

X ~ Normal ( µ , σ )

We say: “X is normally distributed with mean = µ and standard deviation = σ”

Z-Score calculation: $Z = \frac{X - \mu}{\sigma}$

The Z SCORE is unitless.

The Z SCORE tells us how far our observation is from the mean in terms of standard deviations.


Ex: Two friends are training for the Boston marathon. James is training on a hilly jogging

loop. For the general population of runners, the time to complete this loop follows a normal

distribution with a mean of 167 minutes and standard deviation 25 minutes. Rob is training

on a flat jogging route. The time to complete this flat route follows a normal distribution

with a mean of 143 minutes and standard deviation 20 minutes. If it takes James 91 minutes

to complete his loop, and it takes Rob 86 minutes to complete his loop, who is in better

condition?

Draw a picture for each of the distributions.

What is the Z-Score for each runner?

Who is in better condition?
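
A small Python sketch of the z-score comparison for the two runners follows. The function name is illustrative; the means, standard deviations, and times are the ones given in the exercise.

```python
def z_score(x, mu, sigma):
    """Z = (X - mu) / sigma: distance from the mean in standard deviations."""
    return (x - mu) / sigma

# James: hilly loop, mean 167 minutes, standard deviation 25 minutes.
james = z_score(91, mu=167, sigma=25)

# Rob: flat route, mean 143 minutes, standard deviation 20 minutes.
rob = z_score(86, mu=143, sigma=20)

print(james, rob)
```

Because a lower time is better, the runner whose z-score is further below zero is further ahead of the typical runner on his own route. Here James is about 3.04 standard deviations below his mean and Rob about 2.85.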

Lecture Notes

Part IV

Linear Regression

Scatterplots, correlations, and associations

Regression Wisdom


Scatterplots: May show a relationship or an association between two quantitative variables.

We are looking for a LINEAR relationship between our two variables.

Explanatory Variable: ____________________________

Response Variable: ______________________________

Ex: Car Weight and Gas Mileage

[Figure: Scatterplot of mileage vs weight. X axis: weight, 2500 to 6500. Y axis: mileage, 10 to 40.]

1. What is the explanatory variable?

2. What is the response variable?

3. What is the relationship between the two variables? NOTE: strength and direction

Pearson’s sample correlation coefficient= -0.842

Scatterplots: Direction of the Association, Strength of the Association, Form (Linear)

Direction of the Association

Do you see a positive association… or a negative association?

Positive: As X increases, Y increases.
Negative: As X increases, Y decreases.

Strength of the Association

Do you see a strong relationship… or a weak relationship?

Strong, Moderate, Weak

Do the points follow a single stream that is tight to the line (little scatter), or is there considerable spread (or variability) around the line (lots of scatter)?

Form of the Association

Straight: LINEAR = GOOD
Curved scatterplots or scatterplots with patterns: BAD.

Watch out for unusual features: outliers or groupings.

[Figure: example scatterplots labeled Linear Trend, Non Linear Trend, Groupings, and Outliers]

Correlation Coefficient r

• Unit-less measurement, with values between (and including) ____ and ____.
• Both variables (X and Y) must be quantitative.
• It measures the ________ and ________ of a linear relationship.
• r = 1 means all the data points lie on a ________. They have a positive slope and a positive association.
• r = -1 means all the data points lie on a ________. They have a negative slope and a negative association.
• If r = 0, the best fitting line has a slope of zero.
• Values of r close to 0 mean that the linear relationship is _________. There may be a general linear trend, but there is a lot of variability around that trend. There may be a relationship but it is NOT LINEAR.
• Correlation is sensitive to outliers. Outliers can dramatically change r. It is NOT RESISTANT to OUTLIERS.
• Because we use z-scores, the correlation coefficient does not change when converting to different units.

[Figure: example scatterplots with r = -1, r = -0.7, r = -0.4, r = 0, r = 0.3, r = 0.8, r = 1]

Calculating the Correlation Coefficient: You will not have to calculate the correlation coefficient by hand. You will need to use the statistical function on your calculator to find the value and interpret the meaning of r as it relates to the explanatory and response variables.

2nd STATS → → CLRDATA ENTER
2nd STATS → 2VAR ENTER
DATA “Enter Your Data Now”
STATVAR

STRENGTH of the LINEAR relationship

Negative
  r = -1: Points fall exactly on a straight line
  STRONG: $-1 \le r < -0.75$
  MODERATE: $-0.75 \le r < -0.35$
  WEAK: $-0.35 \le r < 0$

r = 0: No linear relationship (uncorrelated)

Positive
  WEAK: $0 < r \le 0.35$
  MODERATE: $0.35 < r \le 0.75$
  STRONG: $0.75 < r \le 1$
  r = 1: Points fall exactly on a straight line
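
The strength guidelines above can be turned into a small lookup, sketched here in Python. The function name `describe_r` is hypothetical, and the sketch assumes the moderate negative band mirrors the moderate positive one (0.35 to 0.75 in absolute value).

```python
def describe_r(r):
    """Label the direction and strength of r using the cutoffs in the notes."""
    if not -1 <= r <= 1:
        raise ValueError("r must lie between -1 and 1")
    if r == 0:
        return "no linear relationship"
    direction = "positive" if r > 0 else "negative"
    size = abs(r)
    if size > 0.75:
        strength = "strong"
    elif size > 0.35:
        strength = "moderate"
    else:
        strength = "weak"
    return f"{strength} {direction}"

# Values of r that appear elsewhere in these notes.
for r in (-0.842, 0.971, 0.67, 0.117):
    print(r, describe_r(r))
```

These cutoffs are course guidelines rather than universal rules, so the labels should always be read alongside the scatterplot.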

Ex: Estimate a correlation coefficient and describe the relationship.

1. [Figure: Mean January Air Temperatures for 30 New Zealand Locations. X axis: Latitude (°S), 35 to 45. Y axis: Temperature (°C), 14 to 20.]

______________________________________

______________________________________

______________________________________

Ex: The data below represent the number of deaths and the magnitude for six earthquakes.

A) Graph the data set

B) Calculate the correlation coefficient with your calculator

C) Based on your graph, do you think this is an accurate statistic? Explain

Magnitude (X)   6.6   8.3   6.2   6.6   6.9   7.4
Deaths (Y)      60    503   115   65    62    1

[Blank grid for graphing: X axis from 6.0 to 8.5, Y axis from 100 to 500]

______________________________________

______________________________________

______________________________________
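
For the earthquake exercise, the sketch below computes r from z-scores, which is the approach the notes mention, and then repeats the calculation without the magnitude 8.3 earthquake. Dropping that point is done here only to illustrate how sensitive r is to one observation; it is not a recommendation to delete data.

```python
from statistics import mean, stdev

magnitude = [6.6, 8.3, 6.2, 6.6, 6.9, 7.4]
deaths = [60, 503, 115, 65, 62, 1]

def correlation(xs, ys):
    """Pearson's r: sum of products of z-scores, divided by n - 1."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)
    z_products = [((x - x_bar) / s_x) * ((y - y_bar) / s_y)
                  for x, y in zip(xs, ys)]
    return sum(z_products) / (n - 1)

r_all = correlation(magnitude, deaths)

# Same calculation with the (8.3, 503) earthquake left out.
keep = [i for i, m in enumerate(magnitude) if m != 8.3]
r_without = correlation([magnitude[i] for i in keep],
                        [deaths[i] for i in keep])

print(round(r_all, 2), round(r_without, 2))
```

With all six earthquakes r is about 0.73, but without the single large earthquake it is about -0.96. One point changes both the strength and the direction, which is why part C asks whether r is an accurate summary here.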

CORRELATION COEFFICIENT r

Find using your calculator, or find by taking the square root of $r^2$.

If the slope is zero, then r is zero.

If the slope is positive, then r is positive. As (x) increases, (y) increases.
  There is a strong positive linear relationship between (x) and (y).
  There is a moderate positive linear relationship between (x) and (y).
  There is a weak positive linear relationship between (x) and (y).

If the slope is negative, then r is negative. As (x) increases, (y) decreases.
  There is a strong negative linear relationship between (x) and (y).
  There is a moderate negative linear relationship between (x) and (y).
  There is a weak negative linear relationship between (x) and (y).

Cautions:

CORRELATION simply does NOT imply CAUSATION:
a. May be a coincidence
b. Both variables might be directly influenced by some common underlying lurking or confounding variable

If the correlation is not strong, predictions will not be accurate.

Extrapolation: making predictions outside of the range for which you have data.
• Do NOT extrapolate ever!

Lurking variable = A variable that is NOT one of the EXPLANATORY or RESPONSE variables in the study BUT it INFLUENCES the interpretation of the relationship between x and y.

The Linear Model:

A regression line is a straight line that models the relationship between an ____________ variable and a ____________ variable. Therefore, it is only useful when one variable helps to predict the other.

Least-squares regression line: LSRL

• Best-fitting line to the data.
• Minimizes the sum of the squared (vertical) distances of your observations (data) from your line.
• The distances are squared because some data points will be above the line (positive) and some will be below the line (negative).
• The LSRL describes how a response variable y changes as an explanatory variable x changes.
• The LSRL predicts a response, $\hat{y}$, from a given explanatory variable, x.

Lecture Notes

Part V

Regression Wisdom

Residuals

For every given value of X:
• We have a true/observed data value for y.
• We have a predicted value for y.

e = error = Difference between observed y and predicted y = Observed – Predicted

It is a “model”, which is not perfect. Some of the data points might be above the line, and some might be below the line.
• Points above the line (positive residual) are UNDER-PREDICTIONS.
• Points below the line (negative residual) are OVER-PREDICTIONS.

Ex: Scatterplot of Systolic Blood Pressure versus Weight (Sample of 12 American Adults).

[Figure: Scatterplot of SBP vs Weight. X axis: Weight, 160 to 220. Y axis: SBP, 130 to 170.]

Pearson correlation of SBP and Weight = 0.971

The regression equation is: y = 1.1 + 0.764(x)

1) Suppose we know that one person in the sample weighs 188 pounds and has a systolic blood pressure of

136.

2) Predict the SBP for a person weighing 188 pounds using your regression model.

3) Find the residual.
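
Steps 2 and 3 can be checked with the short Python sketch below, which plugs the weight into the regression equation given in the example and then subtracts the prediction from the observed value. The function name is illustrative.

```python
def predict_sbp(weight):
    """Regression equation from the example: y-hat = 1.1 + 0.764(x)."""
    return 1.1 + 0.764 * weight

observed = 136                  # SBP of the person weighing 188 pounds
predicted = predict_sbp(188)    # step 2: predicted SBP

# Step 3: residual = observed - predicted.
residual = observed - predicted

print(round(predicted, 3), round(residual, 3))
```

The residual is negative (about -8.7), so this person sits below the line and the model over-predicts their blood pressure.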

LSRL: Slope-Intercept Form

$\hat{y} = b + a x$

Intercept: $b = \bar{y} - a\bar{x}$

Slope: $a = r \frac{s_y}{s_x}$

The Least Squares Regression Line always passes through the point $(\bar{x}, \bar{y})$.

Interpreting b: When ___x____ is 0, ___y___ is equal to ____b___.

Interpreting a positive slope: Increasing ____X_____ by 1 is associated with increasing ____Y_____ by _________a______.

Interpreting a negative slope: Increasing ____X_____ by 1 is associated with decreasing ____Y_____ by _________a______.

Ex:
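
The slope and intercept formulas can be sketched in Python as follows, using the notes' convention that a is the slope and b is the intercept. The five data points are made up purely for illustration and do not come from the lecture.

```python
from statistics import mean, stdev

def lsrl(xs, ys):
    """Return (a, b) for y-hat = b + a*x, with a = r*(s_y/s_x), b = y_bar - a*x_bar."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)
    # Pearson's r from deviations and sample standard deviations.
    r = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / ((n - 1) * s_x * s_y)
    a = r * s_y / s_x
    b = y_bar - a * x_bar
    return a, b

# Hypothetical data, chosen only to keep the arithmetic simple.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

a, b = lsrl(xs, ys)
residuals = [y - (b + a * x) for x, y in zip(xs, ys)]

print(round(a, 3), round(b, 3))
print(round(sum(residuals), 10))
```

For this made-up data the slope is 0.6 and the intercept is 2.2. The line passes through the point of means (3, 4), and the residuals add up to zero apart from rounding.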


An international distance triathlon consists of a 1.5 km swim, a 40 km bike ride and a 10 km run.

Triathletes are ranked based on their overall finishing times, and some people suggest that an

athlete’s time for the swim has the largest influence on his overall performance. Data from 10

male triathletes who competed in the 2004 Camp Pendleton International Triathlon was analyzed

to produce the results below:

[Figure: Scatterplot of Overall Finishing Time vs Swim Time. X axis: Swim Time (Minutes), 20 to 45. Y axis: Overall Finishing Time (Minutes), 150 to 200.]

The regression equation is: Overall Finishing Time = 122 + 1.56 (Swim Time)

1. The correct interpretation of the slope is:

A. Increasing the overall finishing time by 1 minute is associated with increasing the

swim time by 122 minutes.

B. Increasing the swim time by 1 minute is associated with increasing the overall

finishing time by 1.56 minutes.

C. Increasing the swim time by 1 minute is associated with increasing the overall

finishing time by 122 minutes.

D. Increasing the overall finishing time by 1 minute, is associated with increasing the

swim time by 1.56 minutes.

Answer:

2. One athlete completed the swim in 34 minutes.

a. Calculate the predicted finishing time.

Answer:

b. Find the athlete’s actual finish time if the value of the residual for this swim time

is 11 minutes.

Answer:
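
Question 2 uses the definition residual = observed - predicted in both directions. A minimal Python sketch, with illustrative names, is:

```python
def predicted_finish(swim_minutes):
    """Regression equation from the example: 122 + 1.56 (Swim Time)."""
    return 122 + 1.56 * swim_minutes

# 2a: predicted finishing time for a 34-minute swim.
predicted = predicted_finish(34)

# 2b: residual = actual - predicted, so actual = predicted + residual.
residual = 11
actual = predicted + residual

print(round(predicted, 2), round(actual, 2))
```

A positive residual means the athlete finished later than the line predicts, so the actual time (about 186.04 minutes) is above the predicted time (about 175.04 minutes).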

The sum of your residuals is equal to ZERO.

Residual Plot: Measuring Predictive Power

• Scatterplot of the (x, residual) pairs.
• The x-axis is for the x values.
• The y-axis is for the residuals.
• There is a horizontal line at y = 0.
• The model is a good fit if the plot looks random.

Do NOT use linear regression if:
• Unusually large values for your residuals
• Non-linear patterns (curvature)
• Uneven variation (Fanning)
• Influential observations

Coefficient of Determination: $R^2$ (Measuring Predictive Power)

• It is the percent of variation in (y) that can be explained by (x).
• Describes the connection between x and y.
• How accurate predictions will be. Is your line a good predictor of reality?

There is ___________ predictive power.

$r^2$ > 80%: Excellent
50% < $r^2$ < 80%: Good
25% < $r^2$ < 50%: Fair
0% < $r^2$ < 25%: Weak


Example: Does more education result in more crime? Education was measured as the percentage of residents aged at least 25 in the county who had at

least a high school degree. Crime rate was measured as the number of crimes in Florida County

in the past year per 1000 residents. The correlation coefficient between these variables is 0.67.

a) What is the coefficient of determination?

b) Give the definition and describe the strength of the predictive power based on the

guidelines above.
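
Parts a) and b) can be checked with the sketch below. The notes give the guideline bands with strict inequalities and do not say where exact boundary values such as 50% belong, so putting a boundary value in the lower band is an assumption of this sketch. The function name is hypothetical.

```python
def predictive_power(r):
    """Return r squared and its label under the course guidelines."""
    r_squared = r ** 2
    if r_squared > 0.80:
        label = "Excellent"
    elif r_squared > 0.50:
        label = "Good"
    elif r_squared > 0.25:
        label = "Fair"
    else:
        label = "Weak"
    return r_squared, label

r_squared, label = predictive_power(0.67)
print(round(r_squared, 4), label)
```

With r = 0.67, r squared is 0.4489, so about 45% of the variation in crime rate is explained by education, which falls in the Fair band. This describes predictive power only and says nothing about whether education causes crime.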

Ex: The scatterplot below shows the progress of world record times (in seconds) for the 10,000-

meter run up to mid-2004.

[Figure: Fitted Line Plot, record = 4915 - 1.594 year. X axis: year, 1900 to 2000. Y axis: record, 1500 to 2300.]

Which of the following correctly identifies the situation above that results in a misleading

regression result:

A. Using correlation and a straight-line equation to describe curvilinear

data.

B. Combining groups inappropriately

C. Allowing outliers to overly influence the results

D. Linear regression cannot be used on time data (years).

Answer:

Conditions and cautions for Linear Regression:

• Be careful of different groups (subgroups) being combined in your regression.
• Extrapolation: Do not make predictions outside of the range for which you have data.
• Correlation simply does not imply causation. May be a coincidence or there may be a lurking variable.
• If the relationship is NOT linear and the correlation is NOT strong, then predictions will NOT be accurate.
• Look for unusual points:
  Leverage: X values far from the mean of X. Their residuals can appear to be small.
  Outliers: Any data point that stands away from the others. They may result in a large residual or have high leverage.
  Influential Points: Removing this point from the data set results in a very different regression model.

[Figure: scatterplot with axes labeled Variable 1 and Variable 2, with tick marks running from 0 to 80 and from 0 to 60]

Ex: Oil production.

The correlation between oil production and year is r = 0.117.

Is there a relationship between year and oil production?

Is a linear regression appropriate?

______________________

____________________________

____________________________

Ex:

Which statement can be correctly applied to the point X?

X

A. The point X is an outlier
B. The point X is an influential point
C. The point X has high leverage
D. A & B are true
E. A, B & C are true.

Answer:

Lecture Notes

Part VI

All About Testing


THE DATA:

Grade Gender Counselor cst2007 cst2008 tutor achievement gain

1 1 F Hornas 240 260 20 low 20

2 4 M Thomas 305 298 20 high -7

3 4 F Hornas 250 300 20 high 50

4 6 M Hornas 296 301 20 high 5

5 6 F Hornas 292 302 20 high 10

6 4 M Roberts 288 303 32 high 15

7 6 F Hornas 300 304 32 high 4

8 2 M Rodriguez 240 305 32 high 65

9 3 M Hornas 280 240 32 low -40

10 6 F Hornas 300 330 32 high 30

11 3 F Rodriguez 280 256 32 low -24

12 2 F Roberts 250 280 46 low 30

13 3 F Roberts 284 240 46 low -44

14 7 M Thomas 300 330 46 high 30

15 4 M Hornas 301 315 46 high 14

16 3 M Rodriguez 220 240 46 low 20

17 4 F Rodriguez 250 250 46 low 0

18 4 F Thomas 315 296 50 low -19

19 8 M Thomas 310 292 52 low -18

20 6 M Hornas 288 288 52 low 0

21 5 F Rodriguez 320 330 52 high 10

22 7 M Thomas 301 315 52 high 14

23 3 M Roberts 280 240 56 low -40

24 5 F Thomas 300 300 56 high 0

25 4 M Thomas 301 315 56 high 14

26 3 M Hornas 250 256 56 low 6

27 4 F Rodriguez 303 320 56 high 17

28 2 M Hornas 304 330 58 high 26

29 3 F Hornas 305 322 58 high 17

30 4 M Rodriguez 258 292 58 low 34

31 4 F Rodriguez 301 315 58 high 14

32 1 M Hornas 240 252 58 low 12

33 2 F Rodriguez 240 254 62 low 14

34 3 F Rodriguez 254 255 62 low 1

35 4 M Thomas 211 305 62 high 94

36 4 F Roberts 256 288 66 low 32

37 7 M Thomas 275 310 66 high 35

38 4 M Rodriguez 303 320 66 high 17

39 3 F Thomas 250 240 66 low -10

40 3 M Rodriguez 303 320 68 high 17

41 1 M Roberts 240 296 66 low 56

42 3 M Roberts 222 240 66 low 18

43 2 F Roberts 258 292 66 low 34

44 4 M Thomas 292 290 74 low -2

45 2 M Hornas 258 292 74 low 34

46 1 M Rodriguez 240 303 74 high 63

The data has 58 students.

GRADE: year in school

GENDER: male or female

COUNSELOR: Name of counselor

CST 2007 and CST 2008: California Standards Test scores (in English Language Arts) from two different academic years, 2007 and 2008.

TUTOR: Number of hours in the tutoring intervention. Given in between the two testing points, in an after school program.

ACHIEVEMENT: High if CST scores in 2008 are above the median; otherwise the achievement is classified as low.

GAIN: Difference between 2008 test score and 2007 test score. If positive, the student did better in 2008 compared to 2007. If negative, the student did worse in 2008 compared to 2007.

Barplot: Counselor

> summary(Counselor)
   Hornas   Roberts Rodriguez    Thomas
       16        10        16        15

Frequency Table: Gender

 F  M
26 31

Pie Chart: Gender

Contingency Table: Gender versus Counselor

       Counselor
Gender Hornas Roberts Rodriguez Thomas
     F      7       5         9      5
     M      9       5         7     10


Histogram: Number of hours Tutored

Stemplot: Number of hours tutored

The decimal point is 1 digit(s) to the right of the |

2 | 00000

3 | 222222

4 | 666666

5 | 022226666688888

6 | 22266666668

7 | 4446668

8 | 0000000

> summary(tutor)

Min. 1st Qu. Median Mean 3rd Qu. Max.

20.00 46.00 58.00 56.14 68.00 80.00


Boxplot: CST Scores for 2007 and 2008

> summary(cst2007)

Min. 1st Qu. Median Mean 3rd Qu. Max.

211.0 240.0 280.0 270.4 300.0 320.0

> summary(cst2008)

Min. 1st Qu. Median Mean 3rd Qu. Max.

222 254 296 286 305 330


Scatterplot: Gain in CST score versus the number of hours in tutoring intervention

HYPOTHESIS TESTING

While there are many types of inferential statistics, today we are going to focus on three that are commonly used in education and

The basis of the problem determines the type of analysis.

T-test: Test of means
Regression: Test the relationship between two quantitative variables
Chi-Square: Test the relationship between two categorical variables

What type of test would you use to answer the following three questions?

1. Are there differences between the pre and post test scores?

2. Does level of intervention involvement (i.e., more hours of tutoring) predict higher achievement gains?

3. Are there gender differences in achievement?

Common Threads of Hypothesis Tests

Ho: Null Hypothesis: This states what is generally believed to be the true population parameter.

Ha: Alternative Hypothesis: The “research hypothesis”. What we are trying to prove.

Test Statistic:
Based on what type of test we are conducting.
Calculated from our data.

Test statistics FAR AWAY from ZERO indicate the outcome measured from the sample data is UNLIKELY to happen if the null hypothesis is true. Therefore we REJECT Ho.

Test statistics CLOSE to ZERO indicate the outcome measured from the sample data is LIKELY to happen if the null hypothesis is true. Therefore we FAIL TO REJECT Ho.

Critical Value:
Based on what type of test we are conducting.
A cut off value that is determined from the null hypothesis and the significance level of your test.


Significance Level:

Common values are .01, .05, and .10.

Sets the threshold for REJECTING Ho.

P-value:

The p-value measures how much evidence you have against the null hypothesis: it is the probability, assuming the null hypothesis is true, of an outcome at least as extreme as the one measured from the sample data.

We compare the p-value to the significance level.

SMALL p-values (below the significance level) indicate the outcome measured from the sample data is UNLIKELY to happen if the null hypothesis is true. Therefore we REJECT Ho.

LARGE p-values (at or above the significance level) indicate the outcome measured from the sample data is LIKELY to happen if the null hypothesis is true. Therefore we FAIL TO REJECT Ho.


Student’s t distribution

We can STANDARDIZE using the t distribution.

Developed by William S. Gosset

Standard Error:

$SE_{\bar{x}} = \frac{S}{\sqrt{n}}$

Degrees of Freedom: n-1

$t = \frac{\bar{X} - \mu}{SE_{\bar{x}}} = \frac{\bar{X} - \mu}{S / \sqrt{n}}$

The t score is the number of standard errors the sample mean is above or below the hypothesized population mean.


Stating Hypotheses:

Null Hypothesis H0

This states what is generally believed to be the true population parameter.

The population MEAN EQUALS ___.

Alternative Hypothesis HA

The “research hypothesis”. What we are trying to prove.

ONE TAILED

The population MEAN is GREATER than _____.

The population MEAN is LESS than ____.

TWO TAILED

The population MEAN does NOT equal _____.


1. Calculate the test statistic

$t = \frac{\bar{X} - \mu}{SE_{\bar{x}}} = \frac{\bar{X} - \mu}{S / \sqrt{n}}$

2. Make a statistical decision and justify the decision based on the p-value:

Since the p-value is small, we REJECT the null hypothesis.

OR

Since the p-value is large, we FAIL TO REJECT the null hypothesis.

3. State your conclusion in the context of the problem:

We REJECT the null hypothesis and conclude the alternative IS TRUE __________.

The results of our sample ARE statistically significant.

There IS sufficient evidence against the null hypothesis.

We FAIL TO REJECT the null hypothesis and CANNOT conclude the alternative is true __________.

The results of our sample ARE NOT statistically significant.

There IS NOT sufficient evidence against the null hypothesis.


Below is the summary information for the difference between the test scores for 2008 versus 2007:

> summary(gain)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  -44.0     3.0    14.0    15.6    32.0    94.0

> sd(gain)
[1] 25.17997

Calculate the test statistic:
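
The calculation can be sketched in R from the summary values above. This is a minimal sketch, not the instructor's worked solution: the sample size n = 57 is an assumption taken from the totals of the gender and counselor tables, the null value 0 is an assumption meaning "no change on average", and the variable names are illustrative.

```r
# Values read from the summary(gain) and sd(gain) output above.
x_bar <- 15.6       # sample mean of gain (rounded by summary())
s     <- 25.17997   # sample standard deviation of gain
n     <- 57         # ASSUMPTION: sample size, taken from the table totals
mu_0  <- 0          # ASSUMPTION: null value, no change on average

se     <- s / sqrt(n)           # standard error of the sample mean
t_stat <- (x_bar - mu_0) / se   # standard errors between x_bar and mu_0
df     <- n - 1                 # degrees of freedom

print(c(se = se, t = t_stat, df = df))
```

Because the mean is rounded to 15.6 in the summary output, the result agrees with a t.test() on the raw data only to about two decimal places.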


1-sided t-test:

Ho: There is no difference between the average CST scores in 2008 and 2007

Ha: The average CST score is higher in 2008 compared to 2007

Below is the 1-sided t-test output:

> t.test(gain, alternative = "greater")

One Sample t-test

data: gain

t = 4.6764, df = 56, p-value = 9.426e-06

alternative hypothesis: true mean is greater than 0

95 percent confidence interval:

10.01835 Inf

sample estimates:

mean of x

15.59649


2-sided t-test:

Ho: There is no difference between the average CST scores in 2008 and 2007

Ha: The average CST score is different in 2008 compared to 2007

Below is the 2-sided t-test output:

> t.test(gain, alternative = "two.sided")

One Sample t-test

data: gain

t = 4.6764, df = 56, p-value = 1.885e-05

alternative hypothesis: true mean is not equal to 0

95 percent confidence interval:

8.915347 22.277636

sample estimates:

mean of x

15.59649
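
The 1-sided and 2-sided outputs share the same t statistic and differ only in how the p-value is taken from the t distribution. The sketch below illustrates this with base R's pt() and qt(), using the reported t = 4.6764 and df = 56. The significance level of 0.05 is an assumption chosen for illustration, and the variable names are hypothetical.

```r
t_stat <- 4.6764   # test statistic reported by t.test()
df     <- 56       # degrees of freedom reported by t.test()
alpha  <- 0.05     # ASSUMPTION: significance level chosen for illustration

# 1-sided ("greater"): area under the t curve to the right of t_stat
p_one <- pt(t_stat, df, lower.tail = FALSE)

# 2-sided: area in both tails beyond |t_stat|
p_two <- 2 * pt(abs(t_stat), df, lower.tail = FALSE)

# Critical values the test statistic is compared against
crit_one <- qt(1 - alpha, df)       # 1-sided cutoff
crit_two <- qt(1 - alpha / 2, df)   # 2-sided cutoff

print(c(p_one = p_one, p_two = p_two))
print(c(crit_one = crit_one, crit_two = crit_two))
```

The 2-sided p-value is exactly twice the 1-sided one, which is why the two outputs report 1.885e-05 and 9.426e-06. Under the assumed 0.05 level both tests reject Ho.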


Test for the SLOPE

Ho: There is no relationship between X and Y

HA: There is a relationship between X and Y

df = n - 2

Test Statistic

$t = \frac{a}{SE_a}$, where a is the estimated slope and $SE_a$ is the Standard Error for a

Critical Value

Find the critical value by looking up the corresponding significance level

α = Area of the Rejection Region

If the absolute value of the test statistic is greater than the critical value, REJECT Ho and conclude the alternative is true. There is a relationship between X and Y.

If the absolute value of the test statistic is not greater than the critical value, FAIL TO REJECT Ho. There is NOT sufficient evidence of a relationship between X and Y.


> fit.1 <- lm( gain ~ tutor, data )

> fit.1

Call:

lm(formula = gain ~ tutor, data = data)

Coefficients:

(Intercept) tutor

-0.4378 0.2856

> summary(fit.1)

Call:

lm(formula = gain ~ tutor, data = data)

Residuals:

Min 1Q Median 3Q Max

-56.700 -12.700 -0.414 13.587 76.730

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) -0.4378 10.8346 -0.040 0.968

tutor 0.2856 0.1839 1.553 0.126

Residual standard error: 24.87 on 55 degrees of freedom

Multiple R-squared: 0.04203, Adjusted R-squared: 0.02461

F-statistic: 2.413 on 1 and 55 DF, p-value: 0.1261
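
The slope row of the coefficient table can be reproduced by hand from the estimate and its standard error. The sketch below assumes the rounded values printed above and uses the 55 residual degrees of freedom reported in the output; the variable names are illustrative.

```r
slope    <- 0.2856   # estimated slope for tutor (from the output)
se_slope <- 0.1839   # standard error of the slope (from the output)
df       <- 55       # residual degrees of freedom (from the output)

# Test statistic for Ho: slope = 0
t_slope <- slope / se_slope

# Two-sided p-value from the t distribution
p_slope <- 2 * pt(abs(t_slope), df, lower.tail = FALSE)

# With a single predictor, the F-statistic equals the squared slope t value
f_stat <- t_slope^2

print(round(c(t = t_slope, p = p_slope, F = f_stat), 3))
```

The results match the printed t value, Pr(>|t|), and F-statistic up to rounding of the inputs. The p-value is well above the common significance levels, so this sample gives no sufficient evidence that hours of tutoring predict the gain.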


> ## Get the residuals from the fitted model
> resids <- residuals(fit.1)
>
> ## Plot the residuals against the predictor
> plot(data$tutor, resids,
+      xlab = "Number of hours in tutoring intervention",
+      ylab = "Residuals",
+      main = "Residual Plot")
> ## Add a horizontal reference line at zero
> abline(h = 0)


Chi Squared Test

Karl Pearson (1857-1936)

Ho: There is no relationship between the groups and the categories

HA: There is a relationship between the groups and the categories

df = (rows-1)*(cols-1)

Test Statistic

The idea is to measure the difference between the observed frequencies and expected frequencies

$\chi^2 = \sum \frac{(O - E)^2}{E}$

Critical Value

Find the critical value by looking up the corresponding significance level

α = Area of the Rejection Region

If the test statistic is greater than the critical value, REJECT Ho and conclude the alternative is true. There is a relationship between the groups and the categories.

If the test statistic is not greater than the critical value, FAIL TO REJECT Ho. There is NOT sufficient evidence of a relationship between the groups and the categories.


Ho: There is no relationship between gender and achievement

Ha: There is a relationship between gender and achievement

Step 1: We are given the OBSERVED TABLE:

         High   Low   Total
Female     11    15      26
Male       16    15      31
Total      27    30      57

Step 2: Calculate the EXPECTED counts

Multiply the row total and the column total and then divide by the sample size

         High                        Low
Female   E11 = 26*27/57 = 12.31579   E12 = 26*30/57 = 13.68421
Male     E21 = 31*27/57 = 14.68421   E22 = 31*30/57 = 16.31579

Step 3: Calculate the TEST STATISTIC

$\chi^2 = \frac{(O_{11} - E_{11})^2}{E_{11}} + \frac{(O_{12} - E_{12})^2}{E_{12}} + \frac{(O_{21} - E_{21})^2}{E_{21}} + \frac{(O_{22} - E_{22})^2}{E_{22}}$

$\chi^2 = \frac{(11 - 12.31579)^2}{12.31579} + \frac{(15 - 13.68421)^2}{13.68421} + \frac{(16 - 14.68421)^2}{14.68421} + \frac{(15 - 16.31579)^2}{16.31579} = 0.4911$

The value .1888 is what results when the continuity correction is applied (each |O - E| is reduced by 0.5 before squaring), as R's chisq.test() does by default for 2x2 tables.

Step 4: Calculate the CRITICAL VALUE

df = (2-1)*(2-1) = 1


Step 5: Make a STATISTICAL DECISION and INTERPRET in context of the problem

If our test statistic is greater than the critical value, we REJECT Ho and conclude the alternative is true. There is a relationship between the groups and the categories.

If our test statistic is not greater than the critical value, we FAIL TO REJECT Ho. There is NOT sufficient evidence of a relationship between the groups and the categories.
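
The five steps can be traced in R using the observed table from Step 1. This is a sketch under stated assumptions: the significance level of 0.05 is an assumption (the notes do not give one for this example), and the variable names are illustrative. It computes the statistic both from the formula as written and with the continuity correction, which is the version that yields .1888.

```r
# Step 1: observed counts (rows: gender, columns: achievement)
observed <- matrix(c(11, 15,
                     16, 15),
                   nrow = 2, byrow = TRUE,
                   dimnames = list(Gender = c("Female", "Male"),
                                   Achievement = c("High", "Low")))

# Step 2: expected count = row total * column total / sample size
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

# Step 3: test statistic, from the formula and with the continuity correction
chi_sq       <- sum((observed - expected)^2 / expected)
chi_sq_yates <- sum((abs(observed - expected) - 0.5)^2 / expected)

# Step 4: critical value
alpha    <- 0.05   # ASSUMPTION: significance level chosen for illustration
df       <- (nrow(observed) - 1) * (ncol(observed) - 1)
critical <- qchisq(1 - alpha, df)

# Step 5: reject Ho only if the statistic exceeds the critical value
reject <- chi_sq > critical

print(round(c(chi_sq = chi_sq, chi_sq_yates = chi_sq_yates,
              critical = critical), 4))
print(reject)
```

Under the assumed 0.05 level, both versions of the statistic fall well below the critical value, so we fail to reject Ho: this sample does not give sufficient evidence of a relationship between gender and achievement.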

