
Re-Expressing Data (Part 2)

Chapter 9

AP Statistics

The only statistics you can trust are those you falsified yourself.

Sir Winston Churchill (1874 - 1965). (The attribution to Churchill is, ironically, itself falsified.)

Goals of Re-expression

Goal 1 – Make the distribution of a variable more symmetric. A symmetric distribution can be analyzed much more easily than a skewed distribution.

Goal 2 – Make the spread of several groups more alike. With similar spreads, distributions are easier to compare.

Goal 3 – Make the form of a scatterplot more linear. Linear regression is easy – non-linear regression is not!

Goal 4 – Make the scatter in a scatterplot spread out evenly rather than following a fan shape. An even scatter is a necessary condition for some of the analyses we will learn about later.

What Transformation?

When in doubt, start here: the Ladder of Powers (see p. 237).

Power | Name | Comment
2 | Square of data values | Try with unimodal distributions that are skewed to the left.
1 | Raw data | Data with positive and negative values and no bounds are less likely to benefit from re-expression.
1/2 | Square root of data values | Counts often benefit from a square root re-expression.
"0" | We'll use logarithms here | Measurements that cannot be negative often benefit from a log re-expression.
-1/2 | Reciprocal square root | An uncommon re-expression, but sometimes useful.
-1 | The reciprocal of the data | Ratios of two quantities (e.g., mph) often benefit from a reciprocal.
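The rungs of the ladder are easy to try directly. Below is a minimal sketch in Python (NumPy), using a small made-up set of positive, right-skewed values, just to show what each re-expression does to the same data:

import numpy as np

# Hypothetical positive, right-skewed values (illustration only).
data = np.array([1.2, 3.5, 8.0, 20.0, 55.0])

ladder = {
    "power 2 (square)":           data ** 2,
    "power 1 (raw data)":         data,
    "power 1/2 (square root)":    np.sqrt(data),
    'power "0" (logarithm)':      np.log10(data),
    "power -1/2 (recip. root)":   1 / np.sqrt(data),
    "power -1 (reciprocal)":      1 / data,
}

for name, values in ladder.items():
    print(f"{name:<26} {np.round(values, 3)}")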

Important Models

Exponential Model: log(ŷ) = b₀ + b₁·x

[Figure: "Original Data" scatterplot of y vs. x and "Transformed Data" scatterplot of log(y) vs. x]

This is the zero power on the ladder. It is useful for values that grow (or shrink) by percentages.
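Here is a minimal sketch of fitting an exponential model in Python (NumPy), with made-up data that grow by roughly the same percentage at each step; the regression is run on log10(y), and predictions are back-transformed with a power of 10:

import numpy as np

x = np.array([1, 2, 3, 4, 5, 6], dtype=float)
y = np.array([2.1, 4.3, 8.4, 16.9, 33.5, 68.0])   # roughly doubles each step

# Straighten the relationship by regressing log10(y) on x.
b1, b0 = np.polyfit(x, np.log10(y), deg=1)
print(f"log10(y-hat) = {b0:.3f} + {b1:.3f} x")

# Back-transform to predict on the original scale.
x_new = 7.0
print(f"predicted y at x = {x_new}: {10 ** (b0 + b1 * x_new):.1f}")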

Important Models

Logarithmic Model: ŷ = b₀ + b₁·log(x)

Data with a wide range of x-values or with a scatterplot that is very steep at the left and levels out towards the right.

[Figure: "Original Data" scatterplot of y vs. x and "Transformed Data" scatterplot of y vs. log(x)]
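A minimal sketch of the logarithmic model (Python/NumPy, made-up values spanning a wide range of x). Only x is re-expressed here, so predictions come straight off the fitted line with no back-transformation:

import numpy as np

x = np.array([10, 50, 200, 1000, 5000, 20000], dtype=float)
y = np.array([2.1, 3.4, 4.6, 5.9, 7.2, 8.4])

# Regress y on log10(x): steep on the left, leveling off to the right.
b1, b0 = np.polyfit(np.log10(x), y, deg=1)
print(f"y-hat = {b0:.2f} + {b1:.2f} log10(x)")
print(f"predicted y at x = 2500: {b0 + b1 * np.log10(2500):.2f}")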

Important Models

Power Model: log(ŷ) = b₀ + b₁·log(x)

The authors of the textbook call this one the Goldilocks Model – when steps on the ladder are either too big or too small.

[Figure: "Original Data" scatterplot of y vs. x and "Transformed Data" scatterplot of log(y) vs. log(x)]
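In a power model, the slope of the log-log fit estimates the exponent. A minimal sketch (Python/NumPy) with data built from y = 2·x^1.5 plus a little noise, just to show that the fitted slope lands near 1.5 and the intercept near log10(2):

import numpy as np

rng = np.random.default_rng(1)
x = np.array([1, 2, 4, 6, 8, 10], dtype=float)
y = 2 * x ** 1.5 * rng.normal(1.0, 0.02, size=x.size)   # power law with mild noise

b1, b0 = np.polyfit(np.log10(x), np.log10(y), deg=1)
print(f"log10(y-hat) = {b0:.3f} + {b1:.3f} log10(x)")   # b1 near 1.5, b0 near log10(2)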

Example

Below are data from 12 perch caught in a lake in Finland (length in cm and weight in grams).

Length (cm) | Weight (g) | Length (cm) | Weight (g)
8.8 | 5.9 | 28.7 | 300.0
19.2 | 100.0 | 30.1 | 300.0
22.5 | 110.0 | 39.0 | 685.0
23.5 | 120.0 | 41.4 | 650.0
24.0 | 150.0 | 42.5 | 820.0
25.5 | 145.0 | 46.6 | 1000.0

Example

In order to create a model to predict weight from length, start by looking at the data:

There is a fairly strong, positive, and nonlinear association between weight and length.

Example

[Scatterplot of weight (g) vs. length (cm)]

Example

We need to transform the data (one or both variables) to achieve a more linear relationship. In the biological sciences, power models are fairly common, so we’ll start there.

Take the logarithm of both variables (either base-10 or base-e log – we don't care which).

The association between the logs of the variables is quite linear.

Example

Create a linear model, and then check the residuals to determine if the model may be reasonable. Note – you can’t use either R or R-squared to determine if your model is reasonable. These statistics are only useful after you assess the model fit.

Regression Analysis: log(W) versus log(L)

The regression equation is
log(W) = -2.06 + 3.05 log(L)

Predictor     Coef   SE Coef       T      P
Constant   -2.0596    0.1498  -13.75  0.000
log(L)      3.0538    0.1037   29.44  0.000

S = 0.0680088   R-Sq = 98.9%   R-Sq(adj) = 98.7%

Example

Linear Model – remember, your calculator doesn't know you are using log-transformed data when it produces the equation.

log(Ŵ) = -2.06 + 3.05 log(L)

The residuals appear to be fairly random, so this linear model is reasonably appropriate.
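A minimal sketch (Python/NumPy, not the calculator or Minitab workflow) that re-creates the steps above on the perch data from the table: take logs of both variables, fit the line, and look at the residuals. The fitted coefficients should land near the Minitab values (-2.06 and 3.05):

import numpy as np

length = np.array([8.8, 19.2, 22.5, 23.5, 24.0, 25.5,
                   28.7, 30.1, 39.0, 41.4, 42.5, 46.6])      # cm
weight = np.array([5.9, 100.0, 110.0, 120.0, 150.0, 145.0,
                   300.0, 300.0, 685.0, 650.0, 820.0, 1000.0])  # g

logL, logW = np.log10(length), np.log10(weight)

# Fit log(W) on log(L); expect roughly log(W) = -2.06 + 3.05 log(L).
b1, b0 = np.polyfit(logL, logW, deg=1)
print(f"log10(W-hat) = {b0:.3f} + {b1:.3f} log10(L)")

# Residuals on the log scale: look for random scatter (e.g., plot them against logL).
residuals = logW - (b0 + b1 * logL)
print(np.round(residuals, 3))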

Example

Describe what the slope represents:

log(Ŵ) = -2.06 + 3.05 log(L)

For every one-unit increase in the log of length, the predicted log of weight increases by about 3.05.

Example

Describe what the correlation represents:

The correlation is the square root of R-squared, which is about 0.994. This indicates there is a very strong, positive, linear relationship between the logs of weight and length.

Predictor     Coef   SE Coef       T      P
Constant   -2.0596    0.1498  -13.75  0.000
log(L)      3.0538    0.1037   29.44  0.000

S = 0.0680088   R-Sq = 98.9%   R-Sq(adj) = 98.7%
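A one-line check of that arithmetic (the correlation takes the sign of the slope, which is positive here):

import math
print(math.sqrt(0.989))   # about 0.994, matching the value quoted above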

Example

Describe what R-squared represents:

About 98.9% of the variability in the log of weight is accounted for by the regression with the log of length.

Predictor     Coef   SE Coef       T      P
Constant   -2.0596    0.1498  -13.75  0.000
log(L)      3.0538    0.1037   29.44  0.000

S = 0.0680088   R-Sq = 98.9%   R-Sq(adj) = 98.7%

Example

Use the model to predict the weight of a perch that is 35 cm long.

The predicted weight for a 35 cm perch is about 446 grams.

log(Ŵ) = -2.06 + 3.05 log(L)

log(Ŵ) = -2.06 + 3.05 log(35)

log(Ŵ) ≈ 2.649

Ŵ ≈ 10^2.649

Ŵ ≈ 445.66 grams
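The same back-transformation as a quick check (Python; the small difference from 445.66 is just rounding of the logarithm before exponentiating):

import math
log_w = -2.06 + 3.05 * math.log10(35)
print(log_w, 10 ** log_w)   # about 2.649 and 446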

What Can Go Wrong?

• Don’t expect the re-expressed model to be perfect.

• Don’t use R or R-squared to decide which is the best model.

• A transformation won’t make a multimodal distribution unimodal.

• You can’t transform data into a linear form if the scatterplot rises and falls in a cyclical manner.

• If your data have values that are zero or negative, some transformations (logs, for example) can't be done. Sometimes, if the negative values are close to zero, you can add a very small constant (1/2 and 1/6 are common) to all of the data values to make them positive (see the sketch after this list).

• If you have data that are dates (years), pick a reference year to be zero, and look at years from that point forward.
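Here is a minimal sketch of that add-a-small-constant workaround (Python/NumPy, made-up values that include a zero). Whether 1/2, 1/6, or some other constant is appropriate depends on the data, so treat this only as an illustration:

import numpy as np

y = np.array([0.0, 0.4, 1.3, 3.8, 9.5])   # the zero makes log10(y) undefined (-inf)

# Shift all values by a small constant first, then re-express.
shifted = y + 1/6
print(np.log10(shifted))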

What Can Go Wrong?

• Keep the model simple – avoid making multiple transformations on the same variable, or mixing quite different transformations on both variables.

• Stay close to the ladder of powers.

Assignment

Read Chapter 9

Exercises #15, 17-20, 25

[Cartoon from xkcd.com]