STAT 260: Lecture 9 - stats.otago.ac.nz

Post on 24-Feb-2022


STAT 260: Lecture 9

Mik Black

STAT 260: Lecture 9 Slide 1

More ggplot2. . .

• Today: faceting and lines
• As always, don’t forget to load the ggplot2 package before we start:

library(ggplot2)

• And later I also use dplyr:

library(dplyr)

• Might not get through all these slides today. . .

STAT 260: Lecture 9 Slide 2

Faceting

• Faceting refers to the technique of making a particular plot across the levels of a discrete variable (i.e., a factor in R).
• ggplot gives us the ability to do this in a single plot call via the facet_wrap function.
• We’ll look at this functionality using one of the data sets that are part of the ggplot2 package - the “mpg” data.
• This is a data set that records the gas mileage of automobiles relative to their other characteristics.

STAT 260: Lecture 9 Slide 3

MPG data - variables

• manufacturer: name of manufacturer
• model: model name
• displ: engine displacement, in liters
• year: year of manufacture
• cyl: number of cylinders
• trans: type of transmission
• drv: type of drive train (f = front-wheel drive, r = rear-wheel drive, 4 = 4wd)
• cty: city miles per gallon
• hwy: highway miles per gallon
• fl: fuel type
• class: “type” of car

STAT 260: Lecture 9 Slide 4

MPG data - structure

str(mpg)

## tibble[,11] [234 x 11] (S3: tbl_df/tbl/data.frame)
##  $ manufacturer: chr [1:234] "audi" "audi" "audi" "audi" ...
##  $ model       : chr [1:234] "a4" "a4" "a4" "a4" ...
##  $ displ       : num [1:234] 1.8 1.8 2 2 2.8 2.8 3.1 1.8 1.8 2 ...
##  $ year        : int [1:234] 1999 1999 2008 2008 1999 1999 2008 1999 1999 2008 ...
##  $ cyl         : int [1:234] 4 4 4 4 6 6 6 4 4 4 ...
##  $ trans       : chr [1:234] "auto(l5)" "manual(m5)" "manual(m6)" "auto(av)" ...
##  $ drv         : chr [1:234] "f" "f" "f" "f" ...
##  $ cty         : int [1:234] 18 21 20 21 16 18 18 18 16 20 ...
##  $ hwy         : int [1:234] 29 29 31 30 26 26 27 26 25 28 ...
##  $ fl          : chr [1:234] "p" "p" "p" "p" ...
##  $ class       : chr [1:234] "compact" "compact" "compact" "compact" ...

STAT 260: Lecture 9 Slide 5

MPG data - first rows

head(mpg)

## # A tibble: 6 x 11
##   manufacturer model displ  year   cyl trans      drv     cty   hwy fl    class
##   <chr>        <chr> <dbl> <int> <int> <chr>      <chr> <int> <int> <chr> <chr>
## 1 audi         a4      1.8  1999     4 auto(l5)   f        18    29 p     compa~
## 2 audi         a4      1.8  1999     4 manual(m5) f        21    29 p     compa~
## 3 audi         a4      2    2008     4 manual(m6) f        20    31 p     compa~
## 4 audi         a4      2    2008     4 auto(av)   f        21    30 p     compa~
## 5 audi         a4      2.8  1999     6 auto(l5)   f        16    26 p     compa~
## 6 audi         a4      2.8  1999     6 manual(m5) f        18    26 p     compa~

STAT 260: Lecture 9 Slide 6

MPG data - scatterplot of highway versus city mileage

ggplot(data=mpg, aes(x=hwy, y=cty)) + geom_point()

[Figure: scatterplot of cty (y-axis) against hwy (x-axis)]

STAT 260: Lecture 9 Slide 7

Aside - adding jitter (reminder from last lecture)

• there is a lot of overplotting going on - sometimes adding a little noise improves the plot by making the relationship more obvious (i.e., revealing the overplotted data points):

ggplot(data=mpg, aes(x=hwy, y=cty)) + geom_point(position="jitter")

[Figure: jittered scatterplot of cty (y-axis) against hwy (x-axis)]
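Because the jitter is random, the plot changes slightly every time it is drawn. A minimal sketch of one way to control this, assuming ggplot2 version 3.0.0 or later (where position_jitter accepts a seed argument). The width, height and seed values are arbitrary choices for illustration, not values from the lecture.

```r
library(ggplot2)

# position_jitter() is the function behind position="jitter".
# width/height set the maximum amount of noise added in each direction;
# seed makes the noise reproducible. All three values are illustrative.
jit <- position_jitter(width = 0.3, height = 0.3, seed = 260)

p_jit <- ggplot(data = mpg, aes(x = hwy, y = cty)) +
  geom_point(position = jit)

# Building the plot twice gives identical jittered coordinates
build1 <- ggplot_build(p_jit)$data[[1]]
build2 <- ggplot_build(p_jit)$data[[1]]
same_jitter <- identical(build1$x, build2$x) && identical(build1$y, build2$y)
same_jitter

p_jit
```

Smaller width and height values keep the points closer to their true positions, at the cost of revealing fewer of the overplotted points.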

STAT 260: Lecture 9 Slide 8

Colour by vehicle class

• let’s use colour to add vehicle class information to the plot:

ggplot(data=mpg, aes(x=hwy, y=cty, colour=class)) + geom_point(position="jitter")

[Figure: jittered scatterplot of cty against hwy, points coloured by class (legend: 2seater, compact, midsize, minivan, pickup, subcompact, suv)]

STAT 260: Lecture 9 Slide 9

Hard to see what is going on. . .

• using colour to denote vehicle class does work, but it is hard to see exactly what the relationship is between city and highway mileage for each class.
• this is where “faceting” comes in - we can ask ggplot to make the scatterplot for each type of vehicle.
• to do this we use the facet_wrap function, along with the ~ operator (you’ll learn more about this later in the course):

ggplot(data=mpg, aes(x=hwy, y=cty)) + geom_point(position='jitter') +
  facet_wrap(~class)

STAT 260: Lecture 9 Slide 10

Facet by vehicle class

ggplot(data=mpg, aes(x=hwy, y=cty)) + geom_point(position='jitter') +
  facet_wrap(~class)

[Figure: jittered scatterplots of cty against hwy, one panel per class (2seater, compact, midsize, minivan, pickup, subcompact, suv) arranged in three rows]

STAT 260: Lecture 9 Slide 11

Facet by vehicle class

• we can also specify the number of rows to use for faceting:

ggplot(data=mpg, aes(x=hwy, y=cty)) + geom_point(position='jitter') +
  facet_wrap(~class, nrow=2)

[Figure: jittered scatterplots of cty against hwy, one panel per class, arranged in two rows (2seater, compact, midsize, minivan / pickup, subcompact, suv)]
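The number of columns can be fixed instead of the number of rows, via the ncol argument of facet_wrap. The sketch below also inspects the panel grid that ggplot computes; reading it from ggplot_build(...)$layout$layout relies on ggplot2 internals and is an assumption about the installed version rather than something covered in the lecture.

```r
library(ggplot2)

# Two rows, as on the slide
p_rows <- ggplot(data = mpg, aes(x = hwy, y = cty)) +
  geom_point(position = "jitter") +
  facet_wrap(~class, nrow = 2)

# Two columns instead: ggplot works out how many rows are needed
p_cols <- ggplot(data = mpg, aes(x = hwy, y = cty)) +
  geom_point(position = "jitter") +
  facet_wrap(~class, ncol = 2)

# One row per panel, with the ROW and COL position of each panel
layout_rows <- ggplot_build(p_rows)$layout$layout
layout_cols <- ggplot_build(p_cols)$layout$layout

c(rows = max(layout_rows$ROW), cols = max(layout_rows$COL))
c(rows = max(layout_cols$ROW), cols = max(layout_cols$COL))
```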

STAT 260: Lecture 9 Slide 12

Facet mileage histograms by drive type

• we can use faceting for (almost) any sort of plot:

ggplot(data=mpg, aes(x=hwy)) + geom_histogram(bins=15) + facet_wrap(~drv)

[Figure: histograms of hwy (y-axis: count), one panel per drv level (4, f, r)]

STAT 260: Lecture 9 Slide 13

More information: engine displacement

• engine displacement, displ, is a continuous variable:

ggplot(data=mpg, aes(x=displ)) + geom_histogram(bins=15, colour='black', fill='white')

[Figure: histogram of displ (y-axis: count)]

STAT 260: Lecture 9 Slide 14

Engine displacement

• definitely varies by vehicle class:

ggplot(data=mpg, aes(x=class, y=displ)) + geom_boxplot() +
  geom_jitter(width=0.15, alpha=0.3)

[Figure: boxplots of displ by class (2seater, compact, midsize, minivan, pickup, subcompact, suv), with jittered points overlaid]

STAT 260: Lecture 9 Slide 15

Colour by engine displacement

• can also colour by a continuous variable (mentioned this at the end of the last lecture):

ggplot(data=mpg, aes(x=hwy, y=cty, colour=displ)) + geom_point(position="jitter")

[Figure: jittered scatterplot of cty against hwy, points coloured on a continuous scale by displ]

STAT 260: Lecture 9 Slide 16

Facet by vehicle class & colour by displacement

• and now let’s facet by class!

ggplot(data=mpg, aes(x=hwy, y=cty, colour=displ)) + geom_point(position='jitter') +
  facet_wrap(~class)

[Figure: jittered scatterplots of cty against hwy, one panel per class (2seater, compact, midsize, minivan, pickup, subcompact, suv), points coloured on a continuous scale by displ]

STAT 260: Lecture 9 Slide 17

Linking point size to a variable

• instead of colour we could use point size to include information about a variable:

ggplot(data=mpg, aes(x=hwy, y=cty, size=displ)) + geom_point()

[Figure: scatterplot of cty against hwy, point size scaled by displ]

STAT 260: Lecture 9 Slide 18

Linking point size to a variable (alpha)

• add transparency via alpha levels:

ggplot(data=mpg, aes(x=hwy, y=cty, size=displ)) +
  geom_point(alpha=0.2)

[Figure: scatterplot of cty against hwy, point size scaled by displ, with semi-transparent points]

STAT 260: Lecture 9 Slide 19

Linking point size to a variable (with alpha and jitter)

• now add some jitter. . .

ggplot(data=mpg, aes(x=hwy, y=cty, size=displ)) + geom_point(alpha=0.2, position='jitter')

[Figure: jittered scatterplot of cty against hwy, point size scaled by displ, with semi-transparent points]

STAT 260: Lecture 9 Slide 20

Local aesthetics

• ggplot allows us to specify aesthetics locally (i.e., specific to a geom).
• if the local values differ from the aes values specified in the main ggplot call, then the local aesthetics will be used for that particular geometric object.
• this becomes useful when customising multiple layers in a single plot - we’ll see an example of this later in the lecture.
• here is an example of specifying the point size within geom_point (it gives the same result as above):

ggplot(data=mpg, aes(x=hwy, y=cty)) +
  geom_point(aes(size=displ), alpha=0.2, position='jitter')

STAT 260: Lecture 9 Slide 21

Local aesthetics

ggplot(data=mpg, aes(x=hwy, y=cty)) +
  geom_point(aes(size=displ), alpha=0.2, position='jitter')

[Figure: jittered scatterplot of cty against hwy, point size scaled by displ, with semi-transparent points]
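One way to check that the global and local specifications really give the same result is to compare the data ggplot computes for the point layer. This is only a sketch: the jitter is left out so that random noise does not affect the comparison, and reading layer data via ggplot_build is an assumption about how one might inspect a plot, not part of the lecture.

```r
library(ggplot2)

# size mapped globally, in the main ggplot call
p_global <- ggplot(data = mpg, aes(x = hwy, y = cty, size = displ)) +
  geom_point(alpha = 0.2)

# size mapped locally, inside geom_point
p_local <- ggplot(data = mpg, aes(x = hwy, y = cty)) +
  geom_point(aes(size = displ), alpha = 0.2)

# Data computed for layer 1 (the points) in each plot
pts_global <- ggplot_build(p_global)$data[[1]]
pts_local  <- ggplot_build(p_local)$data[[1]]

# Positions and sizes agree between the two specifications
same_layer <- isTRUE(all.equal(pts_global[, c("x", "y", "size")],
                               pts_local[, c("x", "y", "size")]))
same_layer
```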

STAT 260: Lecture 9 Slide 22

Adding lines

• another very powerful feature of ggplot is the ability to add lines to a plot.
• in particular, lines that are generated by the application of a statistical procedure to the data in the plot. For example:
  - linear regression
  - local smoothing techniques such as “loess”
• here we are using the geom_smooth geometric object.
• if no method is specified, geom_smooth will choose a method based on sample size: “loess” for n<1000, otherwise a generalised additive model is used (don’t worry about this for now. . . )
• the syntax is:

ggplot(data=mpg, aes(x=displ, y=hwy)) + geom_point() + geom_smooth()
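The automatic choice can also be written out explicitly, which documents the smoother in the code and avoids the message that geom_smooth prints when it picks a method itself. A minimal sketch, assuming a ggplot2 version where the automatic choice for fewer than 1000 observations is loess with formula y ~ x. The object names are made up for illustration.

```r
library(ggplot2)

# Default: method chosen automatically (mpg has 234 rows, so loess)
p_auto <- ggplot(data = mpg, aes(x = displ, y = hwy)) +
  geom_point() + geom_smooth()

# Explicit: the same smoother, stated in the code
p_loess <- ggplot(data = mpg, aes(x = displ, y = hwy)) +
  geom_point() + geom_smooth(method = "loess", formula = y ~ x)

# Layer 2 is the smooth; compare the two fitted curves
fit_auto  <- suppressMessages(ggplot_build(p_auto))$data[[2]]
fit_loess <- ggplot_build(p_loess)$data[[2]]
same_curve <- isTRUE(all.equal(fit_auto$y, fit_loess$y))
same_curve
```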

STAT 260: Lecture 9 Slide 23

Adding lines

ggplot(data=mpg, aes(x=displ, y=hwy)) + geom_point() + geom_smooth()

[Figure: scatterplot of hwy against displ with a smooth line and confidence band]

STAT 260: Lecture 9 Slide 24

Adding lines: straight line

• use geom_smooth(method=lm) to fit a linear model (i.e., simple linear regression) to the data:

ggplot(data=mpg, aes(x=displ, y=hwy)) + geom_point() + geom_smooth(method=lm)

[Figure: scatterplot of hwy against displ with a fitted straight line and confidence band]

STAT 260: Lecture 9 Slide 25

Linear regression

• Here the geom_smooth() function is fitting a linear regression, and then adding that line (and confidence interval, if se=TRUE) to the plot. Let’s check manually:

linreg = lm(hwy ~ displ, data=mpg)
summary(linreg)$coefficients

##              Estimate Std. Error   t value      Pr(>|t|)
## (Intercept) 35.697651  0.7203676  49.55477 2.123519e-125
## displ       -3.530589  0.1945137 -18.15085  2.038974e-46
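The two estimates define the fitted line, hwy = intercept + slope × displ. As a small worked example, the sketch below evaluates that line by hand and with predict(). The displacement of 2 litres is an arbitrary value chosen for illustration.

```r
library(ggplot2)   # only needed for the mpg data

linreg <- lm(hwy ~ displ, data = mpg)
b <- coef(linreg)   # b[1] = intercept, b[2] = slope for displ

# Fitted line: hwy = intercept + slope * displ
# displ = 2 is an arbitrary illustrative value
by_hand    <- unname(b[1] + b[2] * 2)
by_predict <- unname(predict(linreg, newdata = data.frame(displ = 2)))

c(by_hand = by_hand, by_predict = by_predict)
```

The negative slope says that each extra litre of displacement is associated with roughly 3.5 fewer highway miles per gallon.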

STAT 260: Lecture 9 Slide 26

Add regression line to plot (base R)

plot(mpg$displ, mpg$hwy)
abline(linreg)

[Figure: base R scatterplot of mpg$hwy against mpg$displ with the fitted regression line]

STAT 260: Lecture 9 Slide 27

Calculating and adding confidence intervals

newx = seq(min(mpg$displ), max(mpg$displ), by = 0.05)
conf_interval = predict(linreg, newdata=data.frame(displ=newx),
                        interval="confidence", level = 0.95)
ci = data.frame(newx, conf_interval)
head(ci)

##   newx      fit      lwr      upr
## 1 1.60 30.04871 29.17768 30.91974
## 2 1.65 29.87218 29.01686 30.72750
## 3 1.70 29.69565 28.85590 30.53540
## 4 1.75 29.51912 28.69479 30.34345
## 5 1.80 29.34259 28.53352 30.15166
## 6 1.85 29.16606 28.37208 29.96005
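The lwr and upr columns follow the usual formula for a confidence interval on the fitted mean: fit ± t × (standard error of the fit), where t is the 97.5% quantile of the t distribution on the residual degrees of freedom. A sketch of the same calculation done step by step, where the data frame name manual is made up for illustration:

```r
library(ggplot2)   # only needed for the mpg data

linreg <- lm(hwy ~ displ, data = mpg)
newx <- seq(min(mpg$displ), max(mpg$displ), by = 0.05)

# se.fit=TRUE also returns the standard error of each fitted value
pred <- predict(linreg, newdata = data.frame(displ = newx), se.fit = TRUE)

# 95% interval: fit +/- t quantile * standard error,
# using the residual degrees of freedom of the model (n - 2 here)
tcrit <- qt(0.975, df = df.residual(linreg))
manual <- data.frame(newx,
                     fit = pred$fit,
                     lwr = pred$fit - tcrit * pred$se.fit,
                     upr = pred$fit + tcrit * pred$se.fit)

head(manual)
```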

STAT 260: Lecture 9 Slide 28

Calculating and adding confidence intervals

plot(mpg$displ, mpg$hwy)
abline(linreg, col="lightblue")
lines(ci$newx, ci$lwr, col="blue", lty=2)
lines(ci$newx, ci$upr, col="blue", lty=2)

[Figure: base R scatterplot of mpg$hwy against mpg$displ with the regression line (light blue) and confidence limits (blue, dashed)]

STAT 260: Lecture 9 Slide 29

Check against ggplot

ggplot(data=mpg, aes(x=displ, y=hwy)) + geom_point() +
  geom_smooth(method='lm', se=TRUE) +
  geom_abline(intercept=linreg$coef[1], slope=linreg$coef[2], colour='red') +
  geom_line(data=ci, aes(x=newx, y=lwr)) + geom_line(data=ci, aes(x=newx, y=upr))

[Figure: scatterplot of hwy against displ showing the geom_smooth line and confidence band, the lm line (red), and the manually calculated confidence limits]
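The visual check can be backed up numerically. The sketch below pulls the values that geom_smooth computed out of the built plot and compares them with predict() at the same x values. Extracting layer data with ggplot_build, and the column names x, y, ymin and ymax, are assumptions about ggplot2's computed data rather than something shown in the lecture.

```r
library(ggplot2)

linreg <- lm(hwy ~ displ, data = mpg)

p_lm <- ggplot(data = mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, se = TRUE)

# Layer 2 holds what geom_smooth computed: x, y, ymin, ymax
sm <- ggplot_build(p_lm)$data[[2]]

# Intervals from predict() at exactly the same x values
ref <- predict(linreg, newdata = data.frame(displ = sm$x),
               interval = "confidence", level = 0.95)

# Largest absolute difference across line and both limits
max_diff <- max(abs(sm$y - ref[, "fit"]),
                abs(sm$ymin - ref[, "lwr"]),
                abs(sm$ymax - ref[, "upr"]))
max_diff
```

A value of max_diff that is zero up to rounding error indicates that the ribbon drawn by geom_smooth and the manually calculated interval coincide.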

STAT 260: Lecture 9 Slide 30

Adding lines: remove confidence interval

ggplot(data=mpg, aes(x=displ, y=hwy)) + geom_point() + geom_smooth(se=FALSE)

[Figure: scatterplot of hwy against displ with a smooth line and no confidence band]

STAT 260: Lecture 9 Slide 31

Colour points by class

• It would be useful to colour the points on the plot by vehicle class (2seater, compact etc)
• Intuitively we can do this by setting colour=class.
• Works when we only have geom_point - what happens when we also have the geom_smooth layer in the plot?

STAT 260: Lecture 9 Slide 32

Colour points by class: oops. . .

ggplot(data=mpg, aes(x=displ, y=hwy, colour=class)) + geom_point() + geom_smooth()

[Figure: scatterplot of hwy against displ with both the points and the smooth lines coloured by class (legend: 2seater, compact, midsize, minivan, pickup, subcompact, suv)]

STAT 260: Lecture 9 Slide 33

What happened?

• The colour=class specification in the main ggplot aesthetics was used for all geometric objects in the plot.
• What if we only want it to apply to geom_point but not geom_smooth?
• Remember the example with point size from above. . . ?
• We can specify the colour=class aesthetic within geom_point so that it is only used for that layer:

ggplot(data=mpg, aes(x=displ, y=hwy)) + geom_point(aes(colour=class)) +
  geom_smooth()

STAT 260: Lecture 9 Slide 34

Local aesthetics to the rescue!

ggplot(data=mpg, aes(x=displ, y=hwy)) + geom_point(aes(colour=class)) +
  geom_smooth()

[Figure: scatterplot of hwy against displ, points coloured by class (legend: 2seater, compact, midsize, minivan, pickup, subcompact, suv), with a single smooth line and confidence band]
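To confirm what each layer received, the built plot can be inspected: the point layer should carry one colour per class, while the smooth layer should form a single group. A sketch under the assumption that layer data can be read via ggplot_build. The loess method is written out explicitly here only to avoid the method message.

```r
library(ggplot2)

p_local <- ggplot(data = mpg, aes(x = displ, y = hwy)) +
  geom_point(aes(colour = class)) +
  geom_smooth(method = "loess", formula = y ~ x)

built <- ggplot_build(p_local)

# Layer 1 (points): one colour per vehicle class
n_point_colours <- length(unique(built$data[[1]]$colour))

# Layer 2 (smooth): a single group, i.e. one curve for all the data
n_smooth_groups <- length(unique(built$data[[2]]$group))

c(point_colours = n_point_colours, smooth_groups = n_smooth_groups)
```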

STAT 260: Lecture 9 Slide 35

Lines and facets

• we can also add lines to our faceted plots:

ggplot(data=mpg, aes(x=displ, y=hwy)) + geom_point() +
  geom_smooth(method=lm, se=FALSE) + facet_wrap(~drv, nrow=1)

[Figure: scatterplots of hwy against displ with fitted straight lines, one panel per drv level (4, f, r)]

STAT 260: Lecture 9 Slide 36

Caution! Faceting and confidence intervals

• When the geom_smooth function is used to add lines (and confidence intervals), the calculations are performed per facet group.
• This can lead to differences in the confidence intervals that are calculated, compared to a regression model fit to the full data set (one that includes the faceting variable and its interaction with the predictor):
  - the regression lines will be the same
  - the confidence intervals will be different
• This occurs because in the full regression model, all of the data points are used to estimate the standard error, whereas in the per-facet model, only the data points from that group are used.

STAT 260: Lecture 9 Slide 37

Faceting and confidence intervals

ggplot(data=mpg, aes(x=displ, y=hwy)) + geom_point() +
  geom_smooth(method='lm', se=TRUE) + facet_wrap(~drv)

[Figure: scatterplots of hwy against displ with fitted straight lines and confidence bands, one panel per drv level (4, f, r)]

STAT 260: Lecture 9 Slide 38

Close up for the rear wheel drive group

rwd = filter(mpg, drv=="r")
ggplot(data=rwd, aes(x=displ, y=hwy)) + geom_point() +
  geom_smooth(method='lm', se=TRUE) + xlim(3,7) + ylim(10,30)

[Figure: scatterplot of hwy against displ for the rear wheel drive group, with fitted straight line and confidence band]

STAT 260: Lecture 9 Slide 39

Regression model, with drv interaction term

linreg2 = lm(hwy ~ displ*drv, data=mpg)
summary(linreg2)$coef

##               Estimate Std. Error    t value     Pr(>|t|)
## (Intercept) 30.6831131  1.0960630  27.993933 1.018637e-75
## displ       -2.8784863  0.2637577 -10.913372 1.392287e-22
## drvf         6.6949631  1.5670461   4.272346 2.841696e-05
## drvr        -4.9033952  4.1821302  -1.172464 2.422346e-01
## displ:drvf  -0.7243016  0.4979149  -1.454669 1.471361e-01
## displ:drvr   1.9550477  0.8147555   2.399552 1.721899e-02
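In this model the line for each drive type is the 4wd baseline plus that group's adjustments. For rear wheel drive the intercept is 30.683 − 4.903 and the slope is −2.878 + 1.955. The sketch below checks that this matches a separate regression on the rear wheel drive rows only, which is what geom_smooth fits within that facet. The names full_r, sub_r and linreg_r are made up for illustration.

```r
library(ggplot2)   # only needed for the mpg data

linreg2 <- lm(hwy ~ displ * drv, data = mpg)
b <- coef(linreg2)

# Line for the rear wheel drive group implied by the full model:
# baseline (4wd) intercept and slope plus the drv = "r" adjustments
full_r <- c(intercept = unname(b["(Intercept)"] + b["drvr"]),
            slope     = unname(b["displ"] + b["displ:drvr"]))

# Separate regression using only the rear wheel drive rows
linreg_r <- lm(hwy ~ displ, data = subset(mpg, drv == "r"))
sub_r <- c(intercept = unname(coef(linreg_r)[1]),
           slope     = unname(coef(linreg_r)[2]))

# Same line from both approaches
rbind(full_r, sub_r)

# The residual standard error and degrees of freedom differ, though,
# and these determine the width of the confidence intervals
c(full = sigma(linreg2), rear_only = sigma(linreg_r))
c(full = df.residual(linreg2), rear_only = df.residual(linreg_r))
```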

STAT 260: Lecture 9 Slide 40

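Confidence intervals on ggplot

• Confidence intervals from the full regression model (using all data, with the drv interaction term; black lines) are narrower than the “per-facet” interval calculated by geom_smooth.

[Figure: close-up scatterplot of hwy against displ for the rear wheel drive group, showing the geom_smooth line and confidence band with the full-model confidence limits as black lines]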
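The slide does not include the code behind this figure. The following is one way such a plot might be produced, a sketch that reuses the approach from the earlier "Check against ggplot" slide and is not necessarily the code that generated the original figure. The names ci_r and p_ci are made up for illustration.

```r
library(ggplot2)
library(dplyr)

linreg2 <- lm(hwy ~ displ * drv, data = mpg)
rwd <- filter(mpg, drv == "r")

# Confidence interval from the full model, evaluated for drv = "r"
newx <- seq(min(rwd$displ), max(rwd$displ), by = 0.05)
ci_r <- predict(linreg2, newdata = data.frame(displ = newx, drv = "r"),
                interval = "confidence", level = 0.95)
ci_r <- data.frame(newx, ci_r)

# Grey band: per-facet interval from geom_smooth (rear wheel drive rows only)
# Black lines: interval from the full model with the drv interaction term
p_ci <- ggplot(data = rwd, aes(x = displ, y = hwy)) + geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, se = TRUE) +
  geom_line(data = ci_r, aes(x = newx, y = lwr)) +
  geom_line(data = ci_r, aes(x = newx, y = upr)) +
  xlim(3, 7) + ylim(10, 30)

p_ci
```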

STAT 260: Lecture 9 Slide 41