[Event series] Data exploration with modern R

transcript

Exploring data with modern R

Winston Chang, RStudio

2016–12–21

https://hea-www.harvard.edu/~fine/Observatory/women.html

Modern R

A brief history of R

• In the beginning, there was S, developed at Bell Labs in the 1970s.

• S was owned and licensed by AT&T.

• In the 1990s, two professors from New Zealand created a free, open-source reimplementation of S, called R.

• Many of the unusual features of R exist because they came from S.

• R itself is somewhat different from S and has a very flexible syntax.

The tidyverse

install.packages("tidyverse") # Automatically installs ggplot2, dplyr, tidyr, and others.

library(tidyverse)

tidyverse_update() # Update all tidyverse packages to the latest version.

Getting started

Looking at data with R

faithful

head(faithful)

str(faithful)

View(faithful) # In RStudio

[Figure: scatter plot of the faithful data set, with eruptions (2 to 5) on the x-axis and waiting on the y-axis]

library(ggplot2)

ggplot(data=faithful, mapping=aes(x=eruptions, y=waiting)) + geom_point()

# More concisely:
ggplot(faithful, aes(eruptions, waiting)) + geom_point()

[Figure: histogram; x-axis: eruptions]

ggplot(faithful, aes(x=eruptions)) + geom_histogram()

ggplot(faithful, aes(x=eruptions)) + geom_histogram(binwidth=.25)

Your turn

Inspect the diamonds data set. With diamonds, make a histogram of the carat variable. Experiment with different bin sizes. What patterns do you see?

Inspect the mpg data set. With mpg, make a scatter plot showing the relationship between displ and hwy.

[Figure: histogram; x-axis: carat]

ggplot(diamonds, aes(x=carat)) + geom_histogram()

ggplot(diamonds, aes(x=carat)) + geom_histogram(binwidth=0.3)

[Figure: histogram; x-axis: carat]

ggplot(mpg, aes(x=displ, y=hwy)) + geom_point()

[Figure: scatter plot; x-axis: displ]

ggplot(mpg, aes(x=displ, y=hwy)) + geom_point() + geom_smooth(method=lm)

[Figure: scatter plot; x-axis: displ]

head(mpg)

str(mpg)

View(mpg)

ggplot(mpg, aes(x=displ, y=hwy, color=drv)) + geom_point()

[Figure: scatter plot with displ (2 to 7) on the x-axis; legend for class: 2seater, compact, midsize, minivan, pickup, subcompact]

ggplot(mpg, aes(x=displ, y=hwy, color=class)) + geom_point()

Your turn

What happens if you use shape instead of color?

Run ?geom_smooth to see the documentation. Then remove the confidence region from the model line. What happens if you add a model line and map a variable to color?
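One possible answer to the geom_smooth questions, as a minimal sketch rather than the presenter's own solution. It assumes the se argument of geom_smooth() is what controls the confidence region, and it uses drv as the color variable purely for illustration.

```r
library(ggplot2)

# se = FALSE drops the shaded confidence region around the fitted line.
p_line <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  geom_smooth(method = lm, se = FALSE)

# With color mapped in ggplot(), both layers inherit the mapping, so
# geom_smooth() fits one line per drv group instead of one overall line.
p_groups <- ggplot(mpg, aes(x = displ, y = hwy, color = drv)) +
  geom_point() +
  geom_smooth(method = lm, se = FALSE)

# Type p_line or p_groups at the console to draw the plots.
```

If a single overall line is wanted together with colored points, the color mapping can be moved from ggplot() into geom_point() so that only the point layer uses it.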

Faceting

ggplot(mpg, aes(x=displ, y=hwy)) + geom_point() + facet_wrap(~class)

[Figure: hwy versus displ, one panel per class; panel titles: 2seater, compact, midsize, minivan, pickup, subcompact]

[Figure: hwy versus displ, one column of panels per cyl value: 4, 5, 6, 8]

ggplot(mpg, aes(x=displ, y=hwy)) + geom_point() + facet_grid(. ~ cyl)

[Figure: hwy versus displ (2 to 7), one row of panels per drv value]

ggplot(mpg, aes(x=displ, y=hwy)) + geom_point() + facet_grid(drv ~ .)

[Figure: hwy versus displ in a grid of panels, with rows for drv and columns for cyl values 4, 5, 6, 8]

ggplot(mpg, aes(x=displ, y=hwy)) + geom_point() + facet_grid(drv ~ cyl)

Your turn

Try faceting with a histogram
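A minimal sketch of one way to try this, not the presenter's solution. The choice of the diamonds data set, the cut variable for the panels, and binwidth = 0.1 are all assumptions made for illustration.

```r
library(ggplot2)

# One histogram panel per level of cut; the bin width is an arbitrary choice.
p_hist <- ggplot(diamonds, aes(x = carat)) +
  geom_histogram(binwidth = 0.1) +
  facet_wrap(~cut)

# Type p_hist at the console to draw the plot.
```

By default all panels share the same axes, which makes the counts comparable across panels but can flatten the smaller groups.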

ggplot2 concepts

Points

Error bars

Box plot

Aesthetics

Y position

X position

Aesthetics

Mapping data values to aesthetics

var1 var2 var3
2    2    5
3    4    0
5    8    4
7    5    1

[Figure: scatter plot of var2 versus var1; the color legend for var3 runs from 0 to 5]

ggplot(dat, aes(x=var1, y=var2)) + geom_point()

ggplot(dat, aes(x=var1, y=var2, color=var3)) + geom_point()
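The slides use dat without showing how it is created. The sketch below assumes it is a plain data frame holding the four rows listed on the slide, then builds the two plots so the effect of the extra mapping can be compared.

```r
library(ggplot2)

# Assumed reconstruction of `dat` from the four rows on the slide.
dat <- data.frame(
  var1 = c(2, 3, 5, 7),
  var2 = c(2, 4, 8, 5),
  var3 = c(5, 0, 4, 1)
)

# x and y positions come from var1 and var2.
p_xy <- ggplot(dat, aes(x = var1, y = var2)) + geom_point()

# Adding color = var3 maps a third column to point color. var3 is numeric,
# so ggplot2 uses a continuous color scale and draws a color bar legend.
p_color <- ggplot(dat, aes(x = var1, y = var2, color = var3)) + geom_point()

# Type p_xy or p_color at the console to draw the plots.
```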

Setting data values to aesthetics

var1 var2 var3
2    2    5
3    4    0
5    8    4
7    5    1

ggplot(dat, aes(x=var1, y=var2)) + geom_point(color="red")

ggplot(dat, aes(x=var1, y=var2)) + geom_point(color="red", size=6)
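The difference between setting and mapping is easiest to see when the two are confused. The sketch below assumes the same four-row dat as on the slide and compares color="red" outside aes() with the same string placed inside aes().

```r
library(ggplot2)

# Assumed reconstruction of `dat` from the slide.
dat <- data.frame(
  var1 = c(2, 3, 5, 7),
  var2 = c(2, 4, 8, 5),
  var3 = c(5, 0, 4, 1)
)

# Setting: outside aes(), "red" is a fixed value used for every point.
p_set <- ggplot(dat, aes(x = var1, y = var2)) +
  geom_point(color = "red")

# Common slip: inside aes(), "red" is treated as data (a one-level
# category), so the points get a palette color and a legend appears.
p_mapped <- ggplot(dat, aes(x = var1, y = var2)) +
  geom_point(aes(color = "red"))

unique(layer_data(p_set)$colour)     # the literal value "red"
unique(layer_data(p_mapped)$colour)  # a color chosen by the scale, not "red"
```

The rule of thumb is that anything inside aes() names a column or expression to be scaled, while anything outside it is used exactly as given.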

Different geoms

ggplot(dat, aes(x=var1, y=var2)) + geom_point()


ggplot(dat, aes(x=var1, y=var2)) + geom_line()


ggplot(dat, aes(x=var1, y=var2)) + geom_bar(stat="identity")

Using multiple geoms


ggplot(dat, aes(x=var1, y=var2)) + geom_point() + geom_line()

# Equivalent to:
ggplot(dat) + geom_point(aes(x=var1, y=var2)) + geom_line(aes(x=var1, y=var2))

ggplot() + geom_point(aes(x=var1, y=var2), data=dat) + geom_line(aes(x=var1, y=var2), data=dat)

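In ggplot(dat, aes(x=var1, y=var2)), the first argument sets the default data and the second sets the default mapping.

Each geom can override these defaults with its own data and aes() arguments.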
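A minimal sketch of a layer that overrides the defaults, assuming the same four-row dat as on the earlier slides. The subset condition var3 > 0 and the size value are arbitrary choices for illustration.

```r
library(ggplot2)

# Assumed reconstruction of `dat` from the slides.
dat <- data.frame(
  var1 = c(2, 3, 5, 7),
  var2 = c(2, 4, 8, 5),
  var3 = c(5, 0, 4, 1)
)

# The line layer uses the defaults from ggplot(): all of dat, x = var1, y = var2.
# The point layer overrides both: it draws only the rows with var3 > 0 and
# adds a color mapping that applies to this layer alone.
p <- ggplot(dat, aes(x = var1, y = var2)) +
  geom_line() +
  geom_point(aes(color = var3), data = dat[dat$var3 > 0, ], size = 3)

# Type p at the console to draw the plot.
```

The x and y mappings are still inherited by the point layer, because a layer's aes() is merged with the defaults rather than replacing them.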

       Discrete                  Continuous
Color  Rainbow of colors         Gradient from light blue to dark blue
Size   Discrete size steps       Linear mapping between radius and value
Shape  Different shape for each  Shouldn’t work

Mapping discrete variables

var1 var2 var3
A    G1   5
B    G0   0
A    G2   4
B    G1   1
A    G0   6
B    G2   3

ggplot(dat2, aes(x=var1, y=var3)) + geom_point()

ggplot(dat2, aes(x=var1, y=var3, color=var2)) + geom_point()

[Figure: var3 versus var1 (A, B), with points colored by var2 (legend: G0, G1, G2)]
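As with dat, the slides do not show how dat2 is created. The sketch below assumes it is a data frame holding the six rows listed on the slide, and maps the discrete variable var2 first to color and then to shape.

```r
library(ggplot2)

# Assumed reconstruction of `dat2` from the six rows on the slide.
dat2 <- data.frame(
  var1 = c("A", "B", "A", "B", "A", "B"),
  var2 = c("G1", "G0", "G2", "G1", "G0", "G2"),
  var3 = c(5, 0, 4, 1, 6, 3)
)

# var2 is categorical, so color gets one distinct hue per level.
p_color <- ggplot(dat2, aes(x = var1, y = var3, color = var2)) + geom_point()

# The same variable mapped to shape gives one point shape per level.
p_shape <- ggplot(dat2, aes(x = var1, y = var3, shape = var2)) + geom_point()

# Type p_color or p_shape at the console to draw the plots.
```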

Data wrangling with modern R

Tidyverse = Tidy + universe

Source: https://www.flickr.com/photos/rubbermaid/7203340384

Source: http://hubblesite.org/newscenter/archive/releases/2014/27/image/a/

Tibbles

faithful

as_tibble(faithful) # as.tbl() in older versions of dplyr

Tidy data


Each variable is in a column

Each observation is in a row

Example of non-tidy data

subject sex cond1 cond2 cond3

1 M 7.9 12.3 10.7

2 F 6.3 10.6 11.1

3 F 9.5 13.1 13.8

4 M 11.5 13.4 12.9

Each row has 3 observations

Not Tidy

Converting to tidy data

Not tidy:

subject sex cond1 cond2 cond3
1       M   7.9   12.3  10.7
2       F   6.3   10.6  11.1
3       F   9.5   13.1  13.8
4       M   11.5  13.4  12.9

Tidy:

subject sex condition value
1       M   cond1     7.9
1       M   cond2     12.3
1       M   cond3     10.7
2       F   cond1     6.3
2       F   cond2     10.6
2       F   cond3     11.1
3       F   cond1     9.5
3       F   cond2     13.1
3       F   cond3     13.8
4       M   cond1     11.5
4       M   cond2     13.4
4       M   cond3     12.9

• filter: Keep rows

• select: Keep columns

• mutate: Add new columns

• arrange: Sort rows

• summarise: Reduce variables

Filter: get a subset of rows

# Traditional R
mpg[mpg$hwy > 30, ]

# dplyr
filter(mpg, hwy > 30)

Filter: get a subset of rows

# AND
filter(mpg, hwy > 30, class == "compact")
filter(mpg, hwy > 30 & class == "compact")

# OR
filter(mpg, hwy > 30 | class == "compact")
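Bracket subsetting and filter() agree on mpg, but they differ when the condition evaluates to NA. The sketch below uses a hypothetical three-row data frame (toy, not from the slides) to show the difference.

```r
library(dplyr)

# Hypothetical toy data with one missing hwy value.
toy <- data.frame(model = c("a", "b", "c"), hwy = c(35, NA, 28))

# Base R keeps a row of NAs wherever the condition is NA.
base_rows <- toy[toy$hwy > 30, ]

# filter() keeps only the rows where the condition is TRUE.
dplyr_rows <- filter(toy, hwy > 30)

nrow(base_rows)   # 2: the matching row plus an all-NA row
nrow(dplyr_rows)  # 1
```

This is worth keeping in mind when translating older bracket-based code to dplyr, since row counts can change if the data contain missing values.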

Piping with %>%

filter(mpg, hwy > 30)

mpg %>% filter(hwy > 30)

select(filter(mpg, hwy > 30), model, hwy, class)

mpg %>% filter(hwy > 30) %>% select(model, hwy, class)

mpg %>% filter(hwy > 30) %>% select(model, hwy, class) %>% View()
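The pipe passes the value on its left as the first argument of the call on its right, so x %>% f(y) is another way of writing f(x, y). A minimal sketch checking that the nested and piped forms give the same result (the variable names nested and piped are illustrative):

```r
library(dplyr)
library(ggplot2)  # loaded only for the mpg data set

# Nested calls are read from the inside out.
nested <- select(filter(mpg, hwy > 30), model, hwy, class)

# Piped calls are read left to right, in the order the steps run.
piped <- mpg %>%
  filter(hwy > 30) %>%
  select(model, hwy, class)

identical(nested, piped)  # TRUE
```

Recent versions of R (4.1 and later) also provide a built-in pipe, |>, which works the same way for simple cases like these.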

Select: get a subset of columns

# Traditional R
mpg[, c("model", "displ", "cyl", "drv", "class", "hwy")]

# dplyr
select(mpg, model, displ, cyl, drv, class, hwy)

# A minus sign drops columns instead of keeping them
select(mpg, -manufacturer, -fl)

Mutate: add new columns

# Traditional R
mpg$avg <- (mpg$cty + mpg$hwy)/2

# dplyr
mpg %>% mutate(avg = (cty+hwy)/2)

mpg %>% mutate(
  avg = (cty+hwy)/2,
  ratio = hwy/cty
)

Arrange: sort rows

# Traditional R
mpg[order(mpg$hwy), ]

# dplyr
arrange(mpg, hwy)
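The slide shows only an ascending sort on one column. As a minimal sketch of two common variations, dplyr's desc() reverses the order and extra columns break ties; the variable names by_hwy and by_class are illustrative.

```r
library(dplyr)
library(ggplot2)  # loaded only for the mpg data set

# desc() reverses the sort, so the highest hwy values come first.
by_hwy <- arrange(mpg, desc(hwy))

# Additional columns break ties: sort by class, then by hwy within each class.
by_class <- arrange(mpg, class, desc(hwy))

head(by_hwy)
```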

Summarise: reduce variables

# Traditional R
mean(mpg$hwy)
sd(mpg$hwy)

# dplyr
summarise(mpg, hwy_m = mean(hwy))

summarise(mpg,
  hwy_m = mean(hwy),
  hwy_sd = sd(hwy),
  cty_m = mean(cty),
  cty_sd = sd(cty)
)

summarise = summarize (dplyr accepts both spellings)

Group operations

Why is this important?

Summarise

subject sex condition value

1 M cond1 7.9

1 M cond2 12.3

1 M cond3 10.7

2 F cond1 6.3

2 F cond2 10.6

2 F cond3 11.1

3 F cond1 9.5

3 F cond2 13.1

3 F cond3 13.8

4 M cond1 11.5

4 M cond2 13.4

4 M cond3 12.9

data %>% summarise(value = mean(value))

Group-wise summarise

subject sex condition value

1 M cond1 7.9

1 M cond2 12.3

1 M cond3 10.7

2 F cond1 6.3

2 F cond2 10.6

2 F cond3 11.1

3 F cond1 9.5

3 F cond2 13.1

3 F cond3 13.8

4 M cond1 11.5

4 M cond2 13.4

4 M cond3 12.9

subject value

1 10.3

2 9.3

3 12.1

4 12.6

data %>% group_by(subject) %>% summarise(value = mean(value))

Group-wise summarise

subject sex condition value

1 M cond1 7.9

1 M cond2 12.3

1 M cond3 10.7

2 F cond1 6.3

2 F cond2 10.6

2 F cond3 11.1

3 F cond1 9.5

3 F cond2 13.1

3 F cond3 13.8

4 M cond1 11.5

4 M cond2 13.4

4 M cond3 12.9

sex condition value

F cond1 7.9

F cond2 11.9

F cond3 12.5

M cond1 9.7

M cond2 12.9

M cond3 11.8

data %>% group_by(sex, condition) %>% summarise(value = mean(value))
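The summaries on these slides can be reproduced with a small script. The sketch below assumes data is a data frame holding the twelve tidy rows shown on the slides; the result names by_subject and by_sex_cond are illustrative.

```r
library(dplyr)

# Assumed reconstruction of `data` from the tidy table on the slides.
data <- data.frame(
  subject   = rep(1:4, each = 3),
  sex       = rep(c("M", "F", "F", "M"), each = 3),
  condition = rep(c("cond1", "cond2", "cond3"), times = 4),
  value     = c(7.9, 12.3, 10.7, 6.3, 10.6, 11.1,
                9.5, 13.1, 13.8, 11.5, 13.4, 12.9)
)

# Without groups: a single row holding the overall mean.
data %>% summarise(value = mean(value))

# One row per subject.
by_subject <- data %>%
  group_by(subject) %>%
  summarise(value = mean(value))

# One row per sex/condition pair; .groups = "drop" returns an ungrouped result.
by_sex_cond <- data %>%
  group_by(sex, condition) %>%
  summarise(value = mean(value), .groups = "drop")
```

The same summarise() call gives one, four, or six rows depending only on the grouping that precedes it, which is why group_by() is introduced as its own step.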

Mutate

subject sex condition value

1 M cond1 7.9

1 M cond2 12.3

1 M cond3 10.7

2 F cond1 6.3

2 F cond2 10.6

2 F cond3 11.1

3 F cond1 9.5

3 F cond2 13.1

3 F cond3 13.8

4 M cond1 11.5

4 M cond2 13.4

4 M cond3 12.9

data %>% mutate(norm = value - mean(value))

subject sex condition value norm

1 M cond1 7.9 -3.2

1 M cond2 12.3 1.2

1 M cond3 10.7 -0.4

2 F cond1 6.3 -4.8

2 F cond2 10.6 -0.5

2 F cond3 11.1 0

3 F cond1 9.5 -1.6

3 F cond2 13.1 2

3 F cond3 13.8 2.7

4 M cond1 11.5 0.4

4 M cond2 13.4 2.3

4 M cond3 12.9 1.8

Group-wise mutate

subject sex condition value

1 M cond1 7.9

1 M cond2 12.3

1 M cond3 10.7

2 F cond1 6.3

2 F cond2 10.6

2 F cond3 11.1

3 F cond1 9.5

3 F cond2 13.1

3 F cond3 13.8

4 M cond1 11.5

4 M cond2 13.4

4 M cond3 12.9

data %>% group_by(subject) %>% mutate(norm = value - mean(value))

subject sex condition value norm

1 M cond1 7.9 -2.4

1 M cond2 12.3 2

1 M cond3 10.7 0.4

2 F cond1 6.3 -3

2 F cond2 10.6 1.3

2 F cond3 11.1 1.8

3 F cond1 9.5 -2.6

3 F cond2 13.1 1

3 F cond3 13.8 1.7

4 M cond1 11.5 -1.1

4 M cond2 13.4 0.8

4 M cond3 12.9 0.3
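A minimal sketch of the group-wise mutate, again assuming data holds the twelve tidy rows from the slides. The final ungroup() is an addition not shown on the slide: it removes the grouping so that later steps operate on the whole table.

```r
library(dplyr)

# Assumed reconstruction of `data` from the tidy table on the slides.
data <- data.frame(
  subject   = rep(1:4, each = 3),
  sex       = rep(c("M", "F", "F", "M"), each = 3),
  condition = rep(c("cond1", "cond2", "cond3"), times = 4),
  value     = c(7.9, 12.3, 10.7, 6.3, 10.6, 11.1,
                9.5, 13.1, 13.8, 11.5, 13.4, 12.9)
)

# Within each subject, subtract that subject's own mean from every value.
normed <- data %>%
  group_by(subject) %>%
  mutate(norm = value - mean(value)) %>%
  ungroup()

normed
```

Unlike summarise(), mutate() keeps all twelve rows; the grouping only changes which mean each row is compared against.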

Tidying data with tidyr

Not tidy:

subject sex cond1 cond2 cond3
1       M   7.9   12.3  10.7
2       F   6.3   10.6  11.1
3       F   9.5   13.1  13.8
4       M   11.5  13.4  12.9

gather(data, condition, value, cond1:cond3)

Tidy:

subject sex condition value
1       M   cond1     7.9
1       M   cond2     12.3
1       M   cond3     10.7
2       F   cond1     6.3
2       F   cond2     10.6
2       F   cond3     11.1
3       F   cond1     9.5
3       F   cond2     13.1
3       F   cond3     13.8
4       M   cond1     11.5
4       M   cond2     13.4
4       M   cond3     12.9
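A runnable sketch of the reshaping step, assuming the wide table is stored in a data frame (called wide here, a name not used on the slides). It also shows pivot_longer(), which tidyr 1.0.0 and later recommend in place of gather(); that function is not part of the original talk.

```r
library(tidyr)
library(dplyr)

# Assumed reconstruction of the wide (not tidy) table from the slides.
wide <- data.frame(
  subject = 1:4,
  sex     = c("M", "F", "F", "M"),
  cond1   = c(7.9, 6.3, 9.5, 11.5),
  cond2   = c(12.3, 10.6, 13.1, 13.4),
  cond3   = c(10.7, 11.1, 13.8, 12.9),
  stringsAsFactors = FALSE
)

# gather(), as on the slide.
long_gather <- gather(wide, condition, value, cond1:cond3)

# pivot_longer(), the newer interface for the same reshaping.
long_pivot <- pivot_longer(wide, cond1:cond3,
                           names_to = "condition", values_to = "value")

# Both give the same 12 rows; only the row order differs, so sort to compare.
long_gather <- arrange(long_gather, subject, condition)
long_pivot  <- arrange(long_pivot, subject, condition)
```

After sorting, the two results contain the same values in the same order, so either function can feed the group_by() and summarise() steps shown earlier.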

Thank you!
