[Event series] Data exploration with modern R

Posted on 21-Apr-2017

Exploring data with modern R

Winston Chang, RStudio

2016-12-21

https://hea-www.harvard.edu/~fine/Observatory/women.html

Modern R

A brief history of R

• In the beginning, there was S, developed at Bell Labs in the 1970s.

• S was owned and licensed by AT&T.

• In the 1990s, two professors from New Zealand created a free, open-source reimplementation of S, called R.

• Many of the unusual features of R exist because they came from S.

• R itself is somewhat different from S and has a very flexible syntax.

The tidyverse

install.packages("tidyverse")
# Automatically installs ggplot2, dplyr, tidyr, and others.

library(tidyverse)
tidyverse_update()
# Update all tidyverse packages to the latest version.

Getting started

Looking at data with R

faithful

head(faithful)

str(faithful)

View(faithful) # In RStudio

[Figure: scatter plot of the faithful data; x-axis: eruptions, y-axis: waiting]

library(ggplot2)

ggplot(data=faithful, mapping=aes(x=eruptions, y=waiting)) + geom_point()

# More concisely:
ggplot(faithful, aes(eruptions, waiting)) + geom_point()

[Figure: two histograms of eruptions from the faithful data; x-axis: eruptions, y-axis: count]

ggplot(faithful, aes(x=eruptions)) + geom_histogram()

ggplot(faithful, aes(x=eruptions)) + geom_histogram(binwidth=.25)
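The two calls above differ only in how the bins are chosen. As a minimal sketch, assuming ggplot2 is installed, the block below compares the default binning, a fixed binwidth, and a fixed number of bins. The object names (p_default, p_width, p_bins) and the value bins = 15 are illustrative choices, not from the slides.

```r
library(ggplot2)

# Default binning: ggplot2 uses 30 bins and prints a message asking you to
# choose a value yourself
p_default <- ggplot(faithful, aes(x = eruptions)) + geom_histogram()

# binwidth fixes the width of each bin in data units (minutes of eruption)
p_width <- ggplot(faithful, aes(x = eruptions)) + geom_histogram(binwidth = 0.25)

# bins fixes the number of bins instead of their width
p_bins <- ggplot(faithful, aes(x = eruptions)) + geom_histogram(bins = 15)

# layer_data() returns the computed bins, one row per bar; the counts always
# add up to the number of observations, whatever the bin size
sum(layer_data(p_width)$count)
```

Printing any of the three objects (for example p_width) draws the plot.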

Your turn

Inspect the diamonds data set. With diamonds, make a histogram of the carat variable. Experiment with different bin sizes. What patterns do you see?

Inspect the mpg data set. With mpg, make a scatter plot showing the relationship between displ and hwy.

[Figure: histogram of carat from the diamonds data; x-axis: carat, y-axis: count]

ggplot(diamonds, aes(x=carat)) + geom_histogram()

ggplot(diamonds, aes(x=carat)) + geom_histogram(binwidth=0.3)

[Figure: two histograms of carat from the diamonds data with different bin widths; x-axis: carat, y-axis: count]

ggplot(diamonds, aes(x=carat)) + geom_histogram(binwidth=0.25)

ggplot(diamonds, aes(x=carat)) + geom_histogram(binwidth=0.01)

[Figure: histogram of carat from the diamonds data with narrow bins; x-axis: carat, y-axis: count]

ggplot(mpg, aes(x=displ, y=hwy)) + geom_point()

[Figure: scatter plot of the mpg data; x-axis: displ, y-axis: hwy]

ggplot(mpg, aes(x=displ, y=hwy)) + geom_point() + geom_smooth(method=lm)

[Figure: scatter plot of the mpg data with a fitted linear model line; x-axis: displ, y-axis: hwy]

head(mpg)

str(mpg)

View(mpg)

ggplot(mpg, aes(x=displ, y=hwy, color=drv)) + geom_point()

[Figure: scatter plot of hwy against displ, colored by drv (legend: 4, f, r)]

[Figure: scatter plot of hwy against displ, colored by class (legend: 2seater, compact, midsize, minivan, pickup, subcompact, suv)]

ggplot(mpg, aes(x=displ, y=hwy, color=class)) + geom_point()

Your turn

What happens if you use shape instead of color?

Run ?geom_smooth to see the documentation. Then remove the confidence region from the model line. What happens if you add a model line and map a variable to color?
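One possible set of answers to the exercise, offered as a sketch rather than the speaker's own solution. It uses only documented ggplot2 arguments (shape inside aes(), and se = FALSE in geom_smooth()); the object names are illustrative.

```r
library(ggplot2)

# Shape instead of color: each level of drv gets its own point shape
p_shape <- ggplot(mpg, aes(x = displ, y = hwy, shape = drv)) + geom_point()

# se = FALSE removes the confidence region around the model line
p_line <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  geom_smooth(method = lm, se = FALSE)

# When color is mapped in ggplot(), geom_smooth() inherits the mapping,
# so a separate line is fitted for each drv group
p_groups <- ggplot(mpg, aes(x = displ, y = hwy, color = drv)) +
  geom_point() +
  geom_smooth(method = lm, se = FALSE)

# Number of fitted lines in the smooth layer (layer 2) of each plot
length(unique(layer_data(p_line, 2)$group))
length(unique(layer_data(p_groups, 2)$group))
```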

Faceting

ggplot(mpg, aes(x=displ, y=hwy)) + geom_point() + facet_wrap(~class)

[Figure: scatter plots of hwy against displ, one panel per class (2seater, compact, midsize, minivan, pickup, subcompact, suv)]

[Figure: scatter plots of hwy against displ, one column of panels per cyl value (4, 5, 6, 8)]

ggplot(mpg, aes(x=displ, y=hwy)) + geom_point() + facet_grid(. ~ cyl)

[Figure: scatter plots of hwy against displ, one row of panels per drv value (4, f, r)]

ggplot(mpg, aes(x=displ, y=hwy)) + geom_point() + facet_grid(drv ~ .)

[Figure: grid of scatter plots of hwy against displ; rows: drv (4, f, r), columns: cyl (4, 5, 6, 8)]

ggplot(mpg, aes(x=displ, y=hwy)) + geom_point() + facet_grid(drv ~ cyl)

facet_grid(rows ~ columns): here drv defines the rows and cyl defines the columns.

Your turn

Try faceting with a histogram
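A possible starting point for this exercise, as a sketch: the choice of hwy as the variable, drv as the faceting variable, and binwidth = 2 are assumptions made here for illustration.

```r
library(ggplot2)

# One histogram of highway mileage per drive type
p_facet_hist <- ggplot(mpg, aes(x = hwy)) +
  geom_histogram(binwidth = 2) +
  facet_wrap(~drv)

# Each bin in the computed data belongs to one panel; there is one panel per
# level of drv
length(unique(layer_data(p_facet_hist)$PANEL))
```

Swapping facet_wrap(~drv) for facet_grid(drv ~ .) stacks the panels vertically, which makes the distributions easier to compare along a shared x-axis.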

ggplot2 concepts

Geoms

Points

Lines

Bars

Error bars

Box plot

Aesthetics

Y position

X position

Color

Size

Aesthetics

Mapping data values to aesthetics

var1 var2 var3
2    2    5
3    4    0
5    8    4
7    5    1

[Figure: two scatter plots of var2 against var1; in the second, point color is mapped to var3 (legend from 0 to 5)]

ggplot(dat, aes(x=var1, y=var2)) + geom_point()

ggplot(dat, aes(x=var1, y=var2, color=var3)) + geom_point()
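The slides do not show how dat was created. A minimal sketch, assuming it is an ordinary data frame holding the four rows of the table:

```r
library(ggplot2)

# Assumed construction of dat from the values shown in the table
dat <- data.frame(
  var1 = c(2, 3, 5, 7),
  var2 = c(2, 4, 8, 5),
  var3 = c(5, 0, 4, 1)
)

# x and y positions are mapped to var1 and var2
p_xy <- ggplot(dat, aes(x = var1, y = var2)) + geom_point()

# Adding color = var3 inside aes() maps a third variable to point color
p_color <- ggplot(dat, aes(x = var1, y = var2, color = var3)) + geom_point()

# One distinct color per distinct var3 value
length(unique(layer_data(p_color)$colour))
```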

Setting data values to aesthetics

var1 var2 var3
2    2    5
3    4    0
5    8    4
7    5    1

[Figure: scatter plot of var2 against var1]

ggplot(dat, aes(x=var1, y=var2)) + geom_point(color="red")

ggplot(dat, aes(x=var1, y=var2)) + geom_point(color="red", size=6)
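The difference between setting and mapping is easy to get wrong, so here is a small sketch of what happens when a fixed value is placed inside aes() by mistake. The data frame dat is reconstructed from the table as an assumption, and the object names are illustrative.

```r
library(ggplot2)

dat <- data.frame(
  var1 = c(2, 3, 5, 7),
  var2 = c(2, 4, 8, 5),
  var3 = c(5, 0, 4, 1)
)

# Setting: color is a fixed value, passed outside aes()
p_set <- ggplot(dat, aes(x = var1, y = var2)) + geom_point(color = "red")

# Mapping by mistake: inside aes(), "red" is treated as a data value with a
# single level, so ggplot2 assigns it a color from its default discrete
# palette and adds a legend
p_map <- ggplot(dat, aes(x = var1, y = var2)) + geom_point(aes(color = "red"))

# The color actually used for the points in each plot
unique(layer_data(p_set)$colour)
unique(layer_data(p_map)$colour)
```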

[Figure: scatter plot of var2 against var1]

Different geoms

[Figure: var2 against var1 drawn with points]

ggplot(dat, aes(x=var1, y=var2)) + geom_point()

[Figure: var2 against var1 drawn with a line]

ggplot(dat, aes(x=var1, y=var2)) + geom_line()

[Figure: var2 against var1 drawn with bars]

ggplot(dat, aes(x=var1, y=var2)) + geom_bar(stat="identity")

Using multiple geoms

[Figure: var2 against var1 drawn with points and a line]

ggplot(dat, aes(x=var1, y=var2)) + geom_point() + geom_line()

# Equivalent to
ggplot(dat) + geom_point(aes(x=var1, y=var2)) + geom_line(aes(x=var1, y=var2))

ggplot() + geom_point(aes(x=var1, y=var2), data=dat) + geom_line(aes(x=var1, y=var2), data=dat)

Default data

Default mapping

Override defaults in each geom
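As a sketch of overriding the defaults, the block below keeps the default data and mapping in ggplot() but gives geom_line() its own data and geom_point() an extra mapping. The data frame dat is reconstructed from the earlier table as an assumption, and the subset used for the line is an arbitrary illustration.

```r
library(ggplot2)

dat <- data.frame(
  var1 = c(2, 3, 5, 7),
  var2 = c(2, 4, 8, 5),
  var3 = c(5, 0, 4, 1)
)

p_override <- ggplot(dat, aes(x = var1, y = var2)) +
  # This layer replaces the default data with a subset (rows with var1 < 6)
  geom_line(data = dat[dat$var1 < 6, ]) +
  # This layer keeps the default data and x/y mapping, and adds color
  geom_point(aes(color = var3))

# Rows drawn by each layer: the line uses the subset, the points use all of dat
nrow(layer_data(p_override, 1))
nrow(layer_data(p_override, 2))
```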

        Discrete                   Continuous
Color   Rainbow of colors          Gradient from light blue to dark blue
Size    Discrete size steps        Linear mapping between radius and value
Shape   Different shape for each   Shouldn’t work

Mapping discrete variables

var1 var2 var3
A    G1   5
B    G0   0
A    G2   4
B    G1   1
A    G0   6
B    G2   3

[Figure: two scatter plots of var3 against var1 (A, B); in the second, points are colored by var2 (legend: G0, G1, G2)]

ggplot(dat2, aes(x=var1, y=var3)) + geom_point()

ggplot(dat2, aes(x=var1, y=var3, color=var2)) + geom_point()
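The slides do not show how dat2 was built. The sketch below assumes a plain data frame with the six rows of the table, and uses it to show how the type of a variable decides between a discrete and a continuous scale. Wrapping a numeric variable in factor() is a common way to force discrete treatment.

```r
library(ggplot2)

# Assumed construction of dat2 from the values shown in the table
dat2 <- data.frame(
  var1 = c("A", "B", "A", "B", "A", "B"),
  var2 = c("G1", "G0", "G2", "G1", "G0", "G2"),
  var3 = c(5, 0, 4, 1, 6, 3)
)

# var2 is a character column, so color uses a discrete scale
p_discrete <- ggplot(dat2, aes(x = var1, y = var3, color = var2)) + geom_point()

# var3 is numeric, so color uses a continuous gradient
p_continuous <- ggplot(dat2, aes(x = var1, y = var3, color = var3)) + geom_point()

# factor() turns the numeric values into discrete levels
p_forced <- ggplot(dat2, aes(x = var1, y = var3, color = factor(var3))) + geom_point()

# Three groups in var2 give three distinct colors
length(unique(layer_data(p_discrete)$colour))
```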

Data wrangling with modern R

Tidyverse = Tidy + universe

Source: https://www.flickr.com/photos/rubbermaid/7203340384

Source: http://hubblesite.org/newscenter/archive/releases/2014/27/image/a/

Tibbles

faithful

as.tbl(faithful)
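as.tbl() was current when this talk was given, but more recent dplyr releases deprecate it in favor of as_tibble(). A minimal sketch of the same conversion, assuming the tibble package is installed; the name faithful_tbl is illustrative.

```r
library(tibble)

# Convert the built-in data frame to a tibble
faithful_tbl <- as_tibble(faithful)

# A tibble is still a data frame, with extra classes on top
class(faithful)
class(faithful_tbl)

# Printing a tibble shows only the first rows, plus the column types
faithful_tbl
```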

Tidy data

Each variable is in a column

Each observation is in a row

Example of non-tidy data

subject sex cond1 cond2 cond3

1 M 7.9 12.3 10.7

2 F 6.3 10.6 11.1

3 F 9.5 13.1 13.8

4 M 11.5 13.4 12.9

Each row has 3 observations

Not Tidy

Converting to tidy data

Not Tidy

subject sex cond1 cond2 cond3
1       M   7.9   12.3  10.7
2       F   6.3   10.6  11.1
3       F   9.5   13.1  13.8
4       M   11.5  13.4  12.9

Tidy

subject sex condition value
1       M   cond1     7.9
1       M   cond2     12.3
1       M   cond3     10.7
2       F   cond1     6.3
2       F   cond2     10.6
2       F   cond3     11.1
3       F   cond1     9.5
3       F   cond2     13.1
3       F   cond3     13.8
4       M   cond1     11.5
4       M   cond2     13.4
4       M   cond3     12.9

• filter: Keep rows

• select: Keep columns

• mutate: Add new columns

• arrange: Sort rows

• summarise: Reduce variables

Filter: get a subset of rows

# Traditional R
mpg[mpg$hwy > 30, ]

# dplyr
filter(mpg, hwy > 30)

# AND
filter(mpg, hwy > 30, class == "compact")
filter(mpg, hwy > 30 & class == "compact")

# OR
filter(mpg, hwy > 30 | class == "compact")
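The traditional and dplyr forms return the same rows for mpg, but they treat missing values differently. A short sketch of that difference; the small data frame x and the object names are made up for illustration.

```r
library(dplyr)
library(ggplot2)  # only for the mpg data set

# Same rows either way when there are no missing values
base_rows  <- mpg[mpg$hwy > 30, ]
dplyr_rows <- filter(mpg, hwy > 30)
nrow(base_rows) == nrow(dplyr_rows)

# With a missing value, the condition evaluates to NA for that row
x <- data.frame(a = c(1, NA, 3))

# Base R indexing keeps a row of NAs where the condition is NA
na_base <- x[x$a > 2, , drop = FALSE]

# filter() keeps only rows where the condition is TRUE
na_dplyr <- filter(x, a > 2)

nrow(na_base)
nrow(na_dplyr)
```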

%>%

Piping with %>%

filter(mpg, hwy > 30)

mpg %>% filter(hwy > 30)

select(filter(mpg, hwy > 30), model, hwy, class)

mpg %>%
  filter(hwy > 30) %>%
  select(model, hwy, class)

mpg %>%
  filter(hwy > 30) %>%
  select(model, hwy, class) %>%
  View()
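The pipe passes the value on its left as the first argument of the call on its right, so x %>% f(y) means f(x, y). A small sketch checking that the nested and piped forms give the same result; the names nested and piped are illustrative.

```r
library(dplyr)
library(ggplot2)  # only for the mpg data set

# Nested calls read inside-out
nested <- select(filter(mpg, hwy > 30), model, hwy, class)

# The piped version reads top to bottom, in the order the steps happen
piped <- mpg %>%
  filter(hwy > 30) %>%
  select(model, hwy, class)

# Both forms produce the same table
all.equal(nested, piped)
```

Since R 4.1 there is also a native pipe, |>, which covers this simple first-argument case without loading a package.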

Select: get a subset of columns

# Traditional R
mpg[, c("model", "displ", "cyl", "drv", "class", "hwy")]

# dplyr
select(mpg, model, displ, cyl, drv, class, hwy)

select(mpg, -manufacturer, -fl)
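Beyond listing or dropping columns by name, select() also accepts helper functions and can rename while selecting. A brief sketch using starts_with(), which is part of dplyr's documented selection helpers; the object names and the new column name highway are illustrative.

```r
library(dplyr)
library(ggplot2)  # only for the mpg data set

# Drop two columns, keep everything else
dropped <- select(mpg, -manufacturer, -fl)

# Keep every column whose name starts with "c"
helpers <- select(mpg, starts_with("c"))

# Rename a column while selecting it (new_name = old_name)
renamed <- select(mpg, model, highway = hwy)

names(helpers)
names(renamed)
```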

Mutate: add new columns

# Traditional R
mpg$avg <- (mpg$cty + mpg$hwy)/2

# dplyr
mpg %>% mutate(avg = (cty+hwy)/2)

mpg %>%
  mutate(
    avg = (cty+hwy)/2,
    ratio = hwy/cty
  )
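One difference from the traditional form is worth spelling out: mutate() returns a new table and leaves its input untouched, so the result has to be assigned to keep it. A sketch, with mpg2 and mpg3 as illustrative names and the above_avg column invented for the example.

```r
library(dplyr)
library(ggplot2)  # only for the mpg data set

# The result must be assigned; mpg itself is not modified
mpg2 <- mpg %>%
  mutate(
    avg = (cty + hwy) / 2,
    ratio = hwy / cty
  )

"avg" %in% names(mpg)   # the original data has no avg column
"avg" %in% names(mpg2)  # the new table does

# Columns created earlier in the same mutate() call can be used right away
mpg3 <- mpg %>%
  mutate(
    avg = (cty + hwy) / 2,
    above_avg = avg > mean(avg)
  )
```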

Arrange: sort rows

# Traditional R
mpg[order(mpg$hwy), ]

# dplyr
arrange(mpg, hwy)
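arrange() sorts in ascending order by default. As a sketch, desc() reverses the order for one column, and additional columns act as tie-breakers; the object names are illustrative.

```r
library(dplyr)
library(ggplot2)  # only for the mpg data set

# Ascending order (the default)
by_hwy <- arrange(mpg, hwy)

# Descending order: wrap the column in desc()
by_hwy_desc <- arrange(mpg, desc(hwy))

# Sort by cyl first, then by descending hwy within each cyl value
by_cyl_then_hwy <- arrange(mpg, cyl, desc(hwy))

head(by_hwy_desc$hwy)
```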

Summarise: reduce variables

# Traditional R
mean(mpg$hwy)
sd(mpg$hwy)

# dplyr
summarise(mpg, hwy_m = mean(hwy))

summarise(mpg,
  hwy_m = mean(hwy),
  hwy_sd = sd(hwy),
  cty_m = mean(cty),
  cty_sd = sd(cty)
)

summarise ≠ summarize
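The transcript does not record why the two spellings are contrasted. Within dplyr, summarize() is an alias for summarise(), but another attached package can define a function with one of these names and mask it. The sketch below sidesteps the question by calling dplyr::summarise() explicitly; n() is dplyr's row-count helper, and the name hwy_stats is illustrative.

```r
library(dplyr)
library(ggplot2)  # only for the mpg data set

# The dplyr:: prefix guarantees that dplyr's version is the one being called
hwy_stats <- dplyr::summarise(
  mpg,
  n = n(),              # number of rows
  hwy_m = mean(hwy),
  hwy_sd = sd(hwy)
)

# The result is a one-row table with one column per summary
hwy_stats
```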

Group operations

Why is this important?

Summarise

subject sex condition value

1 M cond1 7.9

1 M cond2 12.3

1 M cond3 10.7

2 F cond1 6.3

2 F cond2 10.6

2 F cond3 11.1

3 F cond1 9.5

3 F cond2 13.1

3 F cond3 13.8

4 M cond1 11.5

4 M cond2 13.4

4 M cond3 12.9

value

11.1

data %>% summarise(value = mean(value))

Group-wise summarise

subject sex condition value

1 M cond1 7.9

1 M cond2 12.3

1 M cond3 10.7

2 F cond1 6.3

2 F cond2 10.6

2 F cond3 11.1

3 F cond1 9.5

3 F cond2 13.1

3 F cond3 13.8

4 M cond1 11.5

4 M cond2 13.4

4 M cond3 12.9

subject value

1 10.3

2 9.3

3 12.1

4 12.6

data %>% group_by(subject) %>% summarise(value = mean(value))

Group-wise summarise

subject sex condition value

1 M cond1 7.9

1 M cond2 12.3

1 M cond3 10.7

2 F cond1 6.3

2 F cond2 10.6

2 F cond3 11.1

3 F cond1 9.5

3 F cond2 13.1

3 F cond3 13.8

4 M cond1 11.5

4 M cond2 13.4

4 M cond3 12.9

sex condition value

F cond1 7.9

F cond2 11.9

F cond3 12.5

M cond1 9.7

M cond2 12.9

M cond3 11.8

data %>% group_by(sex, condition) %>% summarise(value = mean(value))
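To make the three summaries reproducible, the sketch below rebuilds the example table as a data frame (an assumption, since the slides never show its construction) and runs the overall and group-wise versions. The .groups argument assumes dplyr 1.0.0 or later; the result names are illustrative.

```r
library(dplyr)

# Assumed construction of the tidy example data from the table
data <- data.frame(
  subject = rep(1:4, each = 3),
  sex = rep(c("M", "F", "F", "M"), each = 3),
  condition = rep(c("cond1", "cond2", "cond3"), times = 4),
  value = c(7.9, 12.3, 10.7, 6.3, 10.6, 11.1,
            9.5, 13.1, 13.8, 11.5, 13.4, 12.9)
)

# No grouping: one mean over all 12 rows
overall <- data %>% summarise(value = mean(value))

# One mean per subject
by_subject <- data %>%
  group_by(subject) %>%
  summarise(value = mean(value))

# One mean per combination of sex and condition; .groups = "drop" returns
# an ungrouped result
by_sex_cond <- data %>%
  group_by(sex, condition) %>%
  summarise(value = mean(value), .groups = "drop")

by_sex_cond
```

summarise() returns one row per group, so the number of rows in the result tells you how many groups there were: 1, 4, and 6 here.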

Mutate

subject sex condition value

1 M cond1 7.9

1 M cond2 12.3

1 M cond3 10.7

2 F cond1 6.3

2 F cond2 10.6

2 F cond3 11.1

3 F cond1 9.5

3 F cond2 13.1

3 F cond3 13.8

4 M cond1 11.5

4 M cond2 13.4

4 M cond3 12.9

data %>% mutate(norm = value - mean(value))

subject sex condition value norm

1 M cond1 7.9 -3.2

1 M cond2 12.3 1.2

1 M cond3 10.7 -0.4

2 F cond1 6.3 -4.8

2 F cond2 10.6 -0.5

2 F cond3 11.1 0

3 F cond1 9.5 -1.6

3 F cond2 13.1 2

3 F cond3 13.8 2.7

4 M cond1 11.5 0.4

4 M cond2 13.4 2.3

4 M cond3 12.9 1.8

Group-wise mutate

subject sex condition value

1 M cond1 7.9

1 M cond2 12.3

1 M cond3 10.7

2 F cond1 6.3

2 F cond2 10.6

2 F cond3 11.1

3 F cond1 9.5

3 F cond2 13.1

3 F cond3 13.8

4 M cond1 11.5

4 M cond2 13.4

4 M cond3 12.9

data %>% group_by(subject) %>% mutate(norm = value - mean(value))

subject sex condition value norm

1 M cond1 7.9 -2.4

1 M cond2 12.3 2

1 M cond3 10.7 0.4

2 F cond1 6.3 -3

2 F cond2 10.6 1.3

2 F cond3 11.1 1.8

3 F cond1 9.5 -2.6

3 F cond2 13.1 1

3 F cond3 13.8 1.7

4 M cond1 11.5 -1.1

4 M cond2 13.4 0.8

4 M cond3 12.9 0.3
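Unlike summarise(), a grouped mutate() keeps every row and computes mean(value) separately within each group. The sketch below rebuilds the example table (an assumed construction), centers each subject's values, and checks that the centered values average to zero within each subject. The names centered and check are illustrative.

```r
library(dplyr)

# Assumed construction of the tidy example data from the table
data <- data.frame(
  subject = rep(1:4, each = 3),
  sex = rep(c("M", "F", "F", "M"), each = 3),
  condition = rep(c("cond1", "cond2", "cond3"), times = 4),
  value = c(7.9, 12.3, 10.7, 6.3, 10.6, 11.1,
            9.5, 13.1, 13.8, 11.5, 13.4, 12.9)
)

# mean(value) is computed per subject, so norm is the deviation from each
# subject's own mean; ungroup() removes the grouping afterwards
centered <- data %>%
  group_by(subject) %>%
  mutate(norm = value - mean(value)) %>%
  ungroup()

# Sanity check: within each subject the deviations average to zero
check <- centered %>%
  group_by(subject) %>%
  summarise(mean_norm = mean(norm))

check
```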

Tidying data with tidyr

Converting to tidy data

subject sex cond1 cond2 cond3

1 M 7.9 12.3 10.7

2 F 6.3 10.6 11.1

3 F 9.5 13.1 13.8

4 M 11.5 13.4 12.9

gather(data, condition, value, cond1:cond3)

subject sex condition value

1 M cond1 7.9

1 M cond2 12.3

1 M cond3 10.7

2 F cond1 6.3

2 F cond2 10.6

2 F cond3 11.1

3 F cond1 9.5

3 F cond2 13.1

3 F cond3 13.8

4 M cond1 11.5

4 M cond2 13.4

4 M cond3 12.9
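A runnable sketch of the conversion, with the wide table rebuilt as a data frame named wide (an assumption; the slide calls it data). gather() still works, though tidyr 1.0.0 and later recommend pivot_longer() for the same job, so both are shown.

```r
library(tidyr)

# Assumed construction of the non-tidy (wide) example data
wide <- data.frame(
  subject = 1:4,
  sex = c("M", "F", "F", "M"),
  cond1 = c(7.9, 6.3, 9.5, 11.5),
  cond2 = c(12.3, 10.6, 13.1, 13.4),
  cond3 = c(10.7, 11.1, 13.8, 12.9)
)

# gather(): names of the new key and value columns, then the columns to stack
long_gather <- gather(wide, condition, value, cond1:cond3)

# pivot_longer(): the same reshaping with named arguments
long_pivot <- pivot_longer(
  wide,
  cond1:cond3,
  names_to = "condition",
  values_to = "value"
)

long_pivot
```

The two results hold the same 12 rows in a different order: gather() stacks the columns one after another (all cond1 rows first), while pivot_longer() keeps each subject's rows together, as in the table on the slide.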

Thank you!