dplyr: manipulating your data

Washington University in St. Louis

September 14, 2016

(Washington University in St. Louis) dplyr: manipulating your data September 14, 2016 1 / 44

1 Overview

Data manipulation as a part of the data analysis pipeline

dplyr: why it’s awesome

dplyr: how do we use it?

Some resources for dplyr

Hadley Wickham’s online tutorials: http://www.r-bloggers.com/hadley-wickhams-dplyr-tutorial-at-user-2014-part-1/

Vignettes: https://cran.rstudio.com/web/packages/dplyr/vignettes/introduction.html

RStudio Blog: http://blog.rstudio.org/2014/01/17/introducing-dplyr/

NOTE: Today’s presentation on dplyr is heavily based on materials from Hadley Wickham’s 2014 tutorial! If you would like more in-depth resources about it, I highly recommend going there first. (In other words, I take no credit for this presentation; all credit goes to RStudio and Hadley Wickham for creating an awesome tutorial on dplyr.)

the circle of data processing life

what we will cover in dplyr

World Happiness Data (aka Homework 1.csv)

Single table verbs & grouped summaries

Data pipelines

Joins (two table verbs)

Do

some bad news and some good news

Time for Some RStudio!

Now you’ll want to open up RStudio and read in the "Homework 1.csv" dataset.

Installing and Loading dplyr

If you want to install the package via the command line:

install.packages("dplyr")

Remember that you’ll also want to load the package:

library(dplyr)

World Happiness data

1 Happiness: subjective happiness

2 GDP: Log gross domestic product per capita

3 Support: subjective support from friends

4 Life: healthy life expectancy at birth

5 Freedom: satisfied or dissatisfied with freedom

6 Generosity: donated to charity in past month

7 Corruption: corruption widespread?

Load all of the data by importing the "Homework 1.csv" file.

The 5 most important verbs in dplyr

1 filter: keep rows matching criteria

2 select: pick columns by name

3 arrange: reorder rows

4 mutate: add new variables

5 summarise: reduce variables to values

Structure

1 First argument is a data frame

2 Subsequent arguments say what to do with data frame

3 Always return a data frame

4 (Never modify in place; you’ll want to assign the output data frame to an object)
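The last point is easy to trip over, so here is a minimal sketch of what "never modify in place" means in practice. It assumes dplyr is installed and uses a small toy data frame; the name blue_rows is made up for illustration.

```r
library(dplyr)

# Toy data frame for illustration
df <- data.frame(
  color = c("blue", "black", "blue", "blue", "black"),
  value = 1:5
)

# Calling a verb without assigning only prints the result;
# df itself is left untouched.
filter(df, color == "blue")
nrow(df)         # df still has all 5 rows

# To keep the result, assign it to an object.
blue_rows <- filter(df, color == "blue")
nrow(blue_rows)  # 3 rows
```

The same pattern applies to every verb in this presentation: the input is left alone and a new data frame comes back.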

A simple example

df <- data.frame(
  color = c("blue", "black", "blue", "blue", "black"),
  value = 1:5
)

Filter the rows that are blue

filter(df, color == "blue")

Filter based on certain values

filter(df, value %in% c(1,4))

Some more boolean operators
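The operator table from this slide did not survive in the transcript. As a rough stand-in, the sketch below shows the base R logical operators that are commonly combined inside filter(): & (and), | (or), != (not equal), and xor(). The object names are made up for illustration.

```r
library(dplyr)

df <- data.frame(
  color = c("blue", "black", "blue", "blue", "black"),
  value = 1:5
)

# AND: both conditions must be TRUE
both <- filter(df, color == "blue" & value > 1)

# OR: at least one condition must be TRUE
either <- filter(df, color == "black" | value == 1)

# NOT EQUAL: everything except blue
not_blue <- filter(df, color != "blue")

# XOR: exactly one of the two conditions is TRUE
exactly_one <- filter(df, xor(color == "blue", value > 3))

both$value         # 3 4
either$value       # 1 2 5
not_blue$value     # 2 5
exactly_one$value  # 1 3 5
```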

Data Exercise: Find all countries...

1 That begin with J (Japan and Jordan)

2 Classified as World 1

3 With Life between 60 and 70

4 That are both World 1 and Life is between 60 and 70

5 Where Corruption was less than Generosity

Data Exercise: Find all countries... (solutions)

1 J <- filter(data, Country == "Japan" | Country == "Jordan")

2 W1 <- filter(data, World == 1)

3 life <- filter(data, Life > 60 & Life < 70)

4 lifeW1 <- filter(data, Life > 60 & Life < 70 & World == 1)

5 cg <- filter(data, Corruption < Generosity)

Select

With the select() function, you can pick the variables that you are most interested in. For example, you can treat names of variables like positions.

select(df, color)

select(df, -color)
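To make "names like positions" concrete, here is a small sketch on the toy data frame, assuming dplyr is loaded. Column numbers, name ranges with `:`, and negative selection all work inside select(); the object names are made up for illustration.

```r
library(dplyr)

df <- data.frame(
  color = c("blue", "black", "blue", "blue", "black"),
  value = 1:5
)

# By position: column 1 is color
by_position <- select(df, 1)

# By range of names: everything from color through value
by_range <- select(df, color:value)

# By exclusion: everything except value
without_value <- select(df, -value)

names(by_position)    # "color"
names(by_range)       # "color" "value"
names(without_value)  # "color"
```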

Your Turn

Read the help for select(). What other ways can you select variables?

Write down (in R) three ways that you can select the two delay variables in your flight data.

5 ways to select your data

Many ways, same delays

select(data, c(Happiness, GDP, Support))

select(data, starts_with("G"))

select(data, contains("G"))

Arrange

The purpose of the arrange() function is to change the order of your rows.

arrange(df, color)

arrange(df, desc(color))
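A quick sketch of what these two calls return on the toy data frame, assuming dplyr is loaded. The object names ascending and descending are made up for illustration.

```r
library(dplyr)

df <- data.frame(
  color = c("blue", "black", "blue", "blue", "black"),
  value = 1:5
)

# Ascending (alphabetical) order: black rows come before blue rows
ascending <- arrange(df, color)

# desc() reverses the order: blue rows come before black rows
descending <- arrange(df, desc(color))

as.character(ascending$color)   # "black" "black" "blue" "blue" "blue"
as.character(descending$color)  # "blue" "blue" "blue" "black" "black"
```

Note that desc() wraps a single column, so sorting several columns in descending order means wrapping each one separately.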

Your Turn

Order the dataset by Happiness and GDP.

Which countries were happiest?

If we switch the order to GDP and Happiness, what countries are least happy?

If we order by descending Happiness and GDP, what happens?

Arrange Away

arrange(data, Happiness, GDP)

arrange(data, GDP, Happiness)

arrange(data, desc(Happiness), desc(GDP))

Mutate

mutate() will allow you to add new variables as a function of existing variables.

mutate(df, double = 2 * value)

Mutate with compound statements

mutate() will also allow you to perform additional transformations on newly created variables. Neat!

mutate(df, double = 2 * value, quadruple = 2 * double)
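As a small sketch of the compound form on the toy data frame (assuming dplyr is loaded; the object name with_new is made up for illustration), the second new variable is computed from the first within the same call:

```r
library(dplyr)

df <- data.frame(
  color = c("blue", "black", "blue", "blue", "black"),
  value = 1:5
)

# quadruple is built from double, which is created in the same call
with_new <- mutate(df, double = 2 * value, quadruple = 2 * double)

with_new$double     # 2 4 6 8 10
with_new$quadruple  # 4 8 12 16 20

# The original data frame is unchanged
names(df)           # "color" "value"
```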

Your Turn

Reverse score the Corruption variable (like in the homework) and add it to your data frame.

Standardize your new corruption variable and add that to your data frame.

(Hint: you may need to use select() or View() to see your new variable.)

Mutate

data <- mutate(data, Corruption_r = Corruption * -1)

arrange(data, desc(Corruption_r))

data <- mutate(data, Corruption_z = scale(Corruption_r, center = TRUE, scale = TRUE))

Summarise

summarise() will give you a 1-row data frame. This is not particularly useful on its own.

summarise(df, total = sum(value))

Group, then summarise

It is much more useful to group your data and then summarise it.

by_color <- group_by(df, color)

summarise(by_color, total = sum(value))
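Here is a sketch of the difference grouping makes, run on the toy data frame with dplyr loaded. The object names overall and totals are made up for illustration.

```r
library(dplyr)

df <- data.frame(
  color = c("blue", "black", "blue", "blue", "black"),
  value = 1:5
)

# Ungrouped: one row for the whole data frame
overall <- summarise(df, total = sum(value))
overall$total  # 15

# Grouped: one row per color
by_color <- group_by(df, color)
totals <- summarise(by_color, total = sum(value))

as.character(totals$color)  # "black" "blue"
totals$total                # 7 8
```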

Grouping the World Happiness data

by_world <- group_by(data, World)

Summary functions
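The list of functions on this slide is not recoverable from the transcript. As a hedged sketch, these are some functions commonly used inside summarise(): base R's mean(), sd(), min(), and max(), plus dplyr's n() for the group size. The object name stats is made up for illustration.

```r
library(dplyr)

df <- data.frame(
  color = c("blue", "black", "blue", "blue", "black"),
  value = 1:5
)

stats <- summarise(
  group_by(df, color),
  n = n(),                # number of rows in each group
  mean = mean(value),     # group mean
  sd = sd(value),         # group standard deviation
  smallest = min(value),  # group minimum
  largest = max(value)    # group maximum
)

stats$n     # 2 3 (black, blue)
stats$mean  # 3.5 for black, 8/3 for blue
```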

Your Turn!

Now that you understand the group_by() function and summarise() function, how might we want to summarise the GDP by World? (There are probably many ways to do this.)

What are the average and standard deviation of GDP when you group by World?

Group by World and summarise GDP

by_world <- group_by(data, World)

GDP_by_World <- summarise(by_world, mean = mean(GDP, na.rm = TRUE), sd = sd(GDP, na.rm = TRUE))

Data pipelines

In real data manipulation, you’re probably not going to just use one verb, but you’re going to use multiple verbs at the same time. This is where you’ll want to use data pipelines, which link a bunch of functions into readable code. So instead of this...

Data pipelines

...you can have this!
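The before and after code from these two slides is missing from the transcript, so the following is only a sketch of the idea on the toy data frame, not the original example. It assumes dplyr is loaded, which makes the %>% pipe operator available; the object names nested and piped are made up for illustration.

```r
library(dplyr)

df <- data.frame(
  color = c("blue", "black", "blue", "blue", "black"),
  value = 1:5
)

# Nested calls: has to be read from the inside out
nested <- arrange(
  summarise(
    group_by(filter(df, value > 2), color),
    total = sum(value)
  ),
  desc(total)
)

# Piped: reads top to bottom, one verb per line
piped <- df %>%
  filter(value > 2) %>%
  group_by(color) %>%
  summarise(total = sum(value)) %>%
  arrange(desc(total))

piped$total  # 7 5 (blue, then black)
```

The pipe passes the result on its left as the first argument of the function on its right, which is why it fits verbs whose first argument is always a data frame.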

Joining datasets

Sometimes, you will want to join two separate datasets, as in the example below, where you want to join two data frames.

Create two dataframes

x <- data.frame(
  name = c("John", "Paul", "George", "Ringo", "Stuart", "Pete"),
  instrument = c("guitar", "bass", "guitar", "drums", "bass", "drums")
)

y <- data.frame(
  name = c("John", "Paul", "George", "Ringo", "Brian"),
  band = c("TRUE", "TRUE", "TRUE", "TRUE", "FALSE")
)

inner_join()

Include only rows in both x and y.

left_join()

Include all of x, and matching rows in y.

semi_join()

Include only rows of x that match y

anti_join()

Include only rows of x that DON’T match y

Summary of all the join functions
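The summary figure is not in the transcript, so here is a sketch that runs the four joins on the two data frames and checks the row counts. It assumes dplyr is loaded; stringsAsFactors = FALSE and the explicit by = "name" are additions made here so the example behaves the same across R versions.

```r
library(dplyr)

x <- data.frame(
  name = c("John", "Paul", "George", "Ringo", "Stuart", "Pete"),
  instrument = c("guitar", "bass", "guitar", "drums", "bass", "drums"),
  stringsAsFactors = FALSE
)

y <- data.frame(
  name = c("John", "Paul", "George", "Ringo", "Brian"),
  band = c("TRUE", "TRUE", "TRUE", "TRUE", "FALSE"),
  stringsAsFactors = FALSE
)

# Rows present in both x and y, with columns from both
inner <- inner_join(x, y, by = "name")

# All rows of x; band is NA where y has no match
left <- left_join(x, y, by = "name")

# Rows of x that have a match in y; only x's columns are kept
semi <- semi_join(x, y, by = "name")

# Rows of x that have no match in y
anti <- anti_join(x, y, by = "name")

nrow(inner)  # 4
nrow(left)   # 6
nrow(semi)   # 4
anti$name    # "Stuart" "Pete"
```

inner_join() and left_join() add columns from y, while semi_join() and anti_join() only filter the rows of x.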

Do function

In the case where none of these functions can do what you want to do to manipulate the data, you can always use the do() function. It is slower, but more general purpose, and is similar to ddply() and dlply() from the plyr package, if you have used those functions.
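As a minimal sketch of do(), the call below applies an arbitrary function, head(), to each group of the toy data frame. Inside do(), the dot stands for the current group's data. The object name first_rows is made up for illustration; note that recent dplyr releases mark do() as superseded, so treat this as an illustration of the idea.

```r
library(dplyr)

df <- data.frame(
  color = c("blue", "black", "blue", "blue", "black"),
  value = 1:5
)

# For each color, keep the first row of that group
first_rows <- df %>%
  group_by(color) %>%
  do(head(., 1))

first_rows$value  # 2 1 (first black row, first blue row)
```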
