Class 8: Data wrangling I
February 15, 2018
These slides are licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
General
Announcements
Reading for next class: R for Data Science - chapters 4 (short) and 5
Homework 1 posted on website, http://spring18.cds101.com, due Friday, February 23rd by 11:59pm
RStudio cheatsheet resource
Will post cheatsheets on website soon
What is data wrangling?
The word "wrangle"
wrangle
verb
to tend or round up (cattle, horses, or other livestock).
Source: dictionary.com
So, by analogy, "wrangling data" means to collect, clean, and organize digital information (tend and round up)
Also encompasses the act of transforming data as a processing step to facilitate analysis
Informal word, but data scientists will understand what you mean if you use it
Source: Digital image of a cowboy wrangling data, Digital image on likelihoodlog.com, accessed September 20, 2017, http://www.likelihoodlog.com/?p=1151
ggplot2 needs clean/tidy datasets
Datasets such as mpg or rail_trail (Assignment 1) are small and nicely organized
It would be nice if all datasets were like this! ...but they're the exceptions to the rule
Most raw datasets need cleaning, and this is where data scientists will spend most of their time
Source: Cleaning Big Data: Most Time-Consuming, Least Enjoyable Data Science Task, Survey Says, Digital image on forbes.com, accessed September 20, 2017, https://www.forbes.com/sites/gilpress/2016/03/23/data-preparation-most-time-consuming-least-enjoyable-data-science-task-survey-says/
The "data wrangling" pipeline
import → obtain data and get it into R
tidy → reshape rows and columns to follow the Tidy data rules
transform → clean the dataset (not the same as tidying) and "slice and dice" it for exploration and analysis
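The three stages above can be sketched in a few lines of R. This is only an illustration: the CSV text, the column names (site, visitors), and the numbers are made up for the example, and it assumes reasonably recent versions of readr, tidyr, and dplyr are installed.

```r
library(readr)
library(tidyr)
library(dplyr)

# import: read rectangular data into R
# (inline text stands in for a real file; the values are invented)
csv_text <- "site,2016,2017\nnorth,120,150\nsouth,NA,90\n"
raw <- read_csv(I(csv_text), show_col_types = FALSE)

# tidy: one row per (site, year) observation instead of one column per year
tidy <- raw %>%
  pivot_longer(cols = -site, names_to = "year", values_to = "visitors")

# transform: fix the column type, drop missing values, and sort
clean <- tidy %>%
  mutate(year = as.integer(year)) %>%
  filter(!is.na(visitors)) %>%
  arrange(year, site)
```

Here raw has one column per year, tidy has one row per site and year, and clean is the version ready for plotting or analysis.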
Data wrangling in R
A few bits of R history
The first stable version of R, v1.0.0, was released on February 29, 2000.
R itself is an implementation of the S programming language, which was designed at Bell Laboratories in the mid-1970s.
Base R was built for statisticians and for doing data analysis, but not necessarily for modern Data Science
Its age and legacy bring along old implementations of data structures and abbreviated function (command) names
Source: David Smith, Over 16 years of R project history, Revolutions blog, last updated on March 4, 2016, accessed September 20, 2017, http://blog.revolutionanalytics.com/2016/03/16-years-of-r-history.html
Modernizing R with tidyverse
Over the last 3 years, Hadley Wickham, chief scientist at RStudio, has brought R into the modern era with the tidyverse.
The tidyverse is an opinionated collection of R packages designed for data science. All packages share an underlying philosophy and common APIs.
Source: Front page of the Tidyverse website
In practice, this meant reducing everything to a small, core set of commands that all behave in a similar way.
Core tidyverse
ggplot2: ggplot2 is a system for declaratively creating graphics, based on The Grammar of Graphics. You provide the data, tell ggplot2 how to map variables to aesthetics, what graphical primitives to use, and it takes care of the details.
dplyr: dplyr provides a grammar of data manipulation, providing a consistent set of verbs that solve the most common data manipulation challenges.
tidyr: tidyr provides a set of functions that help you get to tidy data. Tidy data is data with a consistent form: in brief, every variable goes in a column, and every column is a variable.
readr: readr provides a fast and friendly way to read rectangular data (like csv, tsv, and fwf). It is designed to flexibly parse many types of data found in the wild, while still cleanly failing when data unexpectedly changes.
purrr: purrr enhances R's functional programming (FP) toolkit by providing a complete and consistent set of tools for working with functions and vectors. Once you master the basic concepts, purrr allows you to replace many for loops with code that is easier to write and more expressive.
tibble: tibble is a modern re-imagining of the data frame, keeping what time has proven to be effective, and throwing out what it has not. Tibbles are data.frames that are lazy and surly: they do less and complain more forcing you to confront problems earlier, typically leading to cleaner, more expressive code.
Source: Tidyverse packages, tidyverse.org, accessed on September 20, 2017, https://www.tidyverse.org/packages/
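To make "do less and complain more" concrete, here is a small sketch comparing a base data frame with a tibble. The toy columns (name, value) are invented for the example, and it assumes the tibble package is installed.

```r
library(tibble)

# The same toy data as a base data frame and as a tibble
df <- data.frame(name = c("a", "b"), value = 1:2)
tb <- tibble(name = c("a", "b"), value = 1:2)

# Base R silently completes a partial column name: "val" matches "value"
partial_df <- df$val

# A tibble refuses to guess and returns NULL; without suppressWarnings()
# it would also warn about the unknown column
partial_tb <- suppressWarnings(tb$val)

# Base R drops a single selected column down to a plain vector,
# while a tibble always hands back a tibble
column_df <- df[, "value"]
column_tb <- tb[, "value"]
```

The tibble behaviour looks stricter, but a typo in a column name surfaces right away instead of quietly returning the wrong thing.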
dplyr package
Get a copy of the dplyr demo repository
On the website, spring18.cds101.com, click Materials → Class 8
Obtain a copy of the linked repository
Load it in RStudio and follow along with the demos
select()
select() demo
Follow along in RStudio
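The live demo is in the course repository. As a rough stand-in, here is what select() does on a tiny table. The table presidents_small is a hand-typed, hypothetical substitute for a few rows of ggplot2's presidential dataset, not the demo's actual data.

```r
library(dplyr)

# Hand-typed stand-in for a few rows of ggplot2's `presidential` dataset
# (the name `presidents_small` is hypothetical)
presidents_small <- tibble(
  name  = c("Eisenhower", "Kennedy", "Johnson", "Nixon"),
  start = as.Date(c("1953-01-20", "1961-01-20", "1963-11-22", "1969-01-20")),
  party = c("Republican", "Democratic", "Democratic", "Republican")
)

# Keep only the listed columns, in the order given
party_name <- select(presidents_small, party, name)

# Drop a column by putting a minus sign in front of it
no_start <- select(presidents_small, -start)
```

select() picks columns only; the number of rows stays the same.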
%>% aside
Instead of this:
select(presidential, name, party)
We write this:
presidential %>% select(name, party)
Shows the order of transformations
Useful when we have to chain together many transformations!
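A short sketch of why the pipe helps once there is more than one step. The table presidents_small is a hand-typed, hypothetical stand-in for a few rows of ggplot2's presidential dataset.

```r
library(dplyr)

# Hand-typed stand-in for a few rows of ggplot2's `presidential` dataset
# (the name `presidents_small` is hypothetical)
presidents_small <- tibble(
  name  = c("Eisenhower", "Kennedy", "Johnson", "Nixon"),
  start = as.Date(c("1953-01-20", "1961-01-20", "1963-11-22", "1969-01-20")),
  party = c("Republican", "Democratic", "Democratic", "Republican")
)

# Nested calls are read from the inside out
nested <- arrange(select(presidents_small, name, start), desc(start))

# Piped calls are read top to bottom, in the order the steps happen
piped <- presidents_small %>%
  select(name, start) %>%
  arrange(desc(start))

# Both spellings give the same table
same_result <- isTRUE(all.equal(nested, piped))
```

The pipe passes the result on its left as the first argument of the function on its right, so each line is one transformation.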
arrange()
arrange() demo
Follow along in RStudio
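As a rough stand-in for the live demo, here is arrange() on a tiny table. The table presidents_small is a hand-typed, hypothetical substitute for a few rows of ggplot2's presidential dataset.

```r
library(dplyr)

# Hand-typed stand-in for a few rows of ggplot2's `presidential` dataset
# (the name `presidents_small` is hypothetical)
presidents_small <- tibble(
  name  = c("Eisenhower", "Kennedy", "Johnson", "Nixon"),
  start = as.Date(c("1953-01-20", "1961-01-20", "1963-11-22", "1969-01-20")),
  party = c("Republican", "Democratic", "Democratic", "Republican")
)

# Sort rows from the most recent start date to the oldest
newest_first <- presidents_small %>% arrange(desc(start))

# Sort by party first, then break ties alphabetically by name
by_party_then_name <- presidents_small %>% arrange(party, name)
```

arrange() reorders rows only; no rows or columns are added or removed.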
slice()
slice() demo
Follow along in RStudio
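As a rough stand-in for the live demo, here is slice() on a tiny table. The table presidents_small is a hand-typed, hypothetical substitute for a few rows of ggplot2's presidential dataset.

```r
library(dplyr)

# Hand-typed stand-in for a few rows of ggplot2's `presidential` dataset
# (the name `presidents_small` is hypothetical)
presidents_small <- tibble(
  name  = c("Eisenhower", "Kennedy", "Johnson", "Nixon"),
  start = as.Date(c("1953-01-20", "1961-01-20", "1963-11-22", "1969-01-20")),
  party = c("Republican", "Democratic", "Democratic", "Republican")
)

# Pick rows by position: the first two rows
first_two <- presidents_small %>% slice(1:2)

# n() is the number of rows, so this picks the last row
last_row <- presidents_small %>% slice(n())
```

slice() selects rows by where they sit in the table, not by what they contain.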
filter()
Comparisons
Simple comparisons can be made using the following symbols:
> : greater than
>= : greater than or equal to
< : less than
<= : less than or equal to
!= : not equal
== : equal
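The comparison symbols above are what go inside filter(). A minimal sketch, using presidents_small as a hand-typed, hypothetical stand-in for a few rows of ggplot2's presidential dataset:

```r
library(dplyr)

# Hand-typed stand-in for a few rows of ggplot2's `presidential` dataset
# (the name `presidents_small` is hypothetical)
presidents_small <- tibble(
  name  = c("Eisenhower", "Kennedy", "Johnson", "Nixon"),
  start = as.Date(c("1953-01-20", "1961-01-20", "1963-11-22", "1969-01-20")),
  party = c("Republican", "Democratic", "Democratic", "Republican")
)

# == tests equality (a single = would be an argument assignment, not a test)
democrats <- presidents_small %>% filter(party == "Democratic")

# >= also works on dates
from_1963 <- presidents_small %>% filter(start >= as.Date("1963-01-01"))

# != keeps every row that does not match
not_nixon <- presidents_small %>% filter(name != "Nixon")
```

filter() keeps the rows where the comparison is TRUE and drops the rest.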
Logical operators
Source: Digital image of logical operations, R for Data Science website, accessed September 20, 2017, http://r4ds.had.co.nz/transform.html#logical-operators
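The logical operators & (and), | (or), and ! (not) combine comparisons inside filter(). A minimal sketch, using presidents_small as a hand-typed, hypothetical stand-in for a few rows of ggplot2's presidential dataset:

```r
library(dplyr)

# Hand-typed stand-in for a few rows of ggplot2's `presidential` dataset
# (the name `presidents_small` is hypothetical)
presidents_small <- tibble(
  name  = c("Eisenhower", "Kennedy", "Johnson", "Nixon"),
  start = as.Date(c("1953-01-20", "1961-01-20", "1963-11-22", "1969-01-20")),
  party = c("Republican", "Democratic", "Democratic", "Republican")
)

# & : both conditions must be TRUE
early_republicans <- presidents_small %>%
  filter(party == "Republican" & start < as.Date("1960-01-01"))

# | : at least one condition must be TRUE
democrat_or_nixon <- presidents_small %>%
  filter(party == "Democratic" | name == "Nixon")

# ! : flips TRUE and FALSE
not_republican <- presidents_small %>%
  filter(!(party == "Republican"))
```

Parentheses around each comparison are optional here, but they make longer conditions easier to read.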
filter() demo
Follow along in RStudio