CDS 101 · spring18.cds101.com/doc/class08_slides.pdf · 2018-05-03

Class 8: Data wrangling I

February 15, 2018

These slides are licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

General

Announcements

Reading for next class: R for Data Science - chapters 4 (short) and 5

Homework 1 posted on website, http://spring18.cds101.com, due Friday, February 23rd by 11:59pm

RStudio cheatsheet resource

Will post cheatsheets on website soon

What is data wrangling?

The word "wrangle"

wrangle

verb

to tend or round up (cattle, horses, or other livestock). (dictionary.com)

So, by analogy, "wrangling data" means to collect, clean, and organize digital information (tend and round up)

Also encompasses the act of transforming data as a processing step to facilitate analysis

Informal word, but data scientists will understand what you mean if you use it

Source: Digital image of a cowboy wrangling data, likelihoodlog.com, accessed September 20, 2017, http://www.likelihoodlog.com/?p=1151

ggplot2 needs clean/tidy datasets

Datasets such as mpg or rail_trail (Assignment 1) are small and nicely organized

It would be nice if all datasets were like this! ...but they're the exception, not the rule

Most raw datasets need cleaning, and this is where data scientists will spend most of their time

Source: Cleaning Big Data: Most Time-Consuming, Least Enjoyable Data Science Task, Survey Says, Digital image on forbes.com, accessed September 20, 2017, https://www.forbes.com/sites/gilpress/2016/03/23/data-preparation-most-time-consuming-least-enjoyable-data-science-task-survey-says/

The "data wrangling" pipeline

import → obtain data and get it into R

tidy → reshape rows and columns to follow the Tidy data rules

transform → clean the dataset (not the same as tidying) and "slice and dice" it for exploration and analysis
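
As a rough sketch of the three stages end to end (not the course's own demo): the CSV contents, column names, and variable names below are made up for illustration, and `pivot_longer()` assumes tidyr 1.0.0 or later.

```r
library(readr)
library(tidyr)
library(dplyr)

# Hypothetical raw file: one row per country, one column per year (not tidy)
csv_path <- tempfile(fileext = ".csv")
writeLines(c("country,2016,2017",
             "A,10,12",
             "B,7,NA"), csv_path)

# import: get the data into R ("cdd" = character, double, double columns)
raw <- read_csv(csv_path, col_types = "cdd")

# tidy: reshape so each row is one country-year observation
tidy <- raw %>%
  pivot_longer(cols = -country, names_to = "year", values_to = "cases")

# transform: clean (drop missing values, fix the year type)
clean <- tidy %>%
  filter(!is.na(cases)) %>%
  mutate(year = as.integer(year))
```

Here `clean` has one row per observed country-year; the row with the missing value is gone.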

Data wrangling in R

A few bits of R history

The first stable version of R, v1.0.0, was released on February 29, 2000.

R itself is an implementation of the S programming language, which was designed at Bell Laboratories in the mid-1970s.

Base R was built for statisticians and for doing data analysis, but not necessarily for modern Data Science

Its age and legacy bring along old implementations of data structures and abbreviated function (command) names

Source: David Smith, Over 16 years of R project history, Revolutions blog, last updated on March 4, 2016, accessed September 20, 2017, http://blog.revolutionanalytics.com/2016/03/16-years-of-r-history.html

Modernizing R with tidyverse

Over the last 3 years, Hadley Wickham, chief scientist at RStudio, has brought R into the modern era with the tidyverse.

"The tidyverse is an opinionated collection of R packages designed for data science. All packages share an underlying philosophy and common APIs." (Front page of the Tidyverse website)

In practice, this meant reducing everything to a small, core set of commands that all behave in a similar way.

Core tidyverse

ggplot2: ggplot2 is a system for declaratively creating graphics, based on The Grammar of Graphics. You provide the data, tell ggplot2 how to map variables to aesthetics, what graphical primitives to use, and it takes care of the details.

dplyr: dplyr provides a grammar of data manipulation, providing a consistent set of verbs that solve the most common data manipulation challenges.

tidyr: tidyr provides a set of functions that help you get to tidy data. Tidy data is data with a consistent form: in brief, every variable goes in a column, and every column is a variable.

Source: Tidyverse packages, tidyverse.org, accessed on September 20, 2017, https://www.tidyverse.org/packages/

Core tidyverse

readr: readr provides a fast and friendly way to read rectangular data (like csv, tsv, and fwf). It is designed to flexibly parse many types of data found in the wild, while still cleanly failing when data unexpectedly changes.

purrr: purrr enhances R's functional programming (FP) toolkit by providing a complete and consistent set of tools for working with functions and vectors. Once you master the basic concepts, purrr allows you to replace many for loops with code that is easier to write and more expressive.

tibble: tibble is a modern reimagining of the data frame, keeping what time has proven to be effective, and throwing out what it has not. Tibbles are data.frames that are lazy and surly: they do less and complain more, forcing you to confront problems earlier, typically leading to cleaner, more expressive code.

Source: Tidyverse packages, tidyverse.org, accessed on September 20, 2017, https://www.tidyverse.org/packages/

dplyr package

Get copy of dplyr demo repository

On website, spring18.cds101.com, click Materials → Class 8

Obtain a copy of the linked repository

Load in RStudio, and follow along in demos

select()

select() demo

Follow along in RStudio
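
The live demo is not part of this transcript. As a minimal sketch of what select() does, assuming a small stand-in table (the `terms` data frame and its columns are made up here, not taken from the demo repository):

```r
library(dplyr)

# Hypothetical stand-in table, one row per presidential term
terms <- data.frame(
  name       = c("Eisenhower", "Kennedy", "Johnson", "Nixon"),
  start_year = c(1953L, 1961L, 1963L, 1969L),
  party      = c("Republican", "Democratic", "Democratic", "Republican"),
  stringsAsFactors = FALSE
)

# Keep only the named columns, in the order listed
picked <- select(terms, name, party)

# A minus sign drops a column and keeps everything else
dropped <- select(terms, -start_year)
```

select() works on columns only; the number of rows is unchanged.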

%>% aside

Instead of this:

select(presidential, name, party)

We write this:

presidential %>% select(name, party)

Shows the order of transformations

Useful when we have to chain together many transformations!
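
To see why the pipe helps with chains, here is a small sketch comparing nested calls with the piped form. The `terms` table is a made-up stand-in, and arrange() (covered later in these slides) is used only to give the chain a second step.

```r
library(dplyr)

# Hypothetical stand-in table, one row per presidential term
terms <- data.frame(
  name       = c("Eisenhower", "Kennedy", "Johnson", "Nixon"),
  start_year = c(1953L, 1961L, 1963L, 1969L),
  party      = c("Republican", "Democratic", "Democratic", "Republican"),
  stringsAsFactors = FALSE
)

# Nested calls: read from the inside out
nested <- arrange(select(terms, name, start_year), desc(start_year))

# Piped calls: read top to bottom, in the order the steps run
piped <- terms %>%
  select(name, start_year) %>%
  arrange(desc(start_year))

# Both forms give the same result
same_result <- identical(nested, piped)
```

`x %>% f(y)` is just another way to write `f(x, y)`, so the two versions do the same work.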

arrange()

arrange() demo

Follow along in RStudio
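
A minimal sketch of arrange(), again on a made-up stand-in table rather than the demo's data:

```r
library(dplyr)

# Hypothetical stand-in table, one row per presidential term
terms <- data.frame(
  name       = c("Eisenhower", "Kennedy", "Johnson", "Nixon"),
  start_year = c(1953L, 1961L, 1963L, 1969L),
  party      = c("Republican", "Democratic", "Democratic", "Republican"),
  stringsAsFactors = FALSE
)

# Sort rows by party (ascending), breaking ties by most recent start year
by_party <- arrange(terms, party, desc(start_year))
```

arrange() reorders rows and leaves the columns alone; desc() flips a column to descending order.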

slice()

slice() demo

Follow along in RStudio
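
A minimal sketch of slice(), which picks rows by position. The `terms` table is a made-up stand-in, not the demo's data.

```r
library(dplyr)

# Hypothetical stand-in table, one row per presidential term
terms <- data.frame(
  name       = c("Eisenhower", "Kennedy", "Johnson", "Nixon"),
  start_year = c(1953L, 1961L, 1963L, 1969L),
  stringsAsFactors = FALSE
)

# Rows 1 and 2
first_two <- slice(terms, 1:2)

# Last row: n() is the number of rows in the table
last_row <- slice(terms, n())

# Negative positions drop rows instead of keeping them
without_first <- slice(terms, -1)
```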

filter()

Comparisons

Simple comparisons can be made using the following symbols:

> : greater than

>= : greater than or equal to

< : less than

<= : less than or equal to

!= : not equal

== : equal
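
As a quick illustration, each comparison is applied element by element and returns a logical vector. The `start_year` values are made up for the example.

```r
start_year <- c(1953L, 1961L, 1963L, 1969L)

gt  <- start_year >  1961  # FALSE FALSE  TRUE  TRUE
gte <- start_year >= 1961  # FALSE  TRUE  TRUE  TRUE
lt  <- start_year <  1961  #  TRUE FALSE FALSE FALSE
lte <- start_year <= 1961  #  TRUE  TRUE FALSE FALSE
neq <- start_year != 1961  #  TRUE FALSE  TRUE  TRUE
eq  <- start_year == 1961  # FALSE  TRUE FALSE FALSE
```

Note the double equals sign: `==` tests equality, while a single `=` assigns a value or names an argument.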

Logical operators

Source: Digital image of logical operations, R for Data Science website, accessed September 20, 2017, http://r4ds.had.co.nz/transform.html#logical-operators
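
The image itself is not in this transcript. As a stand-in sketch, R's element-wise logical operators are `&` (and), `|` (or), `!` (not), and the function `xor()` (exactly one is true). The vectors `x` and `y` are made up to cover every combination.

```r
x <- c(TRUE, TRUE, FALSE, FALSE)
y <- c(TRUE, FALSE, TRUE, FALSE)

both    <- x & y      #  TRUE FALSE FALSE FALSE
either  <- x | y      #  TRUE  TRUE  TRUE FALSE
not_x   <- !x         # FALSE FALSE  TRUE  TRUE
only_x  <- x & !y     # FALSE  TRUE FALSE FALSE
one_of  <- xor(x, y)  # FALSE  TRUE  TRUE FALSE
```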

filter() demo

Follow along in RStudio
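
A minimal sketch of filter(), combining the comparison and logical operators from the previous slides. The `terms` table is a made-up stand-in, not the demo's data.

```r
library(dplyr)

# Hypothetical stand-in table, one row per presidential term
terms <- data.frame(
  name       = c("Eisenhower", "Kennedy", "Johnson", "Nixon"),
  start_year = c(1953L, 1961L, 1963L, 1969L),
  party      = c("Republican", "Democratic", "Democratic", "Republican"),
  stringsAsFactors = FALSE
)

# Conditions separated by commas must all be true (same as joining them with &)
dems_after_1960 <- filter(terms, party == "Democratic", start_year > 1960)

# Use | when either condition is enough
rep_or_1961 <- filter(terms, party == "Republican" | start_year == 1961)
```

filter() keeps the rows where the condition is TRUE and drops the rest; the columns are unchanged.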
