
Class 8: Data wrangling I

February 15, 2018

These slides are licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

http://creativecommons.org/licenses/by-sa/4.0/

General

2 / 24

Announcements

Reading for next class: R for Data Science - chapters 4 (short) and 5

Homework 1 posted on website, http://spring18.cds101.com, due Friday, February 23rd by 11:59pm

RStudio cheatsheet resource

Will post cheatsheets on website soon

3 / 24


What is data wrangling?

4 / 24

The word "wrangle"

wrangle

verb

to tend or round up (cattle, horses, or other livestock). Source: dictionary.com, http://www.dictionary.com/browse/wrangle

So, by analogy, "wrangling data" means to collect, clean, and organize digital information (tend and round up)

Also encompasses the act of transforming data as a processing step to facilitate analysis

Informal word, but data scientists will understand what you mean if you use it

5 / 24

The word "wrangle"

Source: Digital image of a cowboy wrangling data, Digital image on likelihoodlog.com, accessed September 20, 2017, http://www.likelihoodlog.com/?p=1151

5 / 24

http://i2.wp.com/rocketsci.azurewebsites.net/wp-content/uploads/2015/07/data-wrangling-1.jpg

ggplot2 needs clean/tidy datasets

Datasets such as mpg or rail_trail (Assignment 1) are small and nicely organized

It would be nice if all datasets were like this! ...but they're the exceptions to the rule

Most raw datasets need cleaning, and this is where data scientists will spend most of their time

Source: Cleaning Big Data: Most Time-Consuming, Least Enjoyable Data Science Task, Survey Says, Digital image on forbes.com, accessed September 20, 2017, https://www.forbes.com/sites/gilpress/2016/03/23/data-preparation-most-time-consuming-least-enjoyable-data-science-task-survey-says/

6 / 24

https://blogs-images.forbes.com/gilpress/files/2016/03/Time-1200x511.jpg

The "data wrangling" pipeline

import → obtain data and get it into R

tidy → reshape rows and columns to follow the Tidy data rules (http://r4ds.had.co.nz/tidy-data.html)

transform → cleaning the dataset (not the same as tidying) as well as "slicing and dicing" the dataset for exploration and analysis.

7 / 24
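The three pipeline stages can be sketched in a few lines of R. This is only an illustration, not the course demo: the CSV text, the column names, and the object names (raw_csv, rainfall_wide, rainfall_tidy, rainfall_clean) are all made up, and it assumes recent versions of readr, tidyr, and dplyr are installed.

```r
library(readr)
library(tidyr)
library(dplyr)

# Hypothetical raw data: one row per city, one column per year ("wide" layout)
raw_csv <- "city,2016,2017
Fairfax,10,12
Arlington,7,NA
Richmond,15,18"

# import: get the data into R
# I() marks the string as literal data rather than a file path
rainfall_wide <- read_csv(I(raw_csv), col_types = "cdd")

# tidy: reshape so every row is one city-year observation
rainfall_tidy <- rainfall_wide %>%
  pivot_longer(cols = -city, names_to = "year", values_to = "rain_days")

# transform: drop the missing observation, then sort from most to fewest rain days
rainfall_clean <- rainfall_tidy %>%
  filter(!is.na(rain_days)) %>%
  arrange(desc(rain_days))

print(rainfall_clean)
```

Tidying changes the shape of the table (years move from column names into a year column), while transforming works on the values inside an already tidy table.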


Data wrangling in R

8 / 24

A few bits of R history

The first stable version of R, v1.0.0, was released on February 29, 2000.

R itself is an implementation of the S programming language, which was designed at Bell Laboratories in the mid-1970s.

Base R was built for statisticians and for doing data analysis, but not necessarily for modern Data Science

Its age and legacy bring along old implementations of data structures and abbreviated function (command) names

Source: David Smith, Over 16 years of R project history, Revolutions blog, last updated on March 4, 2016, accessed September 20, 2017, http://blog.revolutionanalytics.com/2016/03/16-years-of-r-history.html

9 / 24

https://en.wikipedia.org/wiki/S_(programming_language)


Modernizing R with tidyverse

Over the last 3 years, chief scientist at RStudio, Hadley Wickham, has brought R into the modern era with the tidyverse.

The tidyverse is an opinionated collection of R packages designed for data science. All packages share an underlying philosophy and common APIs. (Source: front page of the Tidyverse website)

In practice, this meant reducing everything to a small, core set of commands that all behave in a similar way.

10 / 24

https://www.tidyverse.org/

Core tidyverse

ggplot2 : ggplot2 is a system for declaratively creating graphics, based on The Grammar of Graphics. You provide the data, tell ggplot2 how to map variables to aesthetics, what graphical primitives to use, and it takes care of the details.

dplyr : dplyr provides a grammar of data manipulation, providing a consistent set of verbs that solve the most common data manipulation challenges.

tidyr : tidyr provides a set of functions that help you get to tidy data. Tidy data is data with a consistent form: in brief, every variable goes in a column, and every column is a variable.

Source: Tidyverse packages, tidyverse.org, accessed on September 20, 2017, https://www.tidyverse.org/packages/

11 / 24


Core tidyverse

readr : readr provides a fast and friendly way to read rectangular data (like csv, tsv, and fwf). It is designed to flexibly parse many types of data found in the wild, while still cleanly failing when data unexpectedly changes.

purrr : purrr enhances R's functional programming (FP) toolkit by providing a complete and consistent set of tools for working with functions and vectors. Once you master the basic concepts, purrr allows you to replace many for loops with code that is easier to write and more expressive.

tibble : tibble is a modern re-imagining of the data frame, keeping what time has proven to be effective, and throwing out what it has not. Tibbles are data.frames that are lazy and surly: they do less and complain more forcing you to confront problems earlier, typically leading to cleaner, more expressive code.

Source: Tidyverse packages, tidyverse.org, accessed on September 20, 2017, https://www.tidyverse.org/packages/
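To make "do less and complain more" concrete, here is a small sketch comparing a base data.frame with a tibble. The objects df and tb and the column name value are made up for illustration.

```r
library(tibble)

# Hypothetical one-column tables holding the same data
df <- data.frame(value = 1:3)
tb <- tibble(value = 1:3)

# Partial name matching: the data.frame silently matches "val" to "value"
df_partial <- df$val

# The tibble refuses: it warns about an unknown column and returns NULL
# (suppressWarnings() only keeps this sketch quiet when run as a script)
tb_partial <- suppressWarnings(tb$val)

# Single-column subsetting: the data.frame drops down to a plain vector,
# while the tibble always stays a tibble
df_col <- df[, "value"]
tb_col <- tb[, "value"]

print(df_partial)
print(tb_partial)
print(class(df_col))
print(class(tb_col))
```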

11 / 24


dplyr package

12 / 24

Get copy of dplyr demo repository

On website, spring18.cds101.com, click Materials → Class 8

Obtain a copy of the linked repository

Load in RStudio, and follow along in demos

13 / 24

select()

14 / 24

select() demo

Follow along in RStudio
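For reference outside of RStudio, a minimal sketch of what select() does: it picks columns and leaves rows alone. The presidents table below is a made-up stand-in, not the data from the demo repository.

```r
library(dplyr)

# Hypothetical stand-in for the demo data
presidents <- tibble(
  name  = c("Bush", "Obama", "Trump"),
  start = c(2001, 2009, 2017),
  party = c("Republican", "Democratic", "Republican")
)

# Keep columns by name
by_name <- presidents %>% select(name, party)

# Keep a range of adjacent columns
by_range <- presidents %>% select(name:start)

# Drop a column with a minus sign
by_drop <- presidents %>% select(-start)

# Keep columns whose names match a pattern
by_helper <- presidents %>% select(starts_with("s"))

print(by_name)
```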

15 / 24

%>% aside

Instead of this:

select(presidential, name, party)

We write this:

presidential %>% select(name, party)

Shows the order of transformations

Useful when we have to chain together many transformations!
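The benefit of the pipe is easier to see with more than one step. The sketch below assumes the presidential dataset comes from the ggplot2 package; the object names nested and piped are made up for illustration.

```r
library(dplyr)

# presidential ships with ggplot2 (assumed to be installed)
presidential <- ggplot2::presidential

# Nested calls read inside-out: select happens first, arrange second
nested <- arrange(select(presidential, name, party), name)

# Piped calls read top to bottom, in the order the steps are applied
piped <- presidential %>%
  select(name, party) %>%
  arrange(name)

# Both forms produce the same table
print(all.equal(nested, piped))
```

x %>% f(y) is just another way of writing f(x, y), so the pipe changes how the code reads, not what it computes.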

16 / 24

arrange()

17 / 24

arrange() demo
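A minimal sketch of arrange(), which reorders rows and leaves columns alone. The presidents table is a made-up stand-in, not the data from the demo repository.

```r
library(dplyr)

# Hypothetical stand-in for the demo data
presidents <- tibble(
  name  = c("Obama", "Bush", "Trump"),
  start = c(2009, 2001, 2017),
  party = c("Democratic", "Republican", "Republican")
)

# Sort ascending by one column
oldest_first <- presidents %>% arrange(start)

# Sort descending with desc()
newest_first <- presidents %>% arrange(desc(start))

# Sort by party first, then break ties by most recent start year
by_party <- presidents %>% arrange(party, desc(start))

print(by_party)
```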


18 / 24

slice()

19 / 24

slice() demo
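A minimal sketch of slice(), which picks rows by position. The presidents table is a made-up stand-in, not the data from the demo repository.

```r
library(dplyr)

# Hypothetical stand-in for the demo data
presidents <- tibble(
  name  = c("Bush", "Obama", "Trump"),
  start = c(2001, 2009, 2017)
)

# Keep the first row
first_row <- presidents %>% slice(1)

# Keep a range of rows
rows_2_to_3 <- presidents %>% slice(2:3)

# Drop a row with a negative index
without_first <- presidents %>% slice(-1)

# n() is the number of rows, so this keeps the last row
last_row <- presidents %>% slice(n())

print(last_row)
```

Because slice() works by position, its result depends on the current row order, so it is often used right after arrange().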


20 / 24

filter()

21 / 24

Comparisons

Simple comparisons can be made using the following symbols:

> : greater than

>= : greater than or equal to

< : less than

<= : less than or equal to

!= : not equal

== : equal
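A sketch of these comparison symbols inside filter(), which keeps the rows where the comparison is TRUE. The toy_cars table and its values are made up for illustration.

```r
library(dplyr)

# Hypothetical data: highway miles per gallon and cylinder count for four cars
toy_cars <- tibble(
  model = c("a4", "civic", "jetta", "mustang"),
  cyl   = c(4, 4, 4, 8),
  hwy   = c(29, 33, 44, 23)
)

# Greater than
efficient <- toy_cars %>% filter(hwy > 30)

# Less than or equal to
thirsty <- toy_cars %>% filter(hwy <= 29)

# Equal: note the double equals sign, a single = is not a comparison
four_cyl <- toy_cars %>% filter(cyl == 4)

# Not equal
not_four_cyl <- toy_cars %>% filter(cyl != 4)

print(efficient)
```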

22 / 24

Logical operators

Source: Digital image of logical operations, R for Data Science website, accessed September 20, 2017, http://r4ds.had.co.nz/transform.html#logical-operators
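The operations in the diagram are & (and), | (or), ! (not), and xor() (exactly one of the two). A sketch of how they combine comparisons inside filter(), using a made-up toy_cars table:

```r
library(dplyr)

# Hypothetical data: highway miles per gallon and cylinder count for four cars
toy_cars <- tibble(
  model = c("a4", "civic", "jetta", "mustang"),
  cyl   = c(4, 4, 4, 8),
  hwy   = c(29, 33, 44, 23)
)

# and: both conditions must be TRUE
both <- toy_cars %>% filter(cyl == 4 & hwy > 30)

# Separating conditions with a comma means the same thing as &
both_comma <- toy_cars %>% filter(cyl == 4, hwy > 30)

# or: at least one condition must be TRUE
either <- toy_cars %>% filter(cyl == 8 | hwy > 40)

# not: flips TRUE and FALSE
negated <- toy_cars %>% filter(!(cyl == 4))

# xor: exactly one condition is TRUE, but not both
exactly_one <- toy_cars %>% filter(xor(cyl == 4, hwy < 30))

print(exactly_one)
```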

23 / 24

http://r4ds.had.co.nz/diagrams/transform-logical.png


filter() demo


24 / 24
