Date post: 18-Dec-2021
STOR 320 Data Transformation I Lecture 4 Yao Li Department of Statistics and Operations Research UNC Chapel Hill
Page 1: STOR 320 Data Transformation I - liyao880.github.io

STOR 320 Data Transformation I

Lecture 4
Yao Li

Department of Statistics and Operations Research
UNC Chapel Hill

Page 2: STOR 320 Data Transformation I - liyao880.github.io

Introduction

• Read Chapter 5

• Goal: Their Data → Your Data

• Covers:
  • Data Subsetting
  • Data Ordering
  • Variable Selecting
  • Variable Creating

• Help: dplyr Package in R

Page 3: STOR 320 Data Transformation I - liyao880.github.io

NYC Flights Meta Data

• Requirements:

> install.packages("nycflights13")
> library(nycflights13)

• All 2013 Flights from NYC - US Bureau of Trans. Statistics

• To View all Data, Use

> View(flights)

• For more information,

> ?flights

Page 4: STOR 320 Data Transformation I - liyao880.github.io

NYC Flights Data

• Four Different Types of Variables
  • int = integer
  • dbl = double
  • chr = character
  • dttm = dates and times

• Other Types of Variables
  • lgl = logical (TRUE or FALSE)
  • fctr = factor
  • date = dates

Page 5: STOR 320 Data Transformation I - liyao880.github.io

Basics of dplyr: 5 Key Functions

• 5 Key Functions
  • filter() = Chooses Observations Based on Values
  • arrange() = Sorts Observations
  • select() = Chooses Variables
  • mutate() = Creates New Variables
  • summarise() = Generates Statistics From Data

Page 6: STOR 320 Data Transformation I - liyao880.github.io

Basics of dplyr

• Function Usage
  • First, Specify the Dataset
  • Next, Specify What to Do with the Data
  • Result is a New Dataset

> filter(flights, month==9)
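The usage pattern above can be sketched on a tiny stand-in table. The table `toy` and its values are hypothetical and are used only so the example runs without downloading nycflights13.

```r
library(dplyr)

# Hypothetical stand-in for flights: three rows, two columns
toy <- tibble(month = c(1, 9, 9), dep_delay = c(5, -2, 30))

# First argument: the dataset; remaining arguments: what to do with it
sept <- filter(toy, month == 9)

# The result is a new dataset; toy itself is unchanged
nrow(toy)   # 3
nrow(sept)  # 2
```

To keep a result, assign it to a name as done with `sept`; otherwise it is only printed.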

Page 7: STOR 320 Data Transformation I - liyao880.github.io

Comparisons

• Important Operators
  • Less Than (<)
  • Greater Than (>)
  • Not Equal (!=)
  • Equal (==)

• Returns TRUE or FALSE

Page 8: STOR 320 Data Transformation I - liyao880.github.io

Numerical Precision


• Problem

• Solution
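The slide's own example did not survive extraction. The sketch below is an assumption that it follows the standard floating-point illustration from the assigned chapter: exact comparison with `==` fails after approximate arithmetic, and `dplyr::near()` compares within a small tolerance instead.

```r
library(dplyr)

# Problem: computers store approximations, so exact equality can fail
sqrt(2)^2 == 2        # FALSE
1 / 49 * 49 == 1      # FALSE

# Solution: compare within a tolerance using near()
near(sqrt(2)^2, 2)    # TRUE
near(1 / 49 * 49, 1)  # TRUE
```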

Page 9: STOR 320 Data Transformation I - liyao880.github.io

Logical Operators

• Boolean Logic
  • And (&)
  • Or (|)
  • Not (!)

• Example

> filter(flights, month == 9 & day == 1)

> filter(flights, month == 9 | day == 1)

> filter(flights, month == 9 & !(day == 1))
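To see how the three operators differ, the same filters can be run on a four-row table where the answers are easy to count by hand. The table `toy` is hypothetical.

```r
library(dplyr)

# Hypothetical table covering every month/day combination used below
toy <- tibble(month = c(9, 9, 1, 1), day = c(1, 2, 1, 2))

both    <- filter(toy, month == 9 & day == 1)     # Sept AND the 1st
either  <- filter(toy, month == 9 | day == 1)     # Sept OR the 1st
and_not <- filter(toy, month == 9 & !(day == 1))  # Sept but NOT the 1st

nrow(both)     # 1
nrow(either)   # 3
nrow(and_not)  # 1
```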

Page 10: STOR 320 Data Transformation I - liyao880.github.io

Missing Values

• Represented by NA

• Enduring Questions
  • To Impute or Not Impute
  • To Ignore or Not Ignore

• Handling Should Be Explained

• Be Careful When Performing Operations on Missing Data
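A short sketch of why care is needed: almost any operation involving NA returns NA, including a comparison of NA with itself. The vector `x` is a made-up example.

```r
# Operations on an unknown value give an unknown result
NA > 5     # NA
NA + 10    # NA
NA == NA   # NA, so == cannot be used to find missing values

# Hypothetical vector with one missing entry
x <- c(1, NA, 3)

is.na(x)               # FALSE TRUE FALSE
mean(x)                # NA
mean(x, na.rm = TRUE)  # 2
```

Use `is.na()` to test for missing values, and state explicitly (for example with `na.rm = TRUE`) when they are being ignored.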

Page 11: STOR 320 Data Transformation I - liyao880.github.io

Missing Values


Page 12: STOR 320 Data Transformation I - liyao880.github.io

filter()

• Used to Subset Observations Based on Their Values

• Selects Row if TRUE
• Removes Row if FALSE

• Examples:

• All Flights from 9/13/2013 Out of LaGuardia Airport

> filter(flights, month == 9, day == 13, origin == "LGA")

• All Dec. or Nov. Flights

> filter(flights, month == 11 | month == 12)

> filter(flights, month %in% c(11, 12))

Page 13: STOR 320 Data Transformation I - liyao880.github.io

filter()

• Examples:

• Don’t Want Flights with Unusual Delays (> 120 min.)

> filter(flights, !(arr_delay > 120 | dep_delay > 120))

> filter(flights, arr_delay <= 120, dep_delay <= 120)

• Want Flights with No Delays

> filter(flights, dep_delay == 0 & arr_delay == 0)

> filter(flights, dep_delay == 0, arr_delay == 0)

!(x | y) = !x & !y
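The identity !(x | y) = !x & !y is why the two "unusual delay" filters return the same rows. A minimal check on a hypothetical four-row table, which also includes an NA to show that both forms drop rows whose condition is unknown:

```r
library(dplyr)

# Hypothetical delays in minutes; the third row has a missing arrival delay
toy <- tibble(
  arr_delay = c(10, 130, NA, 50),
  dep_delay = c(5, 20, 10, 200)
)

a <- filter(toy, !(arr_delay > 120 | dep_delay > 120))
b <- filter(toy, arr_delay <= 120, dep_delay <= 120)

# Both forms keep only the first row
nrow(a)                 # 1
isTRUE(all.equal(a, b)) # TRUE
```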

Page 14: STOR 320 Data Transformation I - liyao880.github.io

filter()

• Examples:

• Want Flights Missing Air Time

> filter(flights, is.na(air_time))

• Do not Want Flights Missing Air Time

> filter(flights, !is.na(air_time))

• Remove All Cases with Missing Values

> na.omit(flights)
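The difference between the three calls is easiest to see on a small table with missing values in more than one column. The table `toy` is hypothetical.

```r
library(dplyr)

# Hypothetical table: NA appears in two different columns
toy <- tibble(
  air_time  = c(100, NA, 200),
  dep_delay = c(NA, 5, 0)
)

missing_air  <- filter(toy, is.na(air_time))   # rows where air_time is NA
has_air      <- filter(toy, !is.na(air_time))  # rows where air_time is known
complete     <- na.omit(toy)                   # rows with no NA in ANY column

nrow(missing_air)  # 1
nrow(has_air)      # 2
nrow(complete)     # 1
```

Note that `na.omit()` is the strictest of the three: it removes a row if any variable is missing, not just `air_time`.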

Page 15: STOR 320 Data Transformation I - liyao880.github.io

arrange()

• Used to Sort Observations
  • Sort flights by date
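The slide's code for this example was lost in extraction. A minimal sketch of sorting by date, assuming the usual approach of listing year, month, and day in order; the table `toy` is hypothetical.

```r
library(dplyr)

# Hypothetical flights-like table with dates out of order
toy <- tibble(year = 2013, month = c(3, 1, 1), day = c(5, 20, 2))

# Later columns break ties in earlier ones
by_date <- arrange(toy, year, month, day)
by_date$day  # 2 20 5

# desc() reverses the order for a column
latest_first <- arrange(toy, desc(month), desc(day))
latest_first$day  # 5 20 2
```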

Page 16: STOR 320 Data Transformation I - liyao880.github.io

arrange()


• Sorting Experiment

Page 17: STOR 320 Data Transformation I - liyao880.github.io

arrange()


• Handling NA
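The slide's example is missing from the transcript. As a sketch of the behavior it most likely shows: `arrange()` places missing values at the end, whether the sort is ascending or descending. The column `x` is a made-up example.

```r
library(dplyr)

# Hypothetical column with one missing value
toy <- tibble(x = c(5, NA, 2))

ascending  <- arrange(toy, x)
descending <- arrange(toy, desc(x))

ascending$x   # 2 5 NA
descending$x  # 5 2 NA
```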

Page 18: STOR 320 Data Transformation I - liyao880.github.io

select()


• Used to Select Variables

• Why? Not All Variables are Created Equal

• Need to Know Variable Names

Page 19: STOR 320 Data Transformation I - liyao880.github.io

select()

• Basic Examples

• Select Only Year, Month, Day

• Select All Variables from dep_time to arr_delay

• Deselect All Variables from dep_time to arr_delay
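The code for these three examples did not survive extraction. A minimal sketch on a hypothetical one-row table that has fewer columns than the real `flights` data, so the selected ranges are shorter than they would be there:

```r
library(dplyr)

# Hypothetical flights-like table (the real flights has more columns)
toy <- tibble(
  year = 2013, month = 1, day = 1,
  dep_time = 517, dep_delay = 2,
  arr_time = 830, arr_delay = 11,
  carrier = "UA"
)

ymd     <- select(toy, year, month, day)         # by name
range   <- select(toy, dep_time:arr_delay)       # inclusive range of columns
dropped <- select(toy, -(dep_time:arr_delay))    # everything except that range

names(range)    # "dep_time" "dep_delay" "arr_time" "arr_delay"
names(dropped)  # "year" "month" "day" "carrier"
```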

Page 20: STOR 320 Data Transformation I - liyao880.github.io

select()


• Select Based on Column Index

• Deselect Based on Column Index
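Column positions can stand in for names. The sketch below uses a hypothetical five-column table; the slide's own code is not in the transcript.

```r
library(dplyr)

# Hypothetical table; positions are 1 = year, 2 = month, ..., 5 = carrier
toy <- tibble(year = 2013, month = 1, day = 1, dep_time = 517, carrier = "UA")

first_three <- select(toy, 1:3)     # keep columns 1 through 3
the_rest    <- select(toy, -(1:3))  # drop columns 1 through 3

names(first_three)  # "year" "month" "day"
names(the_rest)     # "dep_time" "carrier"
```

Selecting by index breaks silently if the column order changes, which is one reason to know the variable names.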

Page 21: STOR 320 Data Transformation I - liyao880.github.io

select()

• Select Based on Text

• starts_with("TEXT")

• ends_with("TEXT")

• contains("TEXT")
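A sketch of the three text helpers on a hypothetical table whose column names mirror a few of those in `flights`:

```r
library(dplyr)

# Hypothetical one-row table with flights-style column names
toy <- tibble(
  dep_time = 517, dep_delay = 2,
  arr_time = 830, arr_delay = 11,
  air_time = 227, carrier = "UA"
)

names(select(toy, starts_with("dep")))   # "dep_time" "dep_delay"
names(select(toy, ends_with("delay")))   # "dep_delay" "arr_delay"
names(select(toy, contains("time")))     # "dep_time" "arr_time" "air_time"
```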

Page 22: STOR 320 Data Transformation I - liyao880.github.io

select()


• Renaming Variables

• Can Use select()

• But Use rename()
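The reason to prefer `rename()` is that `select()` drops every column it does not mention. A minimal sketch; the table `toy` and the new name `tail_num` are hypothetical.

```r
library(dplyr)

# Hypothetical two-column table
toy <- tibble(tailnum = "N123", carrier = "UA")

# select() renames, but keeps only the columns listed
via_select <- select(toy, tail_num = tailnum)
names(via_select)  # "tail_num"

# rename() renames and keeps all other columns
via_rename <- rename(toy, tail_num = tailnum)
names(via_rename)  # "tail_num" "carrier"
```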

Page 23: STOR 320 Data Transformation I - liyao880.github.io

select()


• Reordering Variables
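The slide's code is missing here. One common way to reorder, shown as an assumption about what the slide demonstrated, is to name the columns to move to the front and let `everything()` fill in the rest. The table `toy` is hypothetical.

```r
library(dplyr)

# Hypothetical three-column table
toy <- tibble(year = 2013, month = 1, carrier = "UA")

# Move carrier to the front; everything() keeps the remaining columns in order
reordered <- select(toy, carrier, everything())

names(reordered)  # "carrier" "year" "month"
```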

Page 24: STOR 320 Data Transformation I - liyao880.github.io

mutate()

• Used to Create New Variables
  • Creative New Metrics
  • Modify Units
  • Transform Variables
  • Unique Identifiers
  • Numeric to Categorical
  • Categorical to Numeric

• Reduced Dataset

Page 25: STOR 320 Data Transformation I - liyao880.github.io

mutate()


• Example of mutate()

• Example of transmute()
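The slide's two examples are not in the transcript. The sketch below illustrates the difference under assumed inputs: `mutate()` adds new variables to the existing ones, while `transmute()` keeps only the new ones. The table `toy` and the variable `speed` are hypothetical.

```r
library(dplyr)

# Hypothetical table: distance in miles, air_time in minutes
toy <- tibble(distance = c(100, 300), air_time = c(50, 100))

# mutate(): original columns plus the new one
with_speed <- mutate(toy, speed = distance / air_time * 60)
names(with_speed)  # "distance" "air_time" "speed"

# transmute(): only the new column
only_speed <- transmute(toy, speed = distance / air_time * 60)
names(only_speed)  # "speed"

with_speed$speed  # 120 180 (miles per hour)
```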

Page 26: STOR 320 Data Transformation I - liyao880.github.io

mutate()

• Plethora of Examples
  • Basic and Modular Arithmetic

517 = 100 * 5 + 17 = 100 * (517 %/% 100) + (517 %% 100)

5:17 AM, 5:33 AM, 5:42 AM
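The decomposition above is how a clock time stored as a number such as 517 can be split into hours and minutes. A sketch using the three times shown; the column names `hour`, `minute`, and `mins_since_midnight` are chosen here for illustration.

```r
library(dplyr)

# Departure times stored as HMM numbers: 5:17, 5:33, 5:42
toy <- tibble(dep_time = c(517, 533, 542))

times <- mutate(
  toy,
  hour   = dep_time %/% 100,   # integer division: the hour part
  minute = dep_time %% 100,    # remainder: the minute part
  mins_since_midnight = hour * 60 + minute
)

times$hour                 # 5 5 5
times$minute               # 17 33 42
times$mins_since_midnight  # 317 333 342
```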

Page 27: STOR 320 Data Transformation I - liyao880.github.io

mutate()

• Plethora of Examples
  • Nonlinear Transformation
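The slide's example is missing from the transcript, so the choice of functions here is an assumption. Logarithms and square roots are typical nonlinear transformations applied inside `mutate()`; the table `toy` and the new column names are hypothetical.

```r
library(dplyr)

# Hypothetical distances chosen so the results are round numbers
toy <- tibble(distance = c(256, 1024))

transformed <- mutate(
  toy,
  log2_distance = log2(distance),  # each +1 means the distance doubled
  sqrt_distance = sqrt(distance)
)

transformed$log2_distance  # 8 10
transformed$sqrt_distance  # 16 32
```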

Page 28: STOR 320 Data Transformation I - liyao880.github.io

mutate()

• Plethora of Examples
  • Offsets: lead() and lag()
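A minimal sketch of the two offset functions: `lag()` shifts values down by one position and `lead()` shifts them up, filling the gap with NA. The table `toy` and the column names are hypothetical; `dplyr::` is written out because base R also has a `lag()` function.

```r
library(dplyr)

# Hypothetical sequence of values
toy <- tibble(x = c(1, 3, 6, 10))

offsets <- mutate(
  toy,
  previous = dplyr::lag(x),     # value from the row before
  following = dplyr::lead(x),   # value from the row after
  change = x - dplyr::lag(x)    # difference from the previous row
)

offsets$previous   # NA 1 3 6
offsets$following  # 3 6 10 NA
offsets$change     # NA 2 3 4
```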

Page 29: STOR 320 Data Transformation I - liyao880.github.io

mutate()

• Plethora of Examples
  • Cumulative and Rolling Aggregates
    • cumsum()
    • cumprod()
    • cummin()
    • cummax()
    • cummean()
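A sketch of the cumulative functions on a made-up vector. The first four are base R; `cummean()` comes from dplyr and is the running mean, that is, `cumsum(x) / seq_along(x)`.

```r
library(dplyr)

# Hypothetical values
x <- c(3, 1, 4, 2)

cumsum(x)   # 3 4 8 10    running total
cumprod(x)  # 3 3 12 24   running product
cummin(x)   # 3 1 1 1     smallest value so far
cummax(x)   # 3 3 4 4     largest value so far

# Running mean; equal to cumsum(x) / seq_along(x)
running_mean <- cummean(x)
```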

Page 30: STOR 320 Data Transformation I - liyao880.github.io

mutate()

• Plethora of Examples
  • Ranking
    • min_rank()
    • percent_rank()
    • cume_dist()
    • ntile()
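A sketch of the four ranking functions on a made-up vector with one tie, to show how each treats equal values:

```r
library(dplyr)

# Hypothetical values; the two 20s are tied
x <- c(10, 20, 20, 40)

min_rank(x)      # 1 2 2 4              ties share the smallest rank
percent_rank(x)  # 0 0.333 0.333 1      (min_rank - 1) / (n - 1)
cume_dist(x)     # 0.25 0.75 0.75 1     share of values <= each value
ntile(x, 2)      # 1 1 2 2              split into 2 roughly equal buckets
```

To rank from largest to smallest, wrap the variable in `desc()`, for example `min_rank(desc(x))`.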

Page 31: STOR 320 Data Transformation I - liyao880.github.io

Information

• Tutorial 3
  • Practice
    • filter()
    • arrange()
    • select()
    • mutate()
  • Introduced
    • Piping %>%
    • group_by()
    • summarize()

