Introduction to R and the tidyverse · 2017. 11. 21.

Introduction to R and the tidyverse Paolo Crosetto
  • Introduction to R and the tidyverse

    Paolo Crosetto

  • Lecture 2: basic data manipulation

  • Before we start: nycflights

    - install.packages("nycflights13")
    - some data about all flights from New York airports in 2013
    - we get to know arrival and departure times, delays, carrier, some info about the plane
    - origin, destination, and so on
    - not particularly interesting per se but big (336K observations)

  • Getting to know the data: inspection

    library(nycflights13)
    flights

    ## # A tibble: 336,776 x 19
    ##     year month   day dep_time sched_dep_time dep_delay arr_time
    ##    <int> <int> <int>    <int>          <int>     <dbl>    <int>
    ##  1  2013     1     1      517            515         2      830
    ##  2  2013     1     1      533            529         4      850
    ##  3  2013     1     1      542            540         2      923
    ##  4  2013     1     1      544            545        -1     1004
    ##  5  2013     1     1      554            600        -6      812
    ##  6  2013     1     1      554            558        -4      740
    ##  7  2013     1     1      555            600        -5      913
    ##  8  2013     1     1      557            600        -3      709
    ##  9  2013     1     1      557            600        -3      838
    ## 10  2013     1     1      558            600        -2      753
    ## # ... with 336,766 more rows, and 12 more variables: sched_arr_time <int>,
    ## #   arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
    ## #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
    ## #   minute <dbl>, time_hour <dttm>

  • Getting to know the data: View

    View(flights)

    - View() opens an RStudio data window
    - in that window you can
        - sort
        - arrange
        - inspect variables

  • Inspecting data and summary statistics

    summary(flights)

    ##       year          month             day           dep_time
    ##  Min.   :2013   Min.   : 1.000   Min.   : 1.00   Min.   :   1
    ##  1st Qu.:2013   1st Qu.: 4.000   1st Qu.: 8.00   1st Qu.: 907
    ##  Median :2013   Median : 7.000   Median :16.00   Median :1401
    ##  Mean   :2013   Mean   : 6.549   Mean   :15.71   Mean   :1349
    ##  3rd Qu.:2013   3rd Qu.:10.000   3rd Qu.:23.00   3rd Qu.:1744
    ##  Max.   :2013   Max.   :12.000   Max.   :31.00   Max.   :2400
    ##                                                  NA's   :8255
    ##  sched_dep_time   dep_delay          arr_time    sched_arr_time
    ##  Min.   : 106   Min.   : -43.00   Min.   :   1   Min.   :   1
    ##  1st Qu.: 906   1st Qu.:  -5.00   1st Qu.:1104   1st Qu.:1124
    ##  Median :1359   Median :  -2.00   Median :1535   Median :1556
    ##  Mean   :1344   Mean   :  12.64   Mean   :1502   Mean   :1536
    ##  3rd Qu.:1729   3rd Qu.:  11.00   3rd Qu.:1940   3rd Qu.:1945
    ##  Max.   :2359   Max.   :1301.00   Max.   :2400   Max.   :2359
    ##                 NA's   :8255      NA's   :8713
    ##    arr_delay          carrier              flight       tailnum
    ##  Min.   : -86.000   Length:336776      Min.   :   1   Length:336776
    ##  1st Qu.: -17.000   Class :character   1st Qu.: 553   Class :character
    ##  Median :  -5.000   Mode  :character   Median :1496   Mode  :character
    ##  Mean   :   6.895                      Mean   :1972
    ##  3rd Qu.:  14.000                      3rd Qu.:3465
    ##  Max.   :1272.000                      Max.   :8500
    ##  NA's   :9430
    ##     origin              dest              air_time        distance
    ##  Length:336776      Length:336776      Min.   : 20.0   Min.   :  17
    ##  Class :character   Class :character   1st Qu.: 82.0   1st Qu.: 502
    ##  Mode  :character   Mode  :character   Median :129.0   Median : 872
    ##                                        Mean   :150.7   Mean   :1040
    ##                                        3rd Qu.:192.0   3rd Qu.:1389
    ##                                        Max.   :695.0   Max.   :4983
    ##                                        NA's   :9430
    ##       hour           minute        time_hour
    ##  Min.   : 1.00   Min.   : 0.00   Min.   :2013-01-01 05:00:00
    ##  1st Qu.: 9.00   1st Qu.: 8.00   1st Qu.:2013-04-04 13:00:00
    ##  Median :13.00   Median :29.00   Median :2013-07-03 10:00:00
    ##  Mean   :13.18   Mean   :26.23   Mean   :2013-07-03 05:02:36
    ##  3rd Qu.:17.00   3rd Qu.:44.00   3rd Qu.:2013-10-01 07:00:00
    ##  Max.   :23.00   Max.   :59.00   Max.   :2013-12-31 23:00:00

  • Importing data in your workspace

    - flights is not yet in your workspace
    - if you want to import it, you have to do it explicitly
    - using
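The code on this slide is cut off after "using". Later slides operate on a data frame named df, so a plausible sketch, assuming the missing line was a plain assignment, is:

```r
library(nycflights13)

# Assumption: the truncated slide copied flights into the workspace under the
# name df, which is the name the later slides use.
df <- flights

dim(df)  # 336776 rows and 19 columns, matching the tibble header printed earlier
```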

  • Data manipulation: dplyr

    we will use the package dplyr, from the tidyverse

    - don’t worry about the strange name, it has a reason.
    - install it with the usual install.packages("dplyr")
    - but it should be already there from last time when we installed ggplot

  • Structure of dplyr

    - the idea is to have simple, direct verbs to do the main jobs for you
    - dplyr verbs take a data.frame as a first argument
    - do some manipulations on it
    - and always return a data.frame
    - they do not alter the data saved in memory
    - which is good, because you can manipulate data without altering it
    - if you want to save the altered dataset, assign the result to a name
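A short sketch of what "do not alter the data" means in practice. It assumes df holds the flights data, and df_december is a hypothetical name chosen for illustration.

```r
library(dplyr)
library(nycflights13)

df <- flights  # assumed starting point

# The verb returns a new data frame; df itself is left untouched.
filter(df, month == 12)
nrow(df)           # still 336776

# To keep the result, assign it to a name (df_december is hypothetical).
df_december <- filter(df, month == 12)
nrow(df_december)  # 28135, the December row count reported in the slides
```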

  • Data manipulation: filter()

    filter() lets you extract rows from the data frame

    - filter() takes two arguments: filter(data, logic)
    - data is your data
    - logic is a logical statement that tells filter() what must be included or not
    - R understands all usual comparison operators >, >=,
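The list of operators is cut off in the transcript. As a sketch, these are the comparison operators base R provides, each used inside filter(). The thresholds, the carrier code and the result names are illustrative choices, and df is assumed to hold the flights data.

```r
library(dplyr)
library(nycflights13)

df <- flights  # assumed, as in the later slides

late      <- filter(df, dep_delay > 60)    # strictly greater than
late_incl <- filter(df, dep_delay >= 60)   # greater than or equal to
early     <- filter(df, dep_delay < 0)     # strictly less than
on_time   <- filter(df, dep_delay <= 0)    # less than or equal to
january   <- filter(df, month == 1)        # equal to (note the double ==)
not_ua    <- filter(df, carrier != "UA")   # not equal to

# Rows for which the condition evaluates to NA are dropped by filter().
```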

  • Filtering

    all December flights

    filter(df, month == 12)

    ## # A tibble: 28,135 x 19
    ##     year month   day dep_time sched_dep_time dep_delay arr_time
    ##    <int> <int> <int>    <int>          <int>     <dbl>    <int>
    ##  1  2013    12     1       13           2359        14      446
    ##  2  2013    12     1       17           2359        18      443
    ##  3  2013    12     1      453            500        -7      636
    ##  4  2013    12     1      520            515         5      749
    ##  5  2013    12     1      536            540        -4      845
    ##  6  2013    12     1      540            550       -10     1005
    ##  7  2013    12     1      541            545        -4      734
    ##  8  2013    12     1      546            545         1      826
    ##  9  2013    12     1      549            600       -11      648
    ## 10  2013    12     1      550            600       -10      825
    ## # ... with 28,125 more rows, and 12 more variables: sched_arr_time <int>,
    ## #   arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
    ## #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
    ## #   minute <dbl>, time_hour <dttm>

  • Filtering over multiple criteria

    all Christmas flights that departed around midday

    filter(df, month == 12 & day == 25 & dep_time>1100 & dep_time
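The call on this slide is truncated in the transcript. A completed sketch, assuming the upper bound is 1300 as in the later midday examples (the result names are hypothetical):

```r
library(dplyr)
library(nycflights13)

df <- flights  # assumed, as in the later slides

# Assumption: "around midday" means dep_time strictly between 1100 and 1300.
xmas_midday <- filter(df, month == 12 & day == 25 &
                          dep_time > 1100 & dep_time < 1300)

# filter() also accepts comma-separated conditions, which are combined with AND.
xmas_midday2 <- filter(df, month == 12, day == 25,
                       dep_time > 1100, dep_time < 1300)
```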

  • multiple or statements and %in%

    - multiple ‘or’ statements can be tricky
    - you want all summer flights (June, July, August)
    - then you should do filter(df, month == 6 | month == 7 | month == 8)
    - can become cumbersome
    - you can use an alternative notation, filter(df, month %in% c(6,7,8))
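A quick check that the two notations select the same rows, plus one common slip that illustrates why 'or' statements are tricky. It assumes df holds the flights data; the result names are hypothetical.

```r
library(dplyr)
library(nycflights13)

df <- flights  # assumed, as in the later slides

summer_or <- filter(df, month == 6 | month == 7 | month == 8)
summer_in <- filter(df, month %in% c(6, 7, 8))
nrow(summer_or) == nrow(summer_in)  # TRUE

# A common slip: this does not compare month with 7 and 8. The bare numbers 7
# and 8 are treated as TRUE, so the condition holds for every row.
not_summer_only <- filter(df, month == 6 | 7 | 8)
nrow(not_summer_only) == nrow(df)   # TRUE
```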

  • Sort data: arrange()

    - arrange() lets you sort data
    - it is like the sorting you do in View(), but:
        1. it is done on the console, and
        2. it lets you save the new order to a data frame
    - this can be useful when looking for special observations, like the first flight of each day
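A minimal sketch of both uses, assuming df holds the flights data. The result names are hypothetical; desc() is the dplyr helper for descending order.

```r
library(dplyr)
library(nycflights13)

df <- flights  # assumed, as in the later slides

# Ascending by default. Sorting by month, day and departure time puts the
# first departure of each day at the top of that day's block of rows.
df_sorted <- arrange(df, month, day, dep_time)

# desc() reverses the order, so the most delayed departures come first.
# Missing values are placed last in either direction.
df_most_delayed <- arrange(df, desc(dep_delay))
df_most_delayed$dep_delay[1]  # 1301, the maximum reported by summary(flights)
```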

  • arrange()

    arrange(df, dep_time)

    ## # A tibble: 336,776 x 19
    ##     year month   day dep_time sched_dep_time dep_delay arr_time
    ##    <int> <int> <int>    <int>          <int>     <dbl>    <int>
    ##  1  2013     1    13        1           2249        72      108
    ##  2  2013     1    31        1           2100       181      124
    ##  3  2013    11    13        1           2359         2      442
    ##  4  2013    12    16        1           2359         2      447
    ##  5  2013    12    20        1           2359         2      430
    ##  6  2013    12    26        1           2359         2      437
    ##  7  2013    12    30        1           2359         2      441
    ##  8  2013     2    11        1           2100       181      111
    ##  9  2013     2    24        1           2245        76      121
    ## 10  2013     3     8        1           2355         6      431
    ## # ... with 336,766 more rows, and 12 more variables: sched_arr_time <int>,
    ## #   arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
    ## #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
    ## #   minute <dbl>, time_hour <dttm>

  • Selecting columns: select()

    What filter() does with rows, select() does with columns

    select(df, dep_time, carrier)

    ## # A tibble: 336,776 x 2
    ##    dep_time carrier
    ##       <int>   <chr>
    ##  1      517      UA
    ##  2      533      UA
    ##  3      542      AA
    ##  4      544      B6
    ##  5      554      DL
    ##  6      554      UA
    ##  7      555      B6
    ##  8      557      EV
    ##  9      557      B6
    ## 10      558      AA
    ## # ... with 336,766 more rows

  • select() examples

    - you want more than one variable: you list all variables by bare name: select(data, var1, var2, var3, ...)
    - you want to exclude (drop) some variables: you use the minus sign: select(data, -var1, -var2, ...)
    - you want to exploit naming patterns: you use starts_with("string"), ends_with("string"), contains("string")
    - you want to select all variables: you use everything()
    - everything() is useful to reorder variables
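The patterns above, applied to the flights data as a sketch. It assumes df holds the flights data; the result names are hypothetical.

```r
library(dplyr)
library(nycflights13)

df <- flights  # assumed, as in the later slides

date_cols  <- select(df, year, month, day)        # several variables by bare name
no_year    <- select(df, -year)                   # drop one variable
dep_cols   <- select(df, starts_with("dep"))      # dep_time, dep_delay
time_cols  <- select(df, ends_with("time"))       # dep_time, sched_dep_time, arr_time,
                                                  # sched_arr_time, air_time
delay_cols <- select(df, contains("delay"))       # dep_delay, arr_delay
reordered  <- select(df, carrier, everything())   # carrier first, then all other columns
```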

  • renaming variables: rename()

    - rename() is a version of select() that keeps all variables
    - you use it as rename(data, newname = oldname)

    rename(df, mois = month)

    ## # A tibble: 336,776 x 19
    ##     year  mois   day dep_time sched_dep_time dep_delay arr_time
    ##    <int> <int> <int>    <int>          <int>     <dbl>    <int>
    ##  1  2013     1     1      517            515         2      830
    ##  2  2013     1     1      533            529         4      850
    ##  3  2013     1     1      542            540         2      923
    ##  4  2013     1     1      544            545        -1     1004
    ##  5  2013     1     1      554            600        -6      812
    ##  6  2013     1     1      554            558        -4      740
    ##  7  2013     1     1      555            600        -5      913
    ##  8  2013     1     1      557            600        -3      709
    ##  9  2013     1     1      557            600        -3      838
    ## 10  2013     1     1      558            600        -2      753
    ## # ... with 336,766 more rows, and 12 more variables: sched_arr_time <int>,
    ## #   arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
    ## #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
    ## #   minute <dbl>, time_hour <dttm>

  • creating new variables: mutate()

    you want to create a new variable with some manipulation

    - you can use mutate(data, newvar = f(oldvar))
    - where f() is some function or manipulation

    df_new
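The assignment on this slide is truncated in the transcript. A minimal sketch of the pattern, borrowing the speed formula that appears later in the slides; whether the original slide computed this particular variable is an assumption.

```r
library(dplyr)
library(nycflights13)

df <- flights  # assumed, as in the later slides

# air_time is in minutes and distance in miles, so this gives miles per hour.
df_new <- mutate(df, speed = distance / air_time * 60)

select(df_new, distance, air_time, speed)
```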

  • mutate() properties

    - you can create more variables in one single mutate() call

    df_new
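The code for this slide is also truncated. As a sketch, several variables can be created by separating the definitions with commas; both variables here, and the name gain in particular, are illustrative assumptions.

```r
library(dplyr)
library(nycflights13)

df <- flights  # assumed, as in the later slides

df_new <- mutate(df,
                 speed = distance / air_time * 60,  # miles per hour
                 gain  = dep_delay - arr_delay)     # minutes made up in the air
```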

  • mutate() properties

    - you can use the newly created variables right away

    df_new
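The code for this slide is truncated as well. A sketch of the idea, with hypothetical variable names: a column defined earlier in a mutate() call can be used by a later definition in the same call.

```r
library(dplyr)
library(nycflights13)

df <- flights  # assumed, as in the later slides

df_new <- mutate(df,
                 air_time_hours = air_time / 60,
                 speed = distance / air_time_hours)  # uses the column created one line above
```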

  • functions that work with mutate()

    - all vector functions: input a vector, output a vector
    - arithmetic operations: + - * / ^ (or **), exp(), sqrt(), log()
    - offset: lag() to refer to the -1 period. useful for increments
    - cumulative functions: cumsum(), etc.
    - logical (see filter())
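A small sketch of the offset, cumulative and logical cases on toy data invented for illustration (the data frame and all column names are hypothetical):

```r
library(dplyr)

# Toy data, invented for illustration.
toy <- tibble(period = 1:4, x = c(1, 3, 6, 10))

toy_new <- mutate(toy,
                  previous  = lag(x),      # value of x one period earlier, NA in the first row
                  increment = x - lag(x),  # change from the previous period
                  running   = cumsum(x),   # cumulative sum
                  big       = x > 5)       # logical result, one value per row

toy_new$increment  # NA 2 3 4
toy_new$running    # 1 4 10 20
```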

  • summarise your data: summarize()

    - if you want to create aggregations and summaries of your data
    - you can use summarize(data, newvar = f(oldvar))
    - it works similarly to mutate() but it returns a scalar

    summarize(df, meandelay = mean(dep_delay, na.rm = TRUE))

    ## # A tibble: 1 x 1
    ##   meandelay
    ##       <dbl>
    ## 1  12.63907

  • summarize() properties

    - you can do more than one summary in one go

    summarize(df,
              meandelay = mean(dep_delay, na.rm = TRUE),
              sddelay = sd(dep_delay, na.rm = TRUE),
              meandeptime = mean(dep_time, na.rm = TRUE))

    ## # A tibble: 1 x 3
    ##   meandelay  sddelay meandeptime
    ##       <dbl>    <dbl>       <dbl>
    ## 1  12.63907 40.21006     1349.11

  • making summarize() more useful: group_by()

    summarize() on its own is not so useful: it returns just one value

    - but it can be extremely useful if we can group data by a special variable, to see, e.g., monthly values
    - this is what you do with group_by()
    - which groups the data frame by levels of one (or more) variables
    - nothing changes but now R knows this is a grouped data frame

    group_by(df, month)

    ## # A tibble: 336,776 x 19
    ## # Groups:   month [12]
    ##     year month   day dep_time sched_dep_time dep_delay arr_time
    ##    <int> <int> <int>    <int>          <int>     <dbl>    <int>
    ##  1  2013     1     1      517            515         2      830
    ##  2  2013     1     1      533            529         4      850
    ##  3  2013     1     1      542            540         2      923
    ##  4  2013     1     1      544            545        -1     1004
    ##  5  2013     1     1      554            600        -6      812
    ##  6  2013     1     1      554            558        -4      740
    ##  7  2013     1     1      555            600        -5      913
    ##  8  2013     1     1      557            600        -3      709
    ##  9  2013     1     1      557            600        -3      838
    ## 10  2013     1     1      558            600        -2      753
    ## # ... with 336,766 more rows, and 12 more variables: sched_arr_time <int>,
    ## #   arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
    ## #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
    ## #   minute <dbl>, time_hour <dttm>

  • example: delay by month

    - output has 12 rows

    df_bymonth
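The assignment on this slide is truncated in the transcript. A sketch that produces a 12-row result, assuming the summary was the mean departure delay (the column name meandelay is borrowed from the earlier summarize() slide):

```r
library(dplyr)
library(nycflights13)

df <- flights  # assumed, as in the later slides

# Group first, then summarize: one output row per month.
df_bymonth <- summarize(group_by(df, month),
                        meandelay = mean(dep_delay, na.rm = TRUE))

nrow(df_bymonth)  # 12
```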

  • example: delay by month and day

    - output has 365 rows

    df_bymonth_byday
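This assignment is truncated too. A sketch under the same assumption (mean departure delay as the summary), now grouping by two variables:

```r
library(dplyr)
library(nycflights13)

df <- flights  # assumed, as in the later slides

# One output row per (month, day) combination.
df_bymonth_byday <- summarize(group_by(df, month, day),
                              meandelay = mean(dep_delay, na.rm = TRUE))

nrow(df_bymonth_byday)  # 365
```

Recent dplyr versions print a message saying the result is still grouped by month; that is informational and does not change the values.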

  • Exercise: a complex example

    what is the speed of flights departing around midday (11 - 13), by month?

    - the input is the flights df
    - use filter(), select(), mutate(), group_by() and summarize()
    - the final output is a df with 2 variables (month and speed) and 12 observations (one for each month)

    your turn!

  • A somewhat more complex example

    - let’s combine filter(), select(), mutate(), etc.
    - we want to know the average speed of flights departing around midday, by month

    df_midday <- filter(df, dep_time > 1100 & dep_time < 1300)
    df_midday_reduced
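The code on this slide is incomplete in the transcript. Based on the piped version shown later, which the slides describe as the same code, the step-by-step form was probably close to the following sketch. Only df_midday and df_midday_reduced are names from the slide; the other intermediate names are assumptions.

```r
library(dplyr)
library(nycflights13)

df <- flights  # assumed, as in the later slides

df_midday         <- filter(df, dep_time > 1100 & dep_time < 1300)
df_midday_reduced <- select(df_midday, month, air_time, distance)
# The names below are hypothetical; the transcript does not preserve them.
df_midday_speed   <- mutate(df_midday_reduced, speed = distance / air_time * 60)
df_midday_grouped <- group_by(df_midday_speed, month)
df_midday_result  <- summarize(df_midday_grouped,
                               meanspeed = mean(speed, na.rm = TRUE))

df_midday_result  # 12 rows: one mean speed per month
```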

  • Multiple operations: problems

    the code in the previous slide has several problems

    - it forces you to create several different data frames
    - this clutters your environment
    - but not only: at each step, you have to refer to the right data frame
    - else you make mistakes
    - hard to find proper names
    - easy to make errors
    - what can we do?

  • Enter the pipe! %>%

    the pipe operator

    - feeds the result of a command as the first argument of the following command

    2*2

    ## [1] 4

    exp(4)

    ## [1] 54.59815

    (2*2) %>% exp()

    ## [1] 54.59815

    - since dplyr verbs always return a data frame, and always have a data frame as first input, this saves your life

  • the same code with the pipe

    df_midday %>%
      filter(dep_time > 1100 & dep_time < 1300) %>%
      select(month, air_time, distance) %>%
      mutate(speed = distance / air_time * 60) %>%
      group_by(month) %>%
      summarise(meanspeed = mean(speed, na.rm = TRUE))

    ## # A tibble: 12 x 2
    ##    month meanspeed
    ##    <int>     <dbl>
    ##  1     1  365.4416
    ##  2     2  368.8783
    ##  3     3  385.2398
    ##  4     4  384.5480
    ##  5     5  400.5198
    ##  6     6  396.5697
    ##  7     7  400.5444
    ##  8     8  397.3675
    ##  9     9  406.9074
    ## 10    10  392.7710
    ## 11    11  379.3145
    ## 12    12  373.4685

  • Learning to use the pipe

    - what is the mean delay of flights by carrier, for each month?
    - what is the maximum departure delay for each of the three NYC airports, for each day?
    - what is the mean air time for each of these three airports? from which airport do the longer-haul flights depart?

  • combining the pipe with a plot

    you can add a ggplot at the end of the pipe

    - ggplot layers are not combined with the pipe: ggplot uses + as we know
    - so it must come last in the pipe
    - try to draw barplots showing graphically the results of the questions in the previous slide
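A minimal sketch of a pipe that ends in a plot. The summary chosen here (mean departure delay by month) is an illustrative assumption and not one of the exercise answers; it starts from flights directly so that it runs on its own.

```r
library(dplyr)
library(ggplot2)
library(nycflights13)

p <- flights %>%
  group_by(month) %>%
  summarise(meandelay = mean(dep_delay, na.rm = TRUE)) %>%
  # The summarised data frame is piped in as ggplot()'s data argument;
  # from here on, layers are added with + and not with %>%.
  ggplot(aes(x = factor(month), y = meandelay)) +
  geom_col()

p  # draws the bar chart
```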
