Hadley Wickham @hadleywickham Chief Scientist, RStudio
Pipelines for data analysis in R
October 2015
Data analysis is the process by which data becomes
understanding, knowledge and insight
Import
Tidy: consistent way of storing data
Transform: create new variables & new summaries
Visualise: surprises, but doesn't scale
Model: scales, but doesn't (fundamentally) surprise
Import: readr, readxl, haven, DBI, httr
Tidy: tidyr
Transform: dplyr
Visualise: ggplot2, ggvis
Model: broom
Pipelines
Think it (cognitive) -> Describe it (precisely) -> Do it (computational)
Cognition time ≫ Computation time
http://www.flickr.com/photos/mutsmuts/4695658106
magrittr::%>%
Inspirations: Unix, F#, Haskell, Clojure, method chaining
foo_foo <- little_bunny()
bop_on(
  scoop_up(
    hop_through(foo_foo, forest),
    field_mouse
  ),
  head
)

# vs
foo_foo %>%
  hop_through(forest) %>%
  scoop_up(field_mouse) %>%
  bop_on(head)
x %>% f(y) # f(x, y)
x %>% f(z, .) # f(z, x)
x %>% f(y) %>% g(z) # g(f(x, y), z)
# Turns function composition (hard to read)
# into sequence (easy to read)
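The three rewriting rules above can be checked directly. This is a minimal sketch: `f` and `g` are hypothetical helpers (not from the talk), chosen so that argument order visibly matters.

```r
library(magrittr)

# Hypothetical helpers: subtraction and multiplication make argument order visible.
f <- function(a, b) a - b
g <- function(a, b) a * b

x <- 10
y <- 3
z <- 2

first_arg   <- x %>% f(y)            # rewritten to f(x, y)
placeholder <- x %>% f(z, .)         # rewritten to f(z, x)
chained     <- x %>% f(y) %>% g(z)   # rewritten to g(f(x, y), z)

# Each piped form agrees with its nested equivalent.
stopifnot(
  first_arg   == f(x, y),
  placeholder == f(z, x),
  chained     == g(f(x, y), z)
)
```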
# Any function can use it. Only needs a simple
# property: the type of the first argument
# needs to be the same as the type of the result.
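To illustrate that property, here is a small sketch with three hypothetical functions (`drop_missing`, `clamp`, `rescale01` are invented for this example). Each takes a numeric vector as its first argument and returns a numeric vector, so they chain without any extra glue.

```r
library(magrittr)

# Hypothetical steps: numeric vector in (first argument), numeric vector out.
drop_missing <- function(x) x[!is.na(x)]
clamp        <- function(x, lower, upper) pmin(pmax(x, lower), upper)
rescale01    <- function(x) (x - min(x)) / (max(x) - min(x))

raw <- c(5, NA, -20, 15, 40)

cleaned <- raw %>%
  drop_missing() %>%              # 5, -20, 15, 40
  clamp(lower = 0, upper = 20) %>%  # 5, 0, 15, 20
  rescale01()                     # rescaled to the range [0, 1]
```

A function that returned, say, a list or a model object would still work as the last step of a pipeline, but nothing expecting a numeric vector could follow it.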
# tidyr: pipelines for messy -> tidy data
# dplyr: pipelines for data manipulation
# ggvis: pipelines for visualisations
# rvest: pipelines for html
# purrr: pipelines for lists
# xml2: pipelines for xml
# stringr: pipelines for strings
Tidy
Storage         Meaning
Table / File    Data set
Rows            Observations
Columns         Variables
Tidy data = data that makes data analysis easy
Source: local data frame [5,769 x 22]

    iso2  year   m04  m514  m014 m1524 m2534 m3544 m4554 m5564   m65    mu   f04  f514
   (chr) (int) (int) (int) (int) (int) (int) (int) (int) (int) (int) (int) (int) (int)
1     AD  1989    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
2     AD  1990    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
3     AD  1991    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
4     AD  1992    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
5     AD  1993    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
6     AD  1994    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
7     AD  1996    NA    NA     0     0     0     4     1     0     0    NA    NA    NA
8     AD  1997    NA    NA     0     0     1     2     2     1     6    NA    NA    NA
9     AD  1998    NA    NA     0     0     0     1     0     0     0    NA    NA    NA
10    AD  1999    NA    NA     0     0     0     1     1     0     0    NA    NA    NA
11    AD  2000    NA    NA     0     0     1     0     0     0     0    NA    NA    NA
12    AD  2001    NA    NA     0    NA    NA     2     1    NA    NA    NA    NA    NA
13    AD  2002    NA    NA     0     0     0     1     0     0     0    NA    NA    NA
14    AD  2003    NA    NA     0     0     0     1     2     0     0    NA    NA    NA
15    AD  2004    NA    NA     0     0     0     1     1     0     0    NA    NA    NA
16    AD  2005     0     0     0     0     1     1     0     0     0     0     0     0
..   ...   ...   ...   ...   ...   ...   ...   ...   ...   ...   ...   ...   ...   ...
Variables not shown: f014 (int), f1524 (int), f2534 (int), f3544 (int),
  f4554 (int), f5564 (int), f65 (int), fu (int)
What are the variables in this dataset? (Hint: f = female, u = unknown, 1524 = 15-24)
# To convert this messy data into tidy data
# we need two verbs. First we need to gather
# together all the columns that aren't variables
tb2 <- tb %>%
  gather(demo, n, -iso2, -year, na.rm = TRUE)
tb2

# Then separate the demographic variable into
# sex and age
tb3 <- tb2 %>%
  separate(demo, c("sex", "age"), 1)
tb3
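The `tb` data itself is not included in this transcript, so the sketch below runs the same two verbs on a tiny hypothetical stand-in, `tb_small`, which keeps only two of the demographic columns. It is meant to show the shape of the result, not to reproduce the talk's output.

```r
library(tidyr)
library(magrittr)

# Hypothetical miniature of the TB table: two demographic columns only.
tb_small <- data.frame(
  iso2  = c("AD", "AD"),
  year  = c(1996L, 1997L),
  m3544 = c(4L, 2L),
  f014  = c(NA, 1L),
  stringsAsFactors = FALSE
)

tb_tidy <- tb_small %>%
  # Every column except iso2 and year is a value of `demo`, not a variable.
  gather(demo, n, -iso2, -year, na.rm = TRUE) %>%
  # Split after the first character: "m3544" becomes sex "m" and age "3544".
  separate(demo, c("sex", "age"), 1)

tb_tidy
```

Each remaining row is now one observation (country, year, sex, age group) with its count in `n`; the row with a missing count has been dropped by `na.rm = TRUE`.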
# Many tidyr verbs come in pairs:
# spread vs. gather
# extract/separate vs. unite
# nest vs. unnest
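As a rough illustration of the pairing, the sketch below round-trips two hypothetical data frames (`wide` and `parts` are invented here): `spread()` undoes `gather()`, and `separate()` undoes `unite()`.

```r
library(tidyr)
library(magrittr)

# Hypothetical wide data: one column per measurement.
wide <- data.frame(id = 1:2, x = c(1.5, 2.5), y = c(10, 20))

long <- wide %>% gather(key, value, -id)   # wide -> long: 4 rows
back <- long %>% spread(key, value)        # long -> wide again

# unite() pastes columns together; separate() splits them apart again.
parts   <- data.frame(sex = "m", age = "1524", stringsAsFactors = FALSE)
united  <- parts %>% unite(demo, sex, age, sep = "")
divided <- united %>% separate(demo, c("sex", "age"), 1)
```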
Google for “tidyr” & “tidy data”
Transform
One table verbs
• select: subset variables by name
• filter: subset observations by value
• mutate: add new variables
• summarise: reduce to a single obs
• arrange: re-order the observations
+ group by
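The demo code is not part of this transcript. As a stand-in, here is a minimal sketch that uses all five verbs plus `group_by()` on a small hypothetical data frame, `homes`, whose values are invented for illustration.

```r
library(dplyr)

# Hypothetical data, loosely in the spirit of the housing data used later.
homes <- data.frame(
  city     = c("Austin", "Austin", "Dallas", "Dallas", "Houston"),
  year     = c(2014L, 2015L, 2014L, 2015L, 2015L),
  sales    = c(100, 120, 200, 180, 50),
  listings = c(300, 310, 500, 480, 90),
  stringsAsFactors = FALSE
)

by_city <- homes %>%
  select(city, year, sales) %>%          # subset variables by name
  filter(sales >= 100) %>%               # subset observations by value
  mutate(log_sales = log(sales)) %>%     # add a new variable
  group_by(city) %>%                     # following verbs work per city
  summarise(
    total    = sum(sales),               # reduce each group to a single row
    mean_log = mean(log_sales)
  ) %>%
  arrange(desc(total))                   # re-order the observations

by_city
```

Houston is removed by the filter, so the result has one row each for Dallas and Austin, largest total first.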
Demo
Mutating: inner_join(), left_join(), right_join(), full_join()
Filtering: semi_join(), anti_join()
Set: intersect(), union(), setdiff()
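The three families differ mainly in which rows and columns survive. A small sketch on hypothetical tables (`x`, `y`, `p`, `q` are invented for this example) makes the difference countable:

```r
library(dplyr)

# Hypothetical tables sharing a `key` column; keys 2 and 3 appear in both.
x <- data.frame(key = c(1, 2, 3), a = c("a1", "a2", "a3"), stringsAsFactors = FALSE)
y <- data.frame(key = c(2, 3, 4), b = c("b2", "b3", "b4"), stringsAsFactors = FALSE)

# Mutating joins add columns from y to x.
inner <- inner_join(x, y, by = "key")   # keys in both tables
left  <- left_join(x, y, by = "key")    # every row of x
right <- right_join(x, y, by = "key")   # every row of y
full  <- full_join(x, y, by = "key")    # every key from either table

# Filtering joins keep or drop rows of x and never add columns.
semi <- semi_join(x, y, by = "key")     # rows of x with a match in y
anti <- anti_join(x, y, by = "key")     # rows of x without a match in y

# Set operations compare whole rows of tables with identical columns.
p <- data.frame(v = 1:3)
q <- data.frame(v = 2:4)
both   <- intersect(p, q)
either <- union(p, q)
only_p <- setdiff(p, q)
```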
dplyr sources
• Local data frame (C++)
• Local data table
• Local data cube (experimental)
• RDBMS: Postgres, MySQL, SQLite, Oracle, MS SQL, JDBC, Impala
• MonetDB, BigQuery
Google for “dplyr”
Visualise
What is ggvis?
• A grammar of graphics (like ggplot2)
• Reactive (interactive & dynamic) (like shiny)
• A pipeline (a la dplyr)
• Of the web (drawn with vega)
Demo: 4-ggvis.R, 4-ggvis.Rmd
Google for “ggvis”
Model with broom, by David Robinson
[Figure: log(sales) against date, 1990 to 2015]
46 TX cities, ~25 years of data
What makes it hard to see the long term trend?
# Models are useful as a tool for removing
# known patterns
tx <- tx %>%
  group_by(city) %>%
  mutate(
    resid = lm(
      log(sales) ~ factor(month),
      na.action = na.exclude
    ) %>% resid()
  )
[Figure: resid against date, 1990 to 2015]
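The `tx` data is not included in this transcript, so the sketch below applies the same per-city residual pattern to a hypothetical data frame, `toy`, with one deliberately missing value. It is meant to show why `na.action = na.exclude` matters: `resid()` then returns `NA` in the missing position, so the result keeps the length that `mutate()` requires.

```r
library(dplyr)

# Hypothetical stand-in for `tx`: 2 cities x 2 years x 3 months.
toy <- expand.grid(
  month = 1:3,
  year  = 2014:2015,
  city  = c("A", "B"),
  stringsAsFactors = FALSE
)
toy$sales <- 100 + 10 * toy$month + 5 * (toy$year - 2014) +
  ifelse(toy$city == "B", 50, 0)
toy$sales[2] <- NA  # one missing observation (city A, month 2, 2014)

toy <- toy %>%
  group_by(city) %>%
  mutate(
    # Fit the monthly pattern within each city and keep what it leaves over.
    resid = lm(
      log(sales) ~ factor(month),
      na.action = na.exclude
    ) %>% resid()
  ) %>%
  ungroup()
```

With the default `na.omit`, the residual vector for city A would be one element short and `mutate()` would fail with a length error.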
# Models are also useful in their own right
models <- tx %>%
  group_by(city) %>%
  do(mod = lm(
    log(sales) ~ factor(month),
    data = .,
    na.action = na.exclude
  ))
Model summaries
• Model level: one row per model
• Coefficient level: one row per coefficient (per model)
• Observation level: one row per observation (per model)
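The three levels correspond to broom's `glance()`, `tidy()` and `augment()`. The sketch below uses a single model on the built-in `mtcars` data as a stand-in for the per-city fits, since the demo script is not part of this transcript.

```r
library(broom)

# One model on a built-in dataset; a stand-in for the per-city models.
mod <- lm(mpg ~ wt, data = mtcars)

model_level       <- glance(mod)   # one row per model
coefficient_level <- tidy(mod)     # one row per coefficient
observation_level <- augment(mod)  # one row per observation
```

Because each summary is an ordinary data frame, results from many models can be stacked and fed back into dplyr or ggplot2.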
Demo: 5-broom.R
Google for “broom r”
Big data and R
Big       Can’t fit in memory on one computer: >5 TB
Medium    Fits in memory on a server: 10 GB-5 TB
Small     Fits in memory on a laptop: <10 GB
R is great at this!
R
• R provides an excellent environment for rapid interactive exploration of small data.
• There is no technical reason why it can’t also work well with medium size data. (But the work mostly hasn’t been done)
• What about big data?
1. Can be reduced to a small data problem with subsetting/sampling/summarising (90%)
2. Can be reduced to a very large number of small data problems (9%)
3. Is irreducibly big (1%)
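The first case can be sketched with dplyr. The data frame `big` below is a hypothetical, simulated stand-in (it is of course small enough to fit in memory); the point is only the three reductions. `slice_sample()` assumes dplyr 1.0.0 or later.

```r
library(dplyr)

set.seed(2015)
# Hypothetical stand-in for a table too large to explore comfortably.
big <- data.frame(
  city  = sample(c("Austin", "Dallas", "Houston"), 1e5, replace = TRUE),
  sales = rlnorm(1e5, meanlog = 5),
  stringsAsFactors = FALSE
)

one_city   <- big %>% filter(city == "Austin")   # subsetting
sampled    <- big %>% slice_sample(n = 1000)     # sampling
summarised <- big %>%                            # summarising
  group_by(city) %>%
  summarise(n = n(), mean_sales = mean(sales))
```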
The right small data
• Rapid iteration essential
• dplyr supports this activity by avoiding cognitive costs of switching between languages.
Lots of small problems
• Embarrassingly parallel (e.g. Hadoop)
• R wrappers like foreach, rhipe, rhadoop
• The challenge is matching the architecture of computing to data storage
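A minimal base-R sketch of the idea, on hypothetical data: split the table into independent chunks, solve each small problem with the same function, and combine the answers. It runs sequentially here; because no chunk depends on another, the apply step is the part that could be handed to parallel workers (for example with `parallel::mclapply()` on Unix-like systems).

```r
# Hypothetical data: one independent small problem per city.
homes <- data.frame(
  city  = rep(c("A", "B", "C"), each = 4),
  month = rep(1:4, times = 3),
  sales = c(10, 12, 14, 16,  20, 19, 18, 17,  5, 5, 5, 5)
)

# Split: one data frame per city.
chunks <- split(homes, homes$city)

# Apply: the same small-data computation on each chunk (here, a trend slope).
fit_slope <- function(chunk) {
  coef(lm(sales ~ month, data = chunk))[["month"]]
}

# Combine: one number per city, named by city.
slopes <- vapply(chunks, fit_slope, numeric(1))
```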
Irreducibly big
• Computation must be performed by specialised system.
• Typically C/C++, Fortran, Scala.
• R needs to be able to talk to those systems.
Future work
End game
Provide a fluent interface where you spend your mental energy on the specific data problem, not the general data analysis process.
The best tools become invisible with time!
Still a lot of work to do, especially on the connection between modelling and visualisation.