Pipelines for data analysis in R · files.meetup.com/1406240/pipelines.pdf

transcript

Hadley Wickham @hadleywickham Chief Scientist, RStudio

Pipelines for data analysis in R

October 2015

Data analysis is the process by which data becomes understanding, knowledge and insight

[Diagram: the data analysis workflow. Import, Tidy, Transform, Visualise, Model]

Visualise: surprises, but doesn't scale
Model: scales, but doesn't (fundamentally) surprise

Tidy: consistent way of storing data
Transform: create new variables & new summaries

Import: readr, readxl, haven, DBI, httr
Tidy: tidyr
Transform: dplyr
Visualise: ggplot2, ggvis

Pipelines

Think it → Describe it (precisely) → Do it

Cognitive | Computational

Cognition time ≫ Computation time

http://www.flickr.com/photos/mutsmuts/4695658106

magrittr::%>%

Inspirations: unix, F#, haskell, clojure, method chaining

foo_foo <- little_bunny()

bop_on(
  scoop_up(
    hop_through(foo_foo, forest),
    field_mouse
  ),
  head
)

# vs
foo_foo %>%
  hop_through(forest) %>%
  scoop_up(field_mouse) %>%
  bop_on(head)

x %>% f(y) # f(x, y)

x %>% f(z, .) # f(z, x)

x %>% f(y) %>% g(z) # g(f(x, y), z)

# Turns function composition (hard to read)
# into sequence (easy to read)

# Any function can use it. Only needs a simple
# property: the type of the first argument
# needs to be the same as the type of the result.

# tidyr: pipelines for messy -> tidy data
# dplyr: pipelines for data manipulation
# ggvis: pipelines for visualisations
# rvest: pipelines for html
# purrr: pipelines for lists
# xml2: pipelines for xml
# stringr: pipelines for strings
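The three pipe rules above can be checked directly. This is a minimal sketch; `f` and `g` are hypothetical helpers invented for illustration, not functions from any of the packages listed.

```r
library(magrittr)

# Hypothetical helpers: each takes its main input first and returns
# the same type (a string), so calls can be chained with %>%.
f <- function(a, b) paste0(a, b)
g <- function(a, b) paste0(a, "-", b)

r1 <- "x" %>% f("y")            # same as f("x", "y")
r2 <- "x" %>% f("z", .)         # same as f("z", "x")
r3 <- "x" %>% f("y") %>% g("z") # same as g(f("x", "y"), "z")

r1  # "xy"
r2  # "zx"
r3  # "xy-z"
```

Because `f` and `g` both accept a string first and return a string, either one can follow the other in a pipeline; that is the "simple property" the slide refers to.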

Tidy

Storage        Meaning
Table / File   Data set
Rows           Observations
Columns        Variables

Tidy data = data that makes data analysis easy

Source: local data frame [5,769 x 22]

    iso2  year   m04  m514  m014 m1524 m2534 m3544 m4554 m5564   m65    mu   f04  f514
   (chr) (int) (int) (int) (int) (int) (int) (int) (int) (int) (int) (int) (int) (int)
1     AD  1989    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
2     AD  1990    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
3     AD  1991    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
4     AD  1992    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
5     AD  1993    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
6     AD  1994    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
7     AD  1996    NA    NA     0     0     0     4     1     0     0    NA    NA    NA
8     AD  1997    NA    NA     0     0     1     2     2     1     6    NA    NA    NA
9     AD  1998    NA    NA     0     0     0     1     0     0     0    NA    NA    NA
10    AD  1999    NA    NA     0     0     0     1     1     0     0    NA    NA    NA
11    AD  2000    NA    NA     0     0     1     0     0     0     0    NA    NA    NA
12    AD  2001    NA    NA     0    NA    NA     2     1    NA    NA    NA    NA    NA
13    AD  2002    NA    NA     0     0     0     1     0     0     0    NA    NA    NA
14    AD  2003    NA    NA     0     0     0     1     2     0     0    NA    NA    NA
15    AD  2004    NA    NA     0     0     0     1     1     0     0    NA    NA    NA
16    AD  2005     0     0     0     0     1     1     0     0     0     0     0     0
..   ...   ...   ...   ...   ...   ...   ...   ...   ...   ...   ...   ...   ...   ...
Variables not shown: f014 (int), f1524 (int), f2534 (int), f3544 (int),
  f4554 (int), f5564 (int), f65 (int), fu (int)

What are the variables in this dataset? (Hint: f = female, u = unknown, 1524 = 15-24)

# To convert this messy data into tidy data
# we need two verbs. First we need to gather
# together all the columns that aren't variables

tb2 <- tb %>%
  gather(demo, n, -iso2, -year, na.rm = TRUE)
tb2

# Then separate the demographic variable into
# sex and age
tb3 <- tb2 %>%
  separate(demo, c("sex", "age"), 1)
tb3
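To try the two verbs without the full dataset, here is a small sketch on made-up data. The toy `tb` below is hypothetical: it only imitates the column layout of the slide's table, and its values are invented.

```r
library(tidyr)
library(dplyr)

# Hypothetical toy data in the same shape as the slide's `tb`
# (values are invented for illustration).
tb <- data.frame(
  iso2 = c("AD", "AD"),
  year = c(2004L, 2005L),
  m014 = c(0L, 0L),
  f014 = c(NA, 0L),
  stringsAsFactors = FALSE
)

tb3 <- tb %>%
  # Stack every column except iso2 and year into (demo, n) pairs,
  # dropping rows where the count is missing.
  gather(demo, n, -iso2, -year, na.rm = TRUE) %>%
  # Split "m014" after the first character: sex = "m", age = "014".
  separate(demo, c("sex", "age"), 1)

tb3
```

The result has one row per country, year, sex and age group, with columns `iso2`, `year`, `sex`, `age` and `n`.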

# Many tidyr verbs come in pairs:
# spread vs. gather
# extract/separate vs. unite
# nest vs. unnest

Google for “tidyr” & “tidy data”

Transform

Think it → Describe it (precisely) → Do it

Cognitive | Computational

One table verbs

• select: subset variables by name
• filter: subset observations by value
• mutate: add new variables
• summarise: reduce to a single obs
• arrange: re-order the observations

+ group by
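As a rough illustration of how the five verbs and `group_by()` chain together, here is a sketch on R's built-in `mtcars` data. The dataset, the `wt < 4` cutoff and the `kpl` variable are choices made for this example and do not come from the talk.

```r
library(dplyr)

by_cyl <- mtcars %>%
  select(mpg, cyl, wt) %>%            # subset variables by name
  filter(wt < 4) %>%                  # subset observations by value
  mutate(kpl = mpg * 0.425) %>%       # add a new variable (approx. km per litre)
  group_by(cyl) %>%                   # one group per cylinder count
  summarise(                          # reduce each group to a single row
    n = n(),
    mean_kpl = mean(kpl)
  ) %>%
  arrange(desc(mean_kpl))             # re-order the rows

by_cyl
```

Each step takes a data frame and returns a data frame, which is what lets the verbs be piped one after another.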

Mutating: inner_join(), left_join(), right_join(), full_join()

Filtering: semi_join(), anti_join()

Set: intersect(), union(), setdiff()
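The difference between the three families is easiest to see on two tiny tables. The tables `x` and `y` below are hypothetical and exist only for this sketch.

```r
library(dplyr)

# Hypothetical toy tables sharing a `key` column.
x <- data.frame(key = c("a", "b", "c"), x_val = 1:3,
                stringsAsFactors = FALSE)
y <- data.frame(key = c("a", "b", "d"), y_val = c(10, 20, 40),
                stringsAsFactors = FALSE)

# Mutating joins add columns from y to x.
inner <- inner_join(x, y, by = "key")  # keys present in both tables
left  <- left_join(x, y, by = "key")   # every row of x; y_val is NA for "c"
full  <- full_join(x, y, by = "key")   # every key from either table

# Filtering joins keep or drop rows of x, without adding columns.
semi <- semi_join(x, y, by = "key")    # rows of x with a match in y
anti <- anti_join(x, y, by = "key")    # rows of x with no match in y

# Set operations treat whole rows as set elements.
common <- intersect(x["key"], y["key"])
only_x <- setdiff(x["key"], y["key"])
```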

dplyr sources
• Local data frame (C++)
• Local data table
• Local data cube (experimental)
• RDBMS: Postgres, MySQL, SQLite, Oracle, MS SQL, JDBC, Impala
• MonetDB, BigQuery

Google for “dplyr”

Visualise


What is ggvis?
• A grammar of graphics (like ggplot2)
• Reactive (interactive & dynamic) (like shiny)
• A pipeline (a la dplyr)
• Of the web (drawn with vega)
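A minimal sketch of the pipeline style in ggvis, assuming the ggvis package is installed. The choice of `mtcars` and of the `wt` and `mpg` variables is made for this example and is not taken from the talk's demo files.

```r
library(ggvis)

# Build the plot specification step by step with the pipe:
# start from data, map variables, then add layers.
p <- mtcars %>%
  ggvis(~wt, ~mpg) %>%   # map weight to x and miles per gallon to y
  layer_points() %>%     # scatterplot layer
  layer_smooths()        # smoothed trend line on top

# Printing `p` in an interactive session renders it in the viewer or browser.
```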

Demo: 4-ggvis.R, 4-ggvis.Rmd

Google for “ggvis”

Model with broom, by David Robinson


[Plot: log(sales) against date, 1990 to 2015]

46 TX cities, ~25 years of data

What makes it hard to see the long term trend?

# Models are useful as a tool for removing
# known patterns

tx <- tx %>%
  group_by(city) %>%
  mutate(
    resid = lm(
      log(sales) ~ factor(month),
      na.action = na.exclude
    ) %>% resid()
  )
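The same idea can be tried on simulated data. In the sketch below, `toy` is a hypothetical stand-in for the slide's `tx`: two made-up cities with an invented monthly cycle and a slow upward trend.

```r
library(dplyr)

# Hypothetical stand-in for `tx`: simulated sales with a seasonal
# cycle, a slow yearly trend and a little noise.
set.seed(1)
toy <- expand.grid(
  city = c("A", "B"),
  year = 2000:2004,
  month = 1:12,
  stringsAsFactors = FALSE
)
toy$sales <- exp(
  5 +
    0.3 * sin(2 * pi * toy$month / 12) +  # monthly pattern
    0.05 * (toy$year - 2000) +            # long term trend
    rnorm(nrow(toy), sd = 0.01)           # noise
)

# Fit one model per city and keep what the month effect cannot explain.
toy <- toy %>%
  group_by(city) %>%
  mutate(
    resid = resid(lm(log(sales) ~ factor(month), na.action = na.exclude))
  ) %>%
  ungroup()
```

With the monthly pattern removed, the residuals average to zero within each city and month, so what remains is mostly the long term trend plus noise.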

[Plot: date axis, 1990 to 2015]

# Models are also useful in their own right

models <- tx %>%
  group_by(city) %>%
  do(mod = lm(
    log(sales) ~ factor(month),
    data = .,
    na.action = na.exclude
  ))

Model summaries
• Model level: one row per model
• Coefficient level: one row per coefficient (per model)
• Observation level: one row per observation (per model)
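The three levels above correspond to three broom functions: `glance()`, `tidy()` and `augment()`. Here is a minimal sketch on a single model fitted to the built-in `mtcars` data; the model itself is chosen only for illustration.

```r
library(broom)

mod <- lm(mpg ~ wt, data = mtcars)

model_level <- glance(mod)   # one row per model: R squared, AIC, ...
coef_level  <- tidy(mod)     # one row per coefficient: estimate, std.error, ...
obs_level   <- augment(mod)  # one row per observation: fitted values, residuals, ...
```

Each function returns a data frame, so the results can be passed on to dplyr and the plotting packages like any other table.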

Demo: 5-broom.R

Google for “broom r”

Big data and R

Big: can’t fit in memory on one computer (>5 TB)

Medium: fits in memory on a server (10 GB-5 TB)

Small: fits in memory on a laptop (<10 GB)

R is great at this!

R

• R provides an excellent environment for rapid interactive exploration of small data.

• There is no technical reason why it can’t also work well with medium size data. (But the work mostly hasn’t been done)

• What about big data?

1. Can be reduced to a small data problem with subsetting/sampling/summarising (90%)

2. Can be reduced to a very large number of small data problems (9%)

3. Is irreducibly big (1%)

The right small data

• Rapid iteration essential
• dplyr supports this activity by avoiding cognitive costs of switching between languages.

Lots of small problems

• Embarrassingly parallel (e.g. Hadoop)
• R wrappers like foreach, rhipe, rhadoop
• The challenge is matching the architecture of computing to data storage
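On a single machine, the "many small problems" pattern looks roughly like the sketch below, written in base R on the built-in `mtcars` data. It uses a plain `lapply()`; this illustrates the shape of the computation and is not how Hadoop or the wrappers named above are actually driven.

```r
# Split one table into independent pieces, solve each piece on its own,
# then combine the small results.
pieces <- split(mtcars, mtcars$cyl)

fits <- lapply(pieces, function(d) {
  coef(lm(mpg ~ wt, data = d))   # each piece is an ordinary small-data problem
})

result <- do.call(rbind, fits)   # one row of coefficients per piece
result
```

Because no piece depends on any other, the `lapply()` call could in principle be swapped for a parallel equivalent such as `parallel::mclapply()` without changing the rest of the code.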

Irreducibly big

• Computation must be performed by specialised system.
• Typically C/C++, Fortran, Scala.
• R needs to be able to talk to those systems.

Future work

End game

Provide a fluent interface where you spend your mental energy on the specific data problem, not the general data analysis process.

The best tools become invisible with time!

Still a lot of work to do, especially on the connection between modelling and visualisation.
