
Data Hacking with RHadoop

Date post: 11-May-2015
Upload: ed-kohlwey
Description:
RHadoop is an effective platform for doing exploratory data analysis over big data sets. The convenience of an interactive command-line interpreter and the overwhelming number of statistical and machine learning routines implemented in R libraries make it a highly effective environment for elementary data science. We'll discuss the basics of RHadoop: what it is, how to install it, and the API fundamentals. Next we'll discuss common use cases that you might want to use RHadoop for. Last, we'll run through an interactive example.
Transcript
Page 1: Data Hacking with RHadoop

Using R and Hadoop to do large-scale data science

RHadoop Data Hacking

Page 2: Data Hacking with RHadoop

•  Predict X?
   – The outcome of a future event
   – Who is likely to do something
   – Genetic factors leading to disease

•  Pre-filter things so humans can accomplish more?

•  Do all of this faster and better?

Would You Like to…

This document is company confidential and is intended solely for the use and information of Booz Allen Hamilton

Page 3: Data Hacking with RHadoop

•  R is a fantastic platform for data science
   – Has a peer-reviewed community and journal that vets libraries
   – (Mostly) intuitive language

•  Hadoop is the de-facto platform for parallel processing

•  Today, we'll be talking about rmr, but there are two more packages: rhbase and rhdfs

Why R and Hadoop?


Page 4: Data Hacking with RHadoop

•  Some of the most effective techniques for data mining are relatively old
   – Modern SVM dates back to '92
   – Logistic regression dates back to '44
   – Important elements of the algorithms date back to Newton

•  Accessibility and relevance have changed
   – Accessibility to data
   – Accessibility of computational power
   – Necessity of methods

Nothing Has Changed. Everything Has Changed.


Page 5: Data Hacking with RHadoop

•  R docs are written in their own language (using data frames, etc.) that is unfamiliar to computer scientists

•  R and CRAN documentation are more like old-school GNU than most Apache projects
   – Get used to Googling and using R's help() function

•  R's data management facilities are inconsistent
•  Streaming API isn't super fast
•  (get over it)

Some Criticisms of R & RHadoop


Page 6: Data Hacking with RHadoop

•  SNOW/SNOWFALL
   – Operates over MPI, Sockets, or PVM
   – No tie-in to a DFS (bad for data-intensive computing)
   – Handles matrix multiplication well (perhaps better)
   – Doesn't handle other non-trivial IPC well (basically for parallel linear algebra and simulations)

•  Rmpi
   – More code
   – All synchronization constructs are user-built (just like MPI)

Comparison to Other R Parallelism Frameworks


Page 7: Data Hacking with RHadoop

•  Others…
   – Only other Hadoop libraries have integration with HDFS / are appropriate for data-intensive computing

   – Only RHadoop supports local and cluster-based backends and has an intuitive interface that duplicates closures in the remote environment

   – Most environments are targeted towards modeling and simulation

Comparison to Other R Parallelism Frameworks


Page 8: Data Hacking with RHadoop

•  Install R
   – MacPorts – sudo port install r-framework
   – Ubuntu – sudo apt-get install r-base
   – RHEL – sudo yum install R

•  Install R dependencies (inside R)
   install.packages(c("Rcpp", "RJSONIO", "itertools", "digest"),
     repos="http://watson.nci.nih.gov/cran_mirror/")

•  Install RMR
   curl http://cloud.github.com/downloads/RevolutionAnalytics/RHadoop/rmr_1.3.1.tar.gz > rmr.tar.gz
   install.packages("rmr.tar.gz")  # from inside R, in the same directory

•  Configure the local backend each time you run R
   rmr.options.set(backend="local")

Installation – Local Workstation


Page 9: Data Hacking with RHadoop

•  Install R and all packages you plan on using (rmr, e1071, topicmodels, tm, etc.) on each node.

•  Use a compatible version of Hadoop 1 (1.0.3+ or CDH3+). Hadoop 2 may or may not work.

•  The example on the previous slide installs R packages in your home directory; you probably want to install them to the root install.

•  Configure environment variables
   export HADOOP_CMD=/usr/bin/hadoop
   export HADOOP_STREAMING=/usr/lib/hadoop/contrib/streaming/hadoop-streaming-<version>.jar

Installation – Cluster


Page 10: Data Hacking with RHadoop

The Curse of Dimensionality

•  The volume of the unit sphere tends towards 0 as the dimensionality of hyperspace increases

•  Intuitively, this means that there is more "slop room" for your dividing hyperplane to fall into

•  The amount of data we need to train a model rises with the feature space, tending towards infinity, making the problem untenable

•  With a small feature space, there is no need for lots of data

•  Thus, there is little point in using Hadoop to implement many classic machine learning models


[Chart: Volume of the Unit Ball vs. Dimensionality]
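The chart referenced above is easy to reproduce: the volume of the n-dimensional unit ball is pi^(n/2) / Gamma(n/2 + 1), which peaks near n = 5 and then falls toward zero. A quick sketch (Python here, purely for illustration; the deck itself works in R):

```python
import math

def unit_ball_volume(n):
    # V_n = pi^(n/2) / Gamma(n/2 + 1)
    return math.pi ** (n / 2) / math.gamma(n / 2 + 1)

for n in (1, 2, 3, 5, 10, 20, 50):
    print(n, unit_ball_volume(n))
# The volume rises until about n = 5, then shrinks rapidly toward 0,
# which is the "curse" the slide describes.
```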

Page 11: Data Hacking with RHadoop

•  Join
•  Sample
•  Model
•  Repeat

The Hadoop Data Science Flow


Page 12: Data Hacking with RHadoop

•  Put two pieces of data together using a common key

•  Scenario:
   – Data is in two flat files in HDFS
   – Turn rows into rows of key-value pairs, where the key is the join key and the value is the rest of the row
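The scenario above is the classic reduce-side join: both files are mapped to (join key, rest-of-row) pairs, and the shuffle groups matching keys together. A minimal sketch of the idea (in Python, for illustration only; rmr would express the map and reduce as R functions, and the sample rows are invented):

```python
from collections import defaultdict

def reduce_side_join(left, right):
    """Group (key, value) pairs from both inputs by key, then emit
    every left/right pairing per key: the classic MapReduce join."""
    groups = defaultdict(lambda: ([], []))
    for key, value in left:       # "map" output from file 1
        groups[key][0].append(value)
    for key, value in right:      # "map" output from file 2
        groups[key][1].append(value)
    # "Reduce": per key, combine the values from both sides.
    return [(key, lv, rv)
            for key, (lvs, rvs) in groups.items()
            for lv in lvs for rv in rvs]

users = [(1, "alice"), (2, "bob")]               # invented sample rows
orders = [(1, "book"), (1, "pen"), (3, "lamp")]
joined = reduce_side_join(users, orders)
# joined == [(1, "alice", "book"), (1, "alice", "pen")]
```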

Join


Page 13: Data Hacking with RHadoop

•  Take a sample of your (maybe) joined data
•  Most common method is probabilistic
•  Numerous other techniques can leverage partitions and randomness of the key hash

•  Scenarios (a precursor for):
   – Supervised learning/classification
   – Unsupervised learning/clustering
   – Regression
   – Distribution modeling
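The probabilistic method the slide mentions is just a Bernoulli trial per record in the map phase; there is nothing rmr-specific about it. A sketch in Python (for illustration; the record values are invented):

```python
import random

def map_side_sample(records, rate, seed=0):
    """Keep each record independently with probability `rate`:
    one Bernoulli trial per record, done entirely in the map phase."""
    rng = random.Random(seed)  # fixed seed for reproducibility
    return [rec for rec in records if rng.random() < rate]

records = range(100_000)
kept = map_side_sample(records, rate=0.1)
# len(kept) is close to 10,000 (binomial, mean = n * rate)
```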

Sample


Page 14: Data Hacking with RHadoop

•  Supervised learning: I want to predict something and I already know (some of) the answers. Also called classification and binary classification

•  Unsupervised learning: I want to find natural groupings in the data that I might not have known about

•  Regression, probability modeling – I want to fit a curve to my data

Model


Page 15: Data Hacking with RHadoop

•  Gain insight about the data
•  Change your procedure (select only outliers, etc.)
•  Gain more insight

Repeat


Page 16: Data Hacking with RHadoop

•  Work totally in R
•  Execute large, complex joins such as cross joins

RHadoop Impact: Join, Sample


Page 17: Data Hacking with RHadoop

•  Most algorithms work perfectly well (or better) over a sample of the data

•  Train and cross-validate a large number of models in parallel

•  Perform model selection in the reduce phase
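The train-in-map, select-in-reduce pattern above looks like this stripped to its essentials. A toy Python sketch, where the data and the constant-only "model" are invented purely for illustration:

```python
# "Map" trains one candidate per hyperparameter value;
# "reduce" keeps the candidate with the lowest validation error.
train_data = [1.0, 2.0, 3.0, 4.0]
valid_data = [2.0, 3.0]

def fit_constant(shrinkage):
    # A trivial "model": a shrunken mean of the training data.
    return shrinkage * sum(train_data) / len(train_data)

def validation_error(model):
    return sum((x - model) ** 2 for x in valid_data)

# Map phase: each task trains one model (trivially parallel).
candidates = [fit_constant(s) for s in (0.5, 0.8, 1.0, 1.2)]
# Reduce phase: model selection = argmin over validation error.
best = min(candidates, key=validation_error)
# best == 2.5, the plain training mean (shrinkage = 1.0)
```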

RHadoop Impact: Model


Page 18: Data Hacking with RHadoop

mapreduce(
  input,
  output = NULL,
  map = to.map(identity),
  reduce = NULL,
  combine = NULL,
  reduce.on.data.frame = FALSE,
  input.format = "native",
  output.format = "native",
  vectorized = list(map = FALSE, reduce = FALSE),
  structured = list(map = FALSE, reduce = FALSE),
  backend.parameters = list(),
  verbose = TRUE)
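To make the signature concrete, here is what the map and reduce arguments do conceptually: a tiny pure-Python stand-in for illustration, not the rmr implementation.

```python
from collections import defaultdict

def mapreduce_sketch(records, map_fn, reduce_fn=None):
    # Map phase: each record yields zero or more (key, value) pairs.
    grouped = defaultdict(list)
    for record in records:
        for key, value in map_fn(record):
            grouped[key].append(value)
    if reduce_fn is None:          # analogous to reduce = NULL above
        return dict(grouped)
    # Reduce phase: one call per key over all of that key's values.
    return {key: reduce_fn(key, values) for key, values in grouped.items()}

# Word count, the canonical example:
counts = mapreduce_sketch(
    ["big data", "big models"],
    map_fn=lambda doc: [(w, 1) for w in doc.split()],
    reduce_fn=lambda key, values: sum(values))
# counts == {"big": 2, "data": 1, "models": 1}
```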

RHadoop API


Page 19: Data Hacking with RHadoop

rmr.options.set(backend = c("hadoop", "local"),
  profile.nodes = NULL, vectorized.nrows = NULL)

to.dfs(object, output = dfs.tempfile(),
  format = "native")

from.dfs(input, format = "native",
  to.data.frame = FALSE, vectorized = FALSE,
  structured = FALSE)

RHadoop API


Page 20: Data Hacking with RHadoop

•  Objects
   – my_car = list(color="green", model="volt")

•  Transforming a vector (list), iterating
   – lapply/sapply/tapply – functional programming constructs

•  Loops (not preferred)
   – for (i in 1:100) {…}
   – Note this is the same as lapply(1:100, function(i){…})

•  Other control structures – basically as you would expect
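The loop-versus-lapply equivalence the slide notes has a direct analogue in most languages; in Python terms (for illustration only):

```python
# Imperative loop ...
squares_loop = []
for i in range(1, 101):
    squares_loop.append(i * i)

# ... versus the lapply-style functional form: same values, one expression.
squares_fn = [i * i for i in range(1, 101)]
```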

Doing Things the R Way


Page 21: Data Hacking with RHadoop

•  R helps you! O_o

•  Every object has a mode and length and hence can be interpreted as some sort of vector – even primitives!

•  Even primitives such as strings or integers are stored in a vector of length 1, never free-standing

•  There are lots of types of vectors
   – Lists (think linked list)
   – Atomic vectors (think array)
   http://cran.r-project.org/doc/manuals/R-intro.html#The-intrinsic-attributes-mode-and-length

•  Type coercion usually works the way you would expect
   – But… you may find yourself using as.list() or as.vector() or doing manual coercion frequently, depending on what libraries you're using, due to mode not matching

Vectors in R


Page 22: Data Hacking with RHadoop

fakedata = data.frame(
  x = c(rnorm(100)*.25, rep(.75,100)+rnorm(100)*.25),
  y = c(rnorm(100), rep(1,100)+rnorm(100)),
  z = c(rep(0,100), rep(1,100)))

plot(fakedata[,"x"], fakedata[,"y"],
  col=sapply(fakedata[,"z"], function(z) ifelse(z>0,"blue","green")))

Example – Fake Data


Page 23: Data Hacking with RHadoop

rmr.options.set(backend="local")

ints = to.dfs(1:100)

squares = mapreduce(ints, map=function(k, v) keyval(NULL, v^2))

from.dfs(squares)

# notice the result will be
# keyvals

Examples – Simple Parallelism


Page 24: Data Hacking with RHadoop

library(e1071)  # provides svm(); see the cluster-install slide

kernels = to.dfs(list("linear","polynomial","radial","sigmoid"))

models = from.dfs(mapreduce(kernels,
  map=function(nothing, kern)
    keyval(NULL, svm(factor(z)~., fakedata, kernel=kern))))

plot(models[[1]][["val"]], fakedata)

Examples – Trying Lots of SVM Kernels


Page 25: Data Hacking with RHadoop

calls = to.dfs(list(
  list("glm", z~., family=binomial("logit"), fakedata),
  list("svm", z~., fakedata)))

models = from.dfs(mapreduce(calls,
  map=function(nothing, callsig)
    keyval(NULL, do.call(callsig[[1]], callsig[2:length(callsig)]))))

models[[1]][["val"]]

Examples – Different Models


