Workshop - Hadoop + R by CARLOS GIL BELLOSTA at Big Data Spain 2013

Date post: 06-May-2015
Description:
The workshop will illustrate a number of techniques for data modelling that help us extend our small data capabilities to the world of big data: sampling, resampling, parallelization where possible, etc. We will leverage the functional architecture of R and its statistical analysis prowess in small data environments using the mapreduce technique embedded in Hadoop to tackle large data analysis problems. Particular attention will be paid to the ubiquitous --but non-scalable-- logistic regression technique and its big data alternatives.

Big Data Analytics

Carlos J. Gil Bellosta

Intro to Hadoop & R

All about Hadoop

Hadoop FS

Hadoop & mapreduce

All about R

Counting (& Graphics)

Graphics & big data

Let’s count... hexagons

Details of mapreduce

Scoring, sampling & simulating

Data modelling

Linear Regression

Logistic Regression

Trees & Random Forests

Final remarks

Big Data Analytics: R & Hadoop

Carlos J. Gil Bellosta

November 2013


Table of Contents

1 Intro to Hadoop & R
  • All about Hadoop
    • Hadoop FS
    • Hadoop & mapreduce
  • All about R

2 Counting (& Graphics)

3 Details of mapreduce

4 Scoring, sampling & simulating

5 Data modelling

6 Final remarks


File system: manages everything about files

• Examples: diskettes, hard disks, RAIDs,... magnetic tapes!

• Combination of hardware and software to hide boring activities from users:
  • Find space to write the files
  • Read/write files
  • Manage fragmentation
  • Etc.

• How many devices per FS? (FS-to-device)
  • 1-to-1: diskettes, CD-ROMs, HDDs,...
  • n-to-1: partitioned HDDs,...
  • 1-to-n: RAIDs, Hadoop


Hadoop goodies (as a FS)

• Splits (large) files into chunks spread across machines

• Replicates chunks (3 copies by default)

• Balances data

• Robust to hardware failures

• It is rack aware

Obviously, it requires some system to keep track of:

• Which servers/racks are up/down

• Where each chunk is located

• ...


How to work with data in Hadoop?

• Provides a shell (ls, cp, etc.)

• You can put/get data between your local FS and the Hadoop FS

• That is:
  • You can dump your data to your local machine
  • You can run your programs on your local machine
  • You can put the results back into Hadoop

• But what if the file is too large?

Solution

Rather than bringing the data to the code, why not move the code to the data?

One of the ways to move code to data is known as mapreduce.


Mapreduce

• Two-step process:
  • Map: run your code on chunks all over the cluster
  • Reduce: reshape the output into the desired format

• Hadoop manages issues:
  • System failures
  • Threads that do not return
  • And all (?) that made life miserable for OpenMP, MPI, etc. users

• Slotted approach: mapreduce provides slots where you put the mapper/reducer code

• The code is for you to provide!
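
The two steps can be imitated on a single machine in plain R. This is only a sketch under assumptions: the "chunks" are a local list built with split(), lapply() stands in for the mappers that Hadoop would run on different nodes, and the names map_fn and reduce_fn are made up for the illustration.

```r
# Toy data, cut into 4 chunks as a stand-in for blocks stored on different nodes.
set.seed(1)
x <- rnorm(1000, mean = 5)
chunks <- split(x, rep(1:4, length.out = length(x)))

# Map: runs independently on each chunk and returns a small summary.
map_fn <- function(chunk) c(sum = sum(chunk), n = length(chunk))

# Reduce: combines two per-chunk summaries into one.
reduce_fn <- function(a, b) a + b

partial <- lapply(chunks, map_fn)        # Hadoop would run these in parallel
total   <- Reduce(reduce_fn, partial)    # pooled sum and count
mr_mean <- unname(total["sum"] / total["n"])
```

The mean is never computed on the full vector: only the small (sum, n) summaries travel from the map step to the reduce step.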


What is R?

• R is a
  • software package?
  • programming language?
  • environment?

  for data analysis and graphics.

• R users are (should be?) used to the mapreduce approach:

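library(plyr)  # provides ddply() and summarize()
# dfx: a data frame with columns group, sex and age
ddply(dfx, .(group, sex), summarize,
      mean = mean(age),
      sd = sd(age))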


2 Counting (& Graphics)

• Graphics & big data
• Let’s count... hexagons


Visualizing a million


Fluctuation plot


Table plot


• Non-trivial counting exercise (no, we are not counting words today!)

• Good visualization features for big datasets

• Fits in the mapreduce framework:
  • Map: assigns points to hexagons
  • Reduce: aggregates counts on hexagons
  • The output is small and can be plotted locally
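
The map and reduce steps for hexagon counting can be sketched in base R as follows. Everything here is an assumption made for illustration: the helper names (hex_center, map_hex, reduce_hex), the hexagon size w, and the use of a local list of chunks instead of a Hadoop job. The binning rule (nearest centre among two interleaved rectangular lattices) is one standard way to build a hexagonal grid, not necessarily the one used in the workshop.

```r
# Nearest hexagon centre for each point. The centres are w apart and lie on
# two interleaved rectangular lattices (A and B); the closer centre wins.
hex_center <- function(x, y, w = 1) {
  h  <- sqrt(3) * w
  xa <- round(x / w) * w
  ya <- round(y / h) * h
  xb <- (floor(x / w) + 0.5) * w
  yb <- (floor(y / h) + 0.5) * h
  use_a <- (x - xa)^2 + (y - ya)^2 <= (x - xb)^2 + (y - yb)^2
  data.frame(cx = ifelse(use_a, xa, xb), cy = ifelse(use_a, ya, yb))
}

# Map: assign the points of one chunk to hexagons and count them.
map_hex <- function(chunk, w = 0.5) {
  ctr <- hex_center(chunk$x, chunk$y, w)
  ctr$n <- 1
  aggregate(n ~ cx + cy, data = ctr, FUN = sum)
}

# Reduce: add up the counts that different chunks report for the same hexagon.
reduce_hex <- function(mapped) {
  aggregate(n ~ cx + cy, data = do.call(rbind, mapped), FUN = sum)
}

set.seed(1)
pts    <- data.frame(x = rnorm(10000), y = rnorm(10000))
chunks <- split(pts, rep(1:4, length.out = nrow(pts)))
counts <- reduce_hex(lapply(chunks, map_hex))

# The result is small enough to plot locally, e.g. symbol size by count:
# plot(counts$cx, counts$cy, cex = sqrt(counts$n) / 5, pch = 16)
```

The output has one row per occupied hexagon, so its size depends on the grid resolution rather than on the number of points.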


3 Details of mapreduce


What you see: input/output, map, reduce

• input:
  • Type: text, csv, R object,...
  • Options: separator,...

• output: similar to input

• map & reduce:
  • Functions with (k, v) arguments (k, key; v, value)
  • They return a (k, v) list
  • Thus, mapreduce jobs can be chained together (the output of the first one is the input for the second)
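
The (k, v) contract and the chaining of jobs can be illustrated with a small local stand-in. The helpers kv_pair and local_mapreduce below are hypothetical and written only for this sketch; they are not the API of any Hadoop package, and the purchase data is invented.

```r
# Hypothetical helpers: a key-value pair is list(key, val); a dataset is a
# list of such pairs.
kv_pair <- function(key, val) list(key = key, val = val)

local_mapreduce <- function(input, map, reduce) {
  # Map phase: each input pair yields a list of (k, v) pairs.
  mapped <- do.call(c, lapply(input, function(kv) map(kv$key, kv$val)))
  # Shuffle: collect the values that share a key.
  keys   <- vapply(mapped, function(kv) as.character(kv$key), character(1))
  idx    <- split(seq_along(mapped), keys)
  groups <- lapply(idx, function(i) lapply(mapped[i], function(kv) kv$val))
  # Reduce phase: one call per key; the output is again a list of (k, v) pairs.
  unname(Map(reduce, names(groups), groups))
}

purchases <- list(kv_pair("ann", 120), kv_pair("bob", 30), kv_pair("ann", 40),
                  kv_pair("cat", 250), kv_pair("bob", 20))

# Job 1: total amount per customer.
job1 <- local_mapreduce(
  purchases,
  map    = function(k, v) list(kv_pair(k, v)),
  reduce = function(k, vs) kv_pair(k, sum(unlist(vs)))
)

# Job 2 consumes the output of job 1: number of customers per spending tier.
job2 <- local_mapreduce(
  job1,
  map    = function(k, v) list(kv_pair(if (v >= 100) "high" else "low", 1)),
  reduce = function(k, vs) kv_pair(k, sum(unlist(vs)))
)
```

Because both map and reduce return (k, v) lists, job2 needs no conversion step between the two jobs.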


What you don’t see

$HADOOP jar $HADOOP_STREAMING -D stream.map.input=typedbytes

-D stream.map.output=typedbytes

-D stream.reduce.input=typedbytes

-D stream.reduce.output=typedbytes

-D mapred.reduce.tasks=0

-input /tmp/RtmpUUrNMj/file68c0185e60c

-output /tmp/RtmpUUrNMj/file68c04c25d5f0

-mapper \"Rscript rmr-streaming-map68c018acf680 \"

-file /tmp/RtmpUUrNMj/rmr-local-env68c0101c8e8a

-file /tmp/RtmpUUrNMj/rmr-global-env68c03abb4080

-file /tmp/RtmpUUrNMj/rmr-streaming-map68c018acf680

-inputformat org.apache.hadoop.streaming.AutoInputFormat

-outputformat org.apache.hadoop.mapred.SequenceFileOutputFormat 2>&1


4 Scoring, sampling & simulating


Scoring

• External consultants build a model (using R and small data)

• Models in R should have a predict method

• You can then score your huge database (in batch)

• No need to rewrite the model into your systems!
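
A minimal sketch of batch scoring, assuming simulated data and a logistic model fitted with glm(). The chunk list stands in for the pieces of a large table, and score_chunk is an illustrative name. Scoring is a map-only task: no reduce step is needed.

```r
set.seed(123)
# Small data: a model is fitted on a sample.
small <- data.frame(x1 = rnorm(500), x2 = rnorm(500))
small$y <- rbinom(500, 1, plogis(-1 + 2 * small$x1 - small$x2))
model <- glm(y ~ x1 + x2, data = small, family = binomial)

# "Big" data, split into chunks as a stand-in for blocks in Hadoop.
big    <- data.frame(x1 = rnorm(20000), x2 = rnorm(20000))
chunks <- split(big, rep(1:8, length.out = nrow(big)))

# Map: score one chunk with the model's predict method.
score_chunk <- function(chunk, model) {
  cbind(chunk, score = predict(model, newdata = chunk, type = "response"))
}
scored <- lapply(chunks, score_chunk, model = model)
```

The only thing shipped to the nodes is the fitted model object, which is why any R model with a predict method can be reused without being rewritten.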


The case for sampling

• Sampling works!

• Sampled datasets can be used to build small data models

• You can use R (& mapreduce) to sample data, but you had better not
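
For completeness, this is roughly what sampling in a map step would look like in R. It is a sketch only: the data is simulated, sample_chunk is an illustrative name, and each row is kept independently with probability p (Bernoulli sampling), so the sample size is random.

```r
# Map: keep each row of a chunk independently with probability p.
sample_chunk <- function(chunk, p) chunk[runif(nrow(chunk)) < p, , drop = FALSE]

set.seed(7)
big    <- data.frame(id = 1:100000, x = rnorm(100000))
chunks <- split(big, rep(1:10, length.out = nrow(big)))

# No reduce logic is needed beyond collecting the kept rows.
small <- do.call(rbind, lapply(chunks, sample_chunk, p = 0.01))
```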


Running simulations on Hadoop

• Some (many?) people say it is not the right tool

• Hadoop jobs need input data, but simulations often have none

• You want to control the number of mappers (which run your simulations)

• Still, mapreduce is nice for simulations...

• ... so let an old dog try its dirty trick!
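
The transcript does not spell out the trick. One common workaround, shown here purely as an assumption, is to feed the job a tiny dummy input with one record per simulation batch, so that the number of records controls how many times the map function runs. The sketch below imitates this locally with a Monte Carlo estimate of pi; simulate_batch and the seed scheme are illustrative.

```r
# Dummy input: one record per batch; the value doubles as the random seed.
seeds <- 1:20

# Map: each record triggers one independent batch of simulations
# (here, counting random points that fall inside the unit quarter circle).
simulate_batch <- function(seed, n = 10000) {
  set.seed(seed)
  x <- runif(n)
  y <- runif(n)
  c(hits = sum(x^2 + y^2 <= 1), n = n)
}

# Reduce: pool the batches.
batches <- lapply(seeds, simulate_batch)
total   <- Reduce(`+`, batches)
pi_hat  <- unname(4 * total["hits"] / total["n"])
```

Seeding each batch from its input record keeps the batches reproducible regardless of which node runs them.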


5 Data modelling

• Linear Regression
• Logistic Regression
• Trees & Random Forests


Linear regression can be parallelized

Simple linear regression: y ∼ α + βx

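\[
\beta = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}
      = \frac{\sum_{i=1}^{n} x_i y_i - \frac{1}{n} \sum_{i=1}^{n} x_i \sum_{j=1}^{n} y_j}{\sum_{i=1}^{n} x_i^2 - \frac{1}{n} \left( \sum_{i=1}^{n} x_i \right)^2}
\]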

Operations are case by case: every sum can be accumulated one case (row) at a time!
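
Since the slope only needs five running sums, each chunk can report its own sums and a reduce step can add them. A minimal sketch in base R, with simulated data and illustrative names (map_sums and the chunk list are assumptions):

```r
set.seed(1)
n <- 10000
x <- rnorm(n)
y <- 2 + 3 * x + rnorm(n)
chunks <- split(data.frame(x, y), rep(1:5, length.out = n))

# Map: five sufficient statistics per chunk.
map_sums <- function(d) c(n = nrow(d), sx = sum(d$x), sy = sum(d$y),
                          sxy = sum(d$x * d$y), sxx = sum(d$x^2))

# Reduce: add the per-chunk statistics.
s <- Reduce(`+`, lapply(chunks, map_sums))

# Slope from the second form of the formula; intercept from the means.
beta  <- unname((s["sxy"] - s["sx"] * s["sy"] / s["n"]) /
                (s["sxx"] - s["sx"]^2 / s["n"]))
alpha <- unname(s["sy"] / s["n"] - beta * s["sx"] / s["n"])
```

The result agrees with lm(y ~ x) up to rounding error, although no step ever sees more than one chunk.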


Multiple linear regression

• Based on X′X and X′y:

\[
\beta = (X'X)^{-1} X'y
\]

• If X′ = [X_1 | ... | X_n] (by blocks), then

\[
X'X = \sum_i X_i X_i'
\]
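
The block decomposition can be checked numerically. In this sketch (simulated data, illustrative names map_block and add_blocks) each block of rows contributes its own X′X and X′y, and the reduce step is a plain matrix sum. The full X and y live in one R session only for convenience; on a cluster each mapper would see just its own rows.

```r
set.seed(2)
n <- 10000
X <- cbind(1, matrix(rnorm(n * 3), ncol = 3))   # intercept + 3 predictors
y <- drop(X %*% c(1, 2, -1, 0.5) + rnorm(n))
rows <- split(seq_len(n), rep(1:5, length.out = n))

# Map: cross-products for one block of rows.
map_block <- function(idx) {
  Xb <- X[idx, , drop = FALSE]
  list(xtx = crossprod(Xb), xty = crossprod(Xb, y[idx]))
}

# Reduce: matrix sums over blocks.
add_blocks <- function(a, b) list(xtx = a$xtx + b$xtx, xty = a$xty + b$xty)
tot <- Reduce(add_blocks, lapply(rows, map_block))

# Solve the normal equations once, on the small p x p system.
beta <- drop(solve(tot$xtx, tot$xty))
```

Only p x p and p x 1 matrices leave the mappers, so the reduce step stays cheap however many rows there are.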


Can logistic regression be parallelized? Yes and no.

• Fitting logistic regression models is iterative, and the iterations themselves cannot run in parallel.

• However, each iteration can be parallelized (these are not unlike fitting linear models as before)

• We will explore two big data alternatives:
  • Parallelize iterations using mapreduce (see http://goo.gl/ftx36r)
  • Split your data meaningfully and do standard logistic regression in the nodes
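
The first alternative can be sketched with Newton's method, where every iteration is a sum over blocks (parallelizable) but the iterations themselves run one after another. This is an illustration on simulated data with made-up helper names, not the code behind the link above.

```r
set.seed(3)
n <- 20000
X <- cbind(1, rnorm(n), rnorm(n))
y <- rbinom(n, 1, plogis(drop(X %*% c(-0.5, 1, -2))))
rows <- split(seq_len(n), rep(1:4, length.out = n))

# Map: gradient and Hessian contributions of one block for the current beta.
map_block <- function(idx, beta) {
  Xb <- X[idx, , drop = FALSE]
  p  <- plogis(drop(Xb %*% beta))
  list(grad = crossprod(Xb, y[idx] - p),
       hess = crossprod(Xb * (p * (1 - p)), Xb))   # rows weighted by p(1 - p)
}

# Reduce: sum the contributions of all blocks.
add_blocks <- function(a, b) list(grad = a$grad + b$grad, hess = a$hess + b$hess)

beta <- rep(0, ncol(X))
for (iter in 1:25) {                       # iterations stay sequential
  tot  <- Reduce(add_blocks, lapply(rows, map_block, beta = beta))  # one "job"
  step <- drop(solve(tot$hess, tot$grad))
  beta <- beta + step
  if (max(abs(step)) < 1e-10) break
}
```

Each pass over the data corresponds to one mapreduce job, so the cost is the number of iterations times the cost of a full scan.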


How many bytes make knowledge? (aka the fractal nature of big data)


Split logistic regression
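
The plot from this slide is not part of the transcript. As a rough sketch of the idea (one ordinary logistic regression per data split), assume a hypothetical segment column that defines a meaningful split and simulated data in which the effect of x differs by segment:

```r
set.seed(4)
n <- 6000
d <- data.frame(segment = sample(c("a", "b", "c"), n, replace = TRUE),
                x = rnorm(n))
slope <- c(a = 2, b = 0, c = -2)            # the effect of x differs by segment
d$y <- rbinom(n, 1, plogis(slope[d$segment] * d$x))

# Map: one standard logistic regression per segment ("in the nodes").
models <- lapply(split(d, d$segment), function(part)
  glm(y ~ x, data = part, family = binomial))

# Scoring: each row is scored by the model of its own segment.
d$score <- NA_real_
for (s in names(models)) {
  rows <- d$segment == s
  d$score[rows] <- predict(models[[s]], newdata = d[rows, ], type = "response")
}

fitted_slopes <- sapply(models, function(m) unname(coef(m)["x"]))
```

Each fit only needs the rows of its own segment, so the models can be estimated independently on different nodes.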


Viable alternatives to logistic models

• Trees
  • High interpretability
  • But unstable and tend to miss out details

• Random forests
  • Black boxes
  • Superb performance
  • These are collections of trees that can be built in parallel

• Both can be parallelized in different ways:
  • Similar to partitioned logistic models above
  • Within training
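
Building a collection of trees in parallel can be sketched with the rpart package (shipped with most R installations). Note the swap: this is plain bagging of rpart trees grown on bootstrap resamples of each chunk, not a true random forest, because no random subset of predictors is drawn at each split. Data, chunking and helper names are assumptions.

```r
library(rpart)

set.seed(42)
n <- 2000
d <- data.frame(x1 = runif(n), x2 = runif(n))
d$y <- factor(ifelse(d$x1 + d$x2 + rnorm(n, sd = 0.2) > 1, "yes", "no"))
chunks <- split(d, rep(1:4, length.out = n))

# Map: each node grows a few trees on bootstrap resamples of its own chunk.
grow_trees <- function(chunk, n_trees = 5) {
  lapply(seq_len(n_trees), function(i) {
    boot <- chunk[sample(nrow(chunk), replace = TRUE), ]
    rpart(y ~ x1 + x2, data = boot, method = "class")
  })
}

# Reduce: pool the trees of all nodes into one ensemble.
forest <- do.call(c, lapply(chunks, grow_trees))

# Prediction: majority vote over the pooled trees.
votes <- sapply(forest, function(tr)
  as.character(predict(tr, newdata = d, type = "class")))
pred <- apply(votes, 1, function(v) names(which.max(table(v))))
accuracy <- mean(pred == d$y)
```

The trees never exchange information while they are grown, which is what makes this kind of ensemble easy to distribute.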


6 Final remarks


Forget most of what you learned today, seriously

• People strive to extend small data models to big data (as we did today)...

• ... but is it the way to go?

• Watch out: microlocal structure
  • Small data people know microlocal structure as outliers
  • Global models (linear, logistic,...) cannot (easily?) exploit microlocal structure
  • But the promises of big data lie precisely there
  • (Otherwise, just sample and you will be fine)

• Areas to watch for insights on big data modelling:
  • SNA (network analysis)
  • Text analysis


Thank you very much and...

... questions?
