Other Map-Reduce (ish) Frameworkswcohen/10-605/beyond-hadoop.pdf · Special case of a map-reduce...

Date post: 09-Jul-2020
Other Map-Reduce (ish) Frameworks

William  Cohen


Outline

• More concise languages for map-reduce pipelines
• Abstractions built on top of map-reduce
  – General comments
  – Specific systems
    • Cascading, Pipes
    • PIG, Hive
    • Spark, Flink


Y:Y=Hadoop+X or Hadoop~=Y

• What else are people using?
  – instead of Hadoop
    • Not really covered this lecture
  – on top of Hadoop


Issues with Hadoop

• Too much typing
  – programs are not concise
• Too low level
  – missing abstractions
  – hard to specify a workflow
• Not well suited to iterative operations
  – E.g., E/M, k-means clustering, …
  – Workflow and memory-loading issues


STREAMING AND MRJOB: MORE CONCISE MAP-REDUCE PIPELINES


Hadoop streaming

• start with a stream & sort pipeline:
    cat data | mapper.py | sort -k1,1 | reducer.py

• run with hadoop streaming instead:
    bin/hadoop jar contrib/streaming/hadoop-*streaming*.jar \
      -file mapper.py -file reducer.py \
      -mapper mapper.py \
      -reducer reducer.py \
      -input /hdfs/path/to/inputDir \
      -output /hdfs/path/to/outputDir \
      -mapred.map.tasks=20 \
      -mapred.reduce.tasks=20
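The slide does not show mapper.py or reducer.py, so the following is only a minimal sketch of what a word-count pair might look like, run locally as a stand-in for the stream & sort pipeline. The function names and the tab-separated record format are assumptions for illustration.

```python
import itertools


def mapper(lines):
    """Emit one 'word<TAB>1' record per token, as a word-count mapper.py might."""
    for line in lines:
        for word in line.lower().split():
            yield f"{word}\t1"


def reducer(sorted_records):
    """Sum the counts in each run of identical keys; input must be sorted by key."""
    keyed = (record.split("\t") for record in sorted_records)
    for word, run in itertools.groupby(keyed, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in run)}"


def stream_and_sort(lines):
    """Local stand-in for: cat data | mapper.py | sort -k1,1 | reducer.py"""
    by_key = sorted(mapper(lines), key=lambda record: record.split("\t")[0])
    return list(reducer(by_key))
```

The reducer only works because the sort step brings equal keys together, which is the same guarantee Hadoop's shuffle provides between the map and reduce phases.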


mrjob word count

• Python layer over map-reduce – very concise
• Can run locally in Python
• Allows a single job or a linear chain of steps


mrjob most freq word


MAP-REDUCE ABSTRACTIONS


Abstractions On Top Of Hadoop

•  MRJob  and  other  tools  to  make  Hadoop  pipelines  more  concise  (Dumbo,  …)  still  keep  the  same  basic  language  of  map-­‐reduce  jobs

•  How  else  can  we  express  these  sorts  of  computations?  Are  there  some  common  special  cases  of  map-­‐reduce  steps  we  can  parameterize  and  reuse?


Abstractions On Top Of Hadoop

• Some obvious streaming processes:
  – for each row in a table
    • Transform it and output the result
    • Decide if you want to keep it with some boolean test, and copy out only the ones that pass the test

Example: stem words in a stream of word-count pairs:
  (“aardvarks”,1) → (“aardvark”,1)

Proposed syntax: table2 = MAP table1 TO λ row : f(row)
  f(row) → row’

Example: apply stop words
  (“aardvark”,1) → (“aardvark”,1)
  (“the”,1) → deleted

Proposed syntax: table2 = FILTER table1 BY λ row : f(row)
  f(row) → {true,false}
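As a rough illustration of the two proposed operations, here is a small sketch that treats a table as any iterable of rows. The helper names (map_table, filter_table, naive_stem, is_content_word) and the tiny stop-word set are hypothetical, not part of any framework.

```python
def map_table(table, f):
    """MAP table TO f: apply f to every row, lazily."""
    return (f(row) for row in table)


def filter_table(table, f):
    """FILTER table BY f: keep only the rows for which f(row) is true."""
    return (row for row in table if f(row))


def naive_stem(row):
    """Hypothetical stemmer for (word, count) rows: strip a trailing 's'."""
    word, count = row
    return (word[:-1] if word.endswith("s") else word, count)


STOP_WORDS = {"the", "a", "an"}  # hypothetical stop-word list


def is_content_word(row):
    """Boolean test for FILTER: true when the word is not a stop word."""
    return row[0] not in STOP_WORDS
```

Both operations look at one row at a time and keep no state, which is why they can run as pure streaming (map-only) steps.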


Abstractions On Top Of Hadoop

• A non-obvious(?) streaming process:
  – for each row in a table
    • Transform it to a list of items
    • Splice all the lists together to get the output table (flatten)

Example: tokenizing a line
  “I found an aardvark” → [“i”, “found”, “an”, “aardvark”]
  “We love zymurgy” → [“we”, “love”, “zymurgy”]
  ...but the final table is one word per row:
    “i”
    “found”
    “an”
    “aardvark”
    “we”
    “love”
    …

Proposed syntax: table2 = FLATMAP table1 TO λ row : f(row)
  f(row) → list of rows
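A minimal sketch of FLATMAP under the same "table is an iterable of rows" assumption. The names flatmap_table and tokenize are hypothetical.

```python
from itertools import chain


def flatmap_table(table, f):
    """FLATMAP table TO f: f returns a list per row; splice the lists together."""
    return chain.from_iterable(f(row) for row in table)


def tokenize(line):
    """Hypothetical tokenizer: lowercase the line and split on whitespace."""
    return line.lower().split()
```

For the two example lines above, list(flatmap_table(lines, tokenize)) yields one word per row rather than one list per line.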


Abstractions On Top Of Hadoop

•  Another  example  from  the  Naïve  Bayes  test  program…


NB Test Step (Can we do better?)

Event counts:
  X=w1^Y=sports       5245
  X=w1^Y=worldNews    1054
  X=..                2120
  X=w2^Y=…            37
  X=…                 3
  …

How:
• Stream and sort:
  • for each C[X=w^Y=y]=n
    • print “w C[Y=y]=n”
  • sort and build a list of values associated with each key w

Like an inverted index:

  w          Counts associated with w
  aardvark   C[w^Y=sports]=2
  agent      C[w^Y=sports]=1027,C[w^Y=worldNews]=564
  …          …
  zynga      C[w^Y=sports]=21,C[w^Y=worldNews]=4464


The general case:
We’re taking rows from a table
• In a particular format (event,count)
Applying a function to get a new value
• The word for the event
And grouping the rows of the table by this new value
→ Grouping operation
Special case of a map-reduce

Proposed syntax: GROUP table BY λ row : f(row)
  Output: key, listOfRowsWithKey
  Could define f via: a function, a field of a defined record structure, …
  f(row) → field


Aside: you guys know how to implement this, right?

1.  Output pairs (f(row),row) with a map/streaming process

2.  Sort pairs by key – which is f(row)

3.  Reduce and aggregate by appending together all the values associated with the same key
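The three steps above can be sketched in a few lines of Python. This is an in-memory illustration only. The names group_table and word_of, and the 'X=word^Y=label' event format, are assumptions based on the NB example.

```python
from itertools import groupby
from operator import itemgetter


def group_table(table, f):
    """GROUP table BY f, following the three steps of the aside."""
    pairs = [(f(row), row) for row in table]    # 1. output (f(row), row) pairs
    pairs.sort(key=itemgetter(0))               # 2. sort the pairs by key
    return [(key, [row for _, row in run])      # 3. append the values for each key
            for key, run in groupby(pairs, key=itemgetter(0))]


def word_of(row):
    """Hypothetical key function: the word in an event such as 'X=agent^Y=sports'."""
    event, _count = row
    return event.split("^")[0].split("=", 1)[1]
```

On a cluster, step 1 is the map phase, step 2 is the shuffle/sort, and step 3 is the reduce phase.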


Abstractions On Top Of Hadoop

•  And another example from the Naïve Bayes test program…


Request-and-answer

Test data:
  id1 w1,1 w1,2 w1,3 …. w1,k1
  id2 w2,1 w2,2 w2,3 ….
  id3 w3,1 w3,2 ….
  id4 w4,1 w4,2 …
  id5 w5,1 w5,2 ….
  ..

Record of all event counts for each word:

  w          Counts associated with w
  aardvark   C[w^Y=sports]=2
  agent      C[w^Y=sports]=1027,C[w^Y=worldNews]=564
  …          …
  zynga      C[w^Y=sports]=21,C[w^Y=worldNews]=4464

Step 2: stream through and for each test case
  idi wi,1 wi,2 wi,3 …. wi,ki
request the event counters needed to classify idi from the event-count DB, then classify using the answers

Classification logic


Request-and-answer

• Break down into stages
  – Generate the data being requested (indexed by key, here a word)
    • E.g., with group … by
  – Generate the requests as (key, requestor) pairs
    • E.g., with flatmap … to
  – Join these two tables by key
    • Join defined as (1) cross-product and (2) filter out pairs with different values for keys
    • This replaces the step of concatenating two different tables of key-value pairs, and reducing them together
  – Postprocess the joined result
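A small sketch of the stages above, assuming both tables are lists of (key, value) pairs. join_naive follows the stated definition literally. join_hashed is a swapped-in, more practical variant that indexes one side by key. All names here are hypothetical.

```python
from collections import defaultdict


def make_requests(test_docs):
    """FLATMAP each (doc_id, words) test case into (word, doc_id) request pairs."""
    return [(word, doc_id) for doc_id, words in test_docs for word in words]


def join_naive(left, right):
    """Join by definition: cross-product, then drop pairs whose keys differ."""
    return [(k1, v1, v2) for k1, v1 in left for k2, v2 in right if k1 == k2]


def join_hashed(left, right):
    """Same rows as join_naive, without ever forming the full cross-product."""
    index = defaultdict(list)
    for key, value in left:
        index[key].append(value)
    return [(key, v1, v2) for key, v2 in right for v1 in index.get(key, [])]
```

Requests for words that have no counters (such as "found" in the example data) simply drop out of the joined table, as in any inner join.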


  w          Counters
  aardvark   C[w^Y=sports]=2
  agent      C[w^Y=sports]=1027,C[w^Y=worldNews]=564
  …          …
  zynga      C[w^Y=sports]=21,C[w^Y=worldNews]=4464

  w          Counters          Requests
  aardvark   C[w^Y=sports]=2   ~ctr to id1
  agent      C[w^Y=sports]=…   ~ctr to id345
  agent      C[w^Y=sports]=…   ~ctr to id9854
  agent      C[w^Y=sports]=…   ~ctr to id345
  …          C[w^Y=sports]=…   ~ctr to id34742
  zynga      C[…]              ~ctr to id1
  zynga      C[…]              …

  w          Request
  found      ~ctr to id1
  aardvark   ~ctr to id1
  …
  zynga      ~ctr to id1
  …          ~ctr to id2


Proposed syntax: table3 = JOIN table1 BY λ row : f(row), table2 BY λ row : g(row)
  Could define f and g via: a function, a field of a defined record structure, …


MAP-REDUCE ABSTRACTIONS: CASCADING, PIPES, SCALDING


Y:Y=Hadoop+X

• Cascading
  – Java library for map-reduce workflows
  – Also some library operations for common mappers/reducers


Cascading WordCount Example

[Code listing not preserved in this transcript; the annotations on it were:]
• Input format
• Output format: pairs
• Bind to HFS path
• Bind to HFS path
• A pipeline of map-reduce jobs
• Append a step: apply function to the “line” field
• Append step: group a (flattened) stream of “tuples”
• Replace line with bag of words
• Append step: aggregate grouped values
• Run the pipeline


Cascading WordCount Example

Is this inefficient? We explicitly form a group for each word, and then count the elements…?

We could be saved by careful optimization: we know we don’t need the GroupBy intermediate result when we run the assembly….

Many of the Hadoop abstraction levels have a similar flavor:
• Define a pipeline of tasks declaratively
• Optimize it automatically
• Run the final result

The key question: does the system successfully hide the details from you?


Cascading WordCount Example

Another pipeline:
  words = FLATMAP docs BY λ d: tokenize(d)
  contentWords = FILTER words BY λ w: !contains(stopwords,w)
  stems = MAP contentWords BY λ w: stem(w)
  stemGroups = GROUP stems BY λ s: s
  stemCounts = MAP stemGroups BY λ stem,listOfStems: (stem,listOfStems.length())
  …
How many passes do we need to make over the data?


Optimize to one reduce
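One way to see why the pipeline can collapse to a single pass plus one reduce: the FLATMAP, FILTER and MAP steps are all per-row, so they can be chained lazily, and the GROUP followed by a length count can be fused into one counting step. The sketch below is plain Python, not the output of any real optimizer. The stop-word list and stemmer are hypothetical.

```python
from collections import Counter

STOPWORDS = {"the", "a", "an", "of"}  # hypothetical stop-word list


def tokenize(doc):
    """Hypothetical tokenizer: lowercase and split on whitespace."""
    return doc.lower().split()


def stem(word):
    """Hypothetical stemmer: strip a trailing 's' from longer words."""
    return word[:-1] if word.endswith("s") and len(word) > 3 else word


def stem_counts(docs):
    """FLATMAP, FILTER and MAP fused into one streaming pass, then one reduce."""
    words = (w for d in docs for w in tokenize(d))             # FLATMAP
    content_words = (w for w in words if w not in STOPWORDS)   # FILTER
    stems = (stem(w) for w in content_words)                   # MAP
    # GROUP followed by a length count collapses into a single counting reduce,
    # so the per-stem list from the GROUP step is never materialized.
    return Counter(stems)
```

Nothing is computed until Counter consumes the generators, which mirrors the "define declaratively, optimize, then run" flavor described above.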


Y:Y=Hadoop+X

• Cascading
  – Java library for map-reduce workflows
    • expressed as “Pipe”s, to which you add Each, Every, GroupBy, …
  – Also some library operations for common mappers/reducers
    • e.g. RegexGenerator
  – Turing-complete since it’s an API for Java
• Pipes
  – C++ library for map-reduce workflows on Hadoop
• Scalding
  – More concise Scala library based on Cascading


MORE DECLARATIVE LANGUAGES


Hive and PIG: word count

•  Declarative  …..  Fairly  stable

A PIG program is a bunch of assignments where every LHS is a relation. No loops, conditionals, etc. allowed.


More on Pig

• Pig Latin
  – atomic types + compound types like tuple, bag, map
  – execute locally/interactively or on hadoop
• can embed Pig in Java (and Python and …)
• can call out to Java from Pig
• Similar (ish) system from Microsoft: DryadLINQ


Tokenize – built-in function

Flatten – special keyword, which applies to the next step in the process – i.e., foreach is transformed from a MAP to a FLATMAP


PIG Features

• LOAD ‘hdfs-path’ AS (schema)
  – schemas can include int, double, bag, map, tuple, …
• FOREACH alias GENERATE … AS …, …
  – transforms each row of a relation
• DESCRIBE alias / ILLUSTRATE alias -- debugging
• GROUP alias BY …
• FOREACH alias GENERATE group, SUM(….)
  – GROUP/GENERATE … aggregate op together act like a map-reduce
• JOIN r BY field, s BY field, …
  – inner join to produce rows: r::f1, r::f2, … s::f1, s::f2, …
• CROSS r, s, …
  – use with care unless all but one of the relations are singleton
• User defined functions as operators
  – also for loading, aggregates, …

PIG parses and optimizes a sequence of commands before it executes them.
It’s smart enough to turn GROUP … FOREACH … SUM … into a map-reduce.


ANOTHER EXAMPLE: COMPUTING TFIDF IN PIG LATIN


(docid,token) → (docid,token,tf(token in doc))

(docid,token,tf) → (docid,token,tf,length(doc))

(docid,token,tf,n) → (…,tf/n)

(docid,token,tf,n,tf/n) → (…,df)

ndocs.total_docs

(docid,token,tf,n,tf/n) → (docid,token,tf/n * idf)

relation-to-scalar casting
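The Pig listing these annotations point at is not preserved here, so the following is only a plain-Python sketch of the same chain of relations. The function name tfidf is hypothetical, and idf = log(total_docs / df) is an assumed (common) definition; the slide does not state which variant it uses.

```python
import math
from collections import Counter


def tfidf(doc_tokens):
    """doc_tokens: list of (docid, token) rows. Returns {(docid, token): score}."""
    tf = Counter(doc_tokens)                               # (docid,token) -> tf
    doc_len = Counter(docid for docid, _ in doc_tokens)    # docid -> n = length(doc)
    df = Counter(token for _, token in tf)                 # token -> number of docs containing it
    total_docs = len(doc_len)                              # plays the role of ndocs.total_docs
    return {
        (docid, token): (count / doc_len[docid]) * math.log(total_docs / df[token])
        for (docid, token), count in tf.items()
    }
```

The single number total_docs is used inside a per-row expression, which is the situation the "relation-to-scalar casting" note refers to.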


Other PIG features

• …
• Macros, nested queries,
• FLATTEN “operation”
  – transforms a bag or a tuple into its individual elements
  – this transform affects the next level of the aggregate
• STREAM and DEFINE … SHIP
    DEFINE myfunc `python myfun.py` SHIP('myfun.py')
    …
    r = STREAM s THROUGH myfunc AS (…);


TF-IDF in PIG - another version


Issues with Hadoop

• Too much typing
  – programs are not concise
• Too low level
  – missing abstractions
  – hard to specify a workflow
• Not well suited to iterative operations
  – E.g., E/M, k-means clustering, …
  – Workflow and memory-loading issues

First: an iterative algorithm in Pig Latin


Julien Le Dem - Yahoo

How to use loops, conditionals, etc? Embed PIG in a real programming language.


An example from Ron Bekkerman


Example: k-means clustering

• An EM-like algorithm:
• Initialize k cluster centroids
• E-step: associate each data instance with the closest centroid
  – Find expected values of cluster assignments given the data and centroids
• M-step: recalculate centroids as an average of the associated data instances
  – Find new centroids that maximize that expectation
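A minimal single-machine sketch of the loop described above, assuming dense points given as tuples of floats and squared Euclidean distance. The initial centroids are passed in by the caller, and the function names are hypothetical.

```python
def closest(point, centroids):
    """Index of the centroid nearest to point (squared Euclidean distance)."""
    return min(range(len(centroids)),
               key=lambda j: sum((p - c) ** 2 for p, c in zip(point, centroids[j])))


def kmeans(points, centroids, iterations=10):
    """Alternate the E-step and M-step for a fixed number of iterations."""
    centroids = [tuple(c) for c in centroids]
    for _ in range(iterations):
        # E-step: associate each data instance with the closest centroid.
        clusters = [[] for _ in centroids]
        for point in points:
            clusters[closest(point, centroids)].append(point)
        # M-step: each centroid becomes the average of its associated instances.
        # A centroid with no instances keeps its old position.
        centroids = [
            tuple(sum(dim) / len(members) for dim in zip(*members)) if members else old
            for members, old in zip(clusters, centroids)
        ]
    return centroids
```

Note that every iteration re-reads all of the points, which is the access pattern that makes k-means awkward on plain Hadoop.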


k-means Clustering

centroids


Parallelizing k-means


k-means on MapReduce

• Mappers read data portions and centroids
• Mappers assign data instances to clusters
• Mappers compute new local centroids and local cluster sizes
• Reducers aggregate local centroids (weighted by local cluster sizes) into new global centroids
• Reducers write the new centroids

Panda et al, Chapter 2
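The mapper and reducer roles listed above can be sketched for one iteration as follows. This is an in-memory illustration under the same dense-vector assumption as before, not code from the cited chapter. The function names and the (cluster, (local_centroid, local_size)) record shape are assumptions.

```python
from collections import defaultdict


def kmeans_mapper(portion, centroids):
    """One mapper: assign its data portion, emit (cluster, (local_centroid, local_size))."""
    members = defaultdict(list)
    for point in portion:
        j = min(range(len(centroids)),
                key=lambda k: sum((p - c) ** 2 for p, c in zip(point, centroids[k])))
        members[j].append(point)
    return [(j, (tuple(sum(dim) / len(pts) for dim in zip(*pts)), len(pts)))
            for j, pts in members.items()]


def kmeans_reducer(mapper_outputs, centroids):
    """Combine local centroids, weighted by local sizes, into new global centroids."""
    local_by_cluster = defaultdict(list)
    for output in mapper_outputs:
        for j, local in output:
            local_by_cluster[j].append(local)
    new_centroids = list(centroids)  # clusters that received no points are left unchanged
    for j, local_stats in local_by_cluster.items():
        total = sum(size for _, size in local_stats)
        new_centroids[j] = tuple(
            sum(centroid[d] * size for centroid, size in local_stats) / total
            for d in range(len(centroids[j])))
    return new_centroids
```

Weighting by local size is what makes the result equal to the true mean over all points, even when the mappers see portions of different sizes.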


k-means in Apache Pig: input data

• Assume we need to cluster documents
  – Stored in a 3-column table D:

      Document   Word       Count
      doc1       Carnegie   2
      doc1       Mellon     2

• Initial centroids are k randomly chosen docs
  – Stored in table C in the same format as above


k-means in Apache Pig: E-step

$$\arg\max_c \frac{\sum_{w \in d} i_d^w \cdot i_c^w}{\sqrt{\sum_{w \in c} \left(i_c^w\right)^2}}$$

D_C = JOIN C BY w, D BY w;
PROD = FOREACH D_C GENERATE d, c, id * ic AS idic;

PRODg = GROUP PROD BY (d, c);
DOT_PROD = FOREACH PRODg GENERATE d, c, SUM(idic) AS dXc;

SQR = FOREACH C GENERATE c, ic * ic AS ic2;
SQRg = GROUP SQR BY c;
LEN_C = FOREACH SQRg GENERATE c, SQRT(SUM(ic2)) AS lenc;

DOT_LEN = JOIN LEN_C BY c, DOT_PROD BY c;
SIM = FOREACH DOT_LEN GENERATE d, c, dXc / lenc;

SIMg = GROUP SIM BY d;
CLUSTERS = FOREACH SIMg GENERATE TOP(1, 2, SIM);
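To make the data flow of the E-step script easier to follow, here is a plain-Python sketch that mirrors its relations on in-memory lists of (name, word, weight) rows. It is an illustration of the computation, not a translation produced by Pig, and the function name e_step is hypothetical.

```python
import math
from collections import defaultdict


def e_step(D, C):
    """D and C are lists of (name, word, weight) rows. Returns {doc: best centroid}."""
    # D_C / PROD / DOT_PROD: join on word, multiply weights, sum per (d, c).
    centroids_by_word = defaultdict(list)
    for c, w, ic in C:
        centroids_by_word[w].append((c, ic))
    dot = defaultdict(float)
    for d, w, i_d in D:
        for c, ic in centroids_by_word.get(w, []):
            dot[(d, c)] += i_d * ic
    # SQR / LEN_C: squared length of every centroid (square root taken below).
    squared_len = defaultdict(float)
    for c, _w, ic in C:
        squared_len[c] += ic * ic
    # SIM / CLUSTERS: normalize by centroid length, keep the top centroid per doc.
    best = {}
    for (d, c), dxc in dot.items():
        sim = dxc / math.sqrt(squared_len[c])
        if d not in best or sim > best[d][1]:
            best[d] = (c, sim)
    return {d: c for d, (c, _sim) in best.items()}
```

As with the inner JOIN in the script, a document that shares no word with any centroid gets no assignment at all.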


k-means in Apache Pig: M-step

D_C_W = JOIN CLUSTERS BY d, D BY d;

D_C_Wg = GROUP D_C_W BY (c, w);
SUMS = FOREACH D_C_Wg GENERATE c, w, SUM(id) AS sum;

D_C_Wgg = GROUP D_C_W BY c;
SIZES = FOREACH D_C_Wgg GENERATE c, COUNT(D_C_W) AS size;

SUMS_SIZES = JOIN SIZES BY c, SUMS BY c;
C = FOREACH SUMS_SIZES GENERATE c, w, sum / size AS ic;

Finally - embed in Java (or Python or ….) to do the looping


The problem with k-means in Hadoop

I/O  costs


Data is read, and model is written, with every iteration

• Mappers read data portions and centroids
• Mappers assign data instances to clusters
• Mappers compute new local centroids and local cluster sizes
• Reducers aggregate local centroids (weighted by local cluster sizes) into new global centroids
• Reducers write the new centroids

Panda et al, Chapter 2


SCHEMES DESIGNED FOR ITERATIVE HADOOP PROGRAMS: SPARK AND FLINK


Spark word count example

• Research project, based on Scala and Hadoop
• Now APIs in Java and Python as well
• Familiar-looking API for abstract operations (map, flatMap, reduceByKey, …)
• Most API calls are “lazy” – i.e., counts is a data structure defining a pipeline, not a materialized table.
• Includes ability to store a sharded dataset in cluster memory as an RDD (resilient distributed dataset)


Spark logistic regression example


•  Allows  caching  data  in  memory
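The Spark listing itself is not preserved in this transcript. The sketch below is plain Python rather than the Spark API. It only illustrates why caching matters for this workload: the data is parsed once, and every gradient step then re-scans the in-memory copy. The input format ('label x1 x2 ...'), the function names, and the learning rate are all assumptions.

```python
import math


def load_points(lines):
    """Parse 'label x1 x2 ...' lines once; stands in for loading and caching the data."""
    return [(float(parts[0]), [float(x) for x in parts[1:]])
            for parts in (line.split() for line in lines)]


def train_logistic(points, iterations=100, rate=0.5):
    """Batch gradient descent; each iteration re-scans the cached points."""
    w = [0.0] * len(points[0][1])
    for _ in range(iterations):
        grad = [0.0] * len(w)
        for y, x in points:                        # "map": per-point gradient
            p = 1.0 / (1.0 + math.exp(-sum(wi * xi for wi, xi in zip(w, x))))
            for i, xi in enumerate(x):             # "reduce": sum the gradients
                grad[i] += (p - y) * xi
        w = [wi - rate * gi / len(points) for wi, gi in zip(w, grad)]
    return w
```

In a Hadoop-style implementation the parsing in load_points would be repeated from disk on every one of the iterations; keeping the parsed points in memory removes that cost.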


FLINK

• Recent Apache Project – just moved to top-level at 0.8 – formerly Stratosphere


•  Apache  Project  –  just  getting  started


Java API


• Like Spark, in-memory or on disk
• Everything is a Java object
• Unlike Spark, contains operations for iteration
  – Allowing query optimization
• Very easy to use and install in local mode
  – Very modular
  – Only needs Java


MORE EXAMPLES IN PIG

Phrase Finding in PIG

Phrase Finding 1 - loading the input

PIG Features

• comments -- like this /* or like this */
• 'shell-like' commands:
  – fs -ls … -- any hadoop fs … command
  – some shortcuts: ls, cp, …
  – sh ls -al -- escape to shell
• LOAD 'hdfs-path' AS (schema)
  – schemas can include int, double, …
  – schemas can include complex types: bag, map, tuple, …
• FOREACH alias GENERATE … AS …, …
  – transforms each row of a relation
  – operators include +, -, and, or, …
  – can extend this set easily (more later)
• DESCRIBE alias -- shows the schema
• ILLUSTRATE alias -- derives a sample tuple
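To make the row-at-a-time behaviour of these commands concrete, here is a rough plain-Python analogue. It is not Pig code: `load`, `foreach_generate`, and `describe` are hypothetical helpers, and the tab-separated toy input is an assumption made for the example.

```python
from collections import namedtuple

def load(lines, schema):
    """Rough analogue of LOAD ... AS (schema): split tab-separated lines and cast each field."""
    Row = namedtuple("Row", [name for name, _ in schema])
    rows = []
    for line in lines:
        values = line.rstrip("\n").split("\t")
        rows.append(Row(*(cast(v) for (_, cast), v in zip(schema, values))))
    return rows

def foreach_generate(relation, **exprs):
    """Rough analogue of FOREACH alias GENERATE expr AS name, ...: one output row per input row."""
    Out = namedtuple("Out", list(exprs))
    return [Out(*(f(row) for f in exprs.values())) for row in relation]

def describe(relation):
    """Rough analogue of DESCRIBE: report the field names of the relation."""
    return relation[0]._fields

# Toy input (made up): word, count, corpus total
lines = ["the\t120\t1000", "pig\t3\t1000"]
counts = load(lines, [("word", str), ("n", int), ("total", int)])
freqs = foreach_generate(counts, word=lambda r: r.word, freq=lambda r: r.n / r.total)
```

Here `describe(freqs)` plays the role of DESCRIBE, and looking at `freqs[0]` is a crude stand-in for ILLUSTRATE, which in Pig derives its sample tuple automatically.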

Phrase Finding 2 - word counts

PIG Features

• LOAD 'hdfs-path' AS (schema)
  – schemas can include int, double, bag, map, tuple, …
• FOREACH alias GENERATE … AS …, …
  – transforms each row of a relation
• DESCRIBE alias / ILLUSTRATE alias -- debugging
• GROUP r BY x
  – like a shuffle-sort: produces a relation with fields group and r, where r is a bag
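The shape of the GROUP output is easier to see on a tiny example. The following is a plain-Python sketch of the semantics only (the helper `group_by` and the toy relation are hypothetical): each output row pairs a key, corresponding to the `group` field, with a bag of all the input rows that share it.

```python
from collections import defaultdict

def group_by(relation, key):
    """Return rows of the form (group, bag), where bag holds every input row sharing that key."""
    bags = defaultdict(list)
    for row in relation:
        bags[key(row)].append(row)
    return sorted(bags.items())  # ordered by key, as a shuffle-sort would deliver them

r = [("a", 1), ("b", 2), ("a", 3)]
grouped = group_by(r, key=lambda row: row[0])
# grouped[0] is ("a", [("a", 1), ("a", 3)]): the bag keeps whole rows, not just the values
```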

PIG parses and optimizes a sequence of commands before it executes them. It's smart enough to turn GROUP … FOREACH … SUM … into a map-reduce job.

PIG Features

• LOAD 'hdfs-path' AS (schema)
  – schemas can include int, double, bag, map, tuple, …
• FOREACH alias GENERATE … AS …, …
  – transforms each row of a relation
• DESCRIBE alias / ILLUSTRATE alias -- debugging
• GROUP alias BY …
• FOREACH alias GENERATE group, SUM(….)
  – GROUP/GENERATE … aggregate op together act like a map-reduce
  – aggregates: COUNT, SUM, AVG, MAX, MIN, …
  – you can write your own
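Why GROUP followed by an aggregate behaves like a map-reduce can be sketched in plain Python. This is an illustration of the idea, not of how Pig compiles its plans; the function names and the two toy documents are assumptions.

```python
from itertools import groupby
from operator import itemgetter

def map_phase(docs):
    """Emit (word, 1) for every token, as the map side of a word count would."""
    for doc in docs:
        for word in doc.split():
            yield (word, 1)

def shuffle_sort(pairs):
    """Bring equal keys together: the step that GROUP ... BY corresponds to."""
    ordered = sorted(pairs, key=itemgetter(0))
    for key, group in groupby(ordered, key=itemgetter(0)):
        yield key, [value for _, value in group]

def reduce_phase(grouped, aggregate=sum):
    """Apply an aggregate to each bag: the FOREACH ... GENERATE group, SUM(...) step."""
    return [(key, aggregate(values)) for key, values in grouped]

docs = ["pig is a pig", "hadoop is not a pig"]
word_counts = reduce_phase(shuffle_sort(map_phase(docs)))
```

Swapping `aggregate=sum` for `max`, `min`, or `len` mirrors choosing a different built-in aggregate, and passing any other function over a list mirrors writing your own.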

Phrase Finding 3 - assembling phrase- and word-level statistics

PIG Features

• LOAD 'hdfs-path' AS (schema)
  – schemas can include int, double, bag, map, tuple, …
• FOREACH alias GENERATE … AS …, …
  – transforms each row of a relation
• DESCRIBE alias / ILLUSTRATE alias -- debugging
• GROUP alias BY …
• FOREACH alias GENERATE group, SUM(….)
  – GROUP/GENERATE … aggregate op together act like a map-reduce
• JOIN r BY field, s BY field, …
  – inner join to produce rows: r::f1, r::f2, … s::f1, s::f2, …
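The row layout produced by an inner join (all fields of r followed by all fields of s) can be sketched as follows. This is plain Python using an in-memory hash index, which is an assumption made for brevity and not a description of Pig's join strategy; the relation contents are made up.

```python
from collections import defaultdict

def inner_join(r, r_key, s, s_key):
    """Join r and s on r[r_key] == s[s_key]; each output row is the r fields then the s fields."""
    index = defaultdict(list)
    for s_row in s:
        index[s_row[s_key]].append(s_row)
    return [r_row + s_row for r_row in r for s_row in index.get(r_row[r_key], [])]

# Hypothetical relations: (phrase, first_word, phrase_count) and (word, word_count)
phrase_counts = [("new york", "new", 10), ("york city", "york", 4)]
word_counts = [("new", 50), ("york", 12), ("pig", 7)]
joined = inner_join(phrase_counts, 1, word_counts, 0)
```

Rows of `word_counts` with no matching phrase (here `"pig"`) drop out, which is what makes the join an inner join.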

Phrase Finding 4 - adding total frequencies

How do we add the totals to the phraseStats relation?

STORE triggers execution of the query plan. It also limits optimization: the stored relation has to be materialized at that point.

Comment: the schema is lost when you STORE, so it has to be given again when the stored data is LOADed.

PIG Features

• LOAD 'hdfs-path' AS (schema)
  – schemas can include int, double, bag, map, tuple, …
• FOREACH alias GENERATE … AS …, …
  – transforms each row of a relation
• DESCRIBE alias / ILLUSTRATE alias -- debugging
• GROUP alias BY …
• FOREACH alias GENERATE group, SUM(….)
  – GROUP/GENERATE … aggregate op together act like a map-reduce
• JOIN r BY field, s BY field, …
  – inner join to produce rows: r::f1, r::f2, … s::f1, s::f2, …
• CROSS r, s, …
  – use with care unless all but one of the relations are singleton
  – newer versions of Pig allow a singleton relation to be cast to a scalar
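The warning about CROSS follows from its size: the output has one row per combination of input rows. A plain-Python sketch (the helper `cross` and the toy relations are hypothetical) shows both the harmless singleton case, which is how a total can be attached to every row of a relation, and the quadratic case.

```python
from itertools import product

def cross(*relations):
    """Cartesian product: concatenate one row from each relation, for every combination."""
    return [sum(rows, ()) for rows in product(*relations)]

phrase_stats = [("new york", 10), ("york city", 4)]
totals = [(14,)]                            # singleton relation holding one grand total
with_totals = cross(phrase_stats, totals)   # same number of rows as phrase_stats
blowup = cross(phrase_stats, phrase_stats)  # len(r) * len(s) rows
```

With a singleton on one side the result is just the other relation with extra columns; with two relations of a million rows each, the same call would produce 10^12 rows.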

Phrase Finding 5 - phrasiness and informativeness

How do we compute some complicated function? With a user-defined function (UDF).

PIG Features

• LOAD 'hdfs-path' AS (schema)
  – schemas can include int, double, bag, map, tuple, …
• FOREACH alias GENERATE … AS …, …
  – transforms each row of a relation
• DESCRIBE alias / ILLUSTRATE alias -- debugging
• GROUP alias BY …
• FOREACH alias GENERATE group, SUM(….)
  – GROUP/GENERATE … aggregate op together act like a map-reduce
• JOIN r BY field, s BY field, …
  – inner join to produce rows: r::f1, r::f2, … s::f1, s::f2, …
• CROSS r, s, …
  – use with care unless all but one of the relations are singleton
• User-defined functions as operators
  – also for loading, aggregates, …
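A user-defined function used as an operator is, in effect, an ordinary function applied to selected fields of each row. The plain-Python sketch below illustrates that idea only: `apply_udf` is a hypothetical helper, and `pointwise_kl` is an assumed example of a scoring function, not necessarily the one used on the phrase-finding slides.

```python
import math

def pointwise_kl(p, q):
    """Assumed example score: the contribution p * log(p / q) of one event to KL(p || q)."""
    if p <= 0.0:
        return 0.0
    return p * math.log(p / q)

def apply_udf(relation, udf, *arg_indexes):
    """Append udf(row[i], row[j], ...) to every row, as a UDF call in FOREACH ... GENERATE would."""
    return [row + (udf(*(row[i] for i in arg_indexes)),) for row in relation]

# Toy rows (made up): (phrase, probability under one model, probability under another)
rows = [("new york", 0.02, 0.005), ("of the", 0.01, 0.01)]
scored = apply_udf(rows, pointwise_kl, 1, 2)
```

The function itself knows nothing about relations or distribution; the framework is what applies it row by row, which is why the same mechanism extends to loaders and aggregates.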

The full phrase-finding pipeline
