10-605: Map-Reduce Workflows
William Cohen
PARALLELIZING STREAM AND SORT
• Counter Machine 1: example 1, example 2, example 3, …. → Counting logic → Partition/Sort → Spill 1, Spill 2, Spill 3, …
• Counter Machine 2: example 1, example 2, example 3, …. → Counting logic → Partition/Sort → Spill 1, Spill 2, Spill 3, …
• Spill files run from Spill 1 through Spill n
• Reducer Machine 1: Merge Spill Files → C[x1] += D1, C[x1] += D2, …. → Logic to combine counter updates
• Reducer Machine 2: Merge Spill Files → C[x2] += D1, C[x2] += D2, …. → combine counter updates
Stream and Sort Counting → Distributed Counting
• Map Process 1: example 1, example 2, example 3, …. → Counting logic → Partition/Sort → Spill 1, Spill 2, Spill 3, …
• Map Process 2: example 1, example 2, example 3, …. → Counting logic → Partition/Sort → Spill 1, Spill 2, Spill 3, …
• Spill files run from Spill 1 through Spill n
• Reducer 1: Merge Spill Files → C[x1] += D1, C[x1] += D2, …. → Logic to combine counter updates
• Reducer 2: Merge Spill Files → C[x2] += D1, C[x2] += D2, …. → combine counter updates
Distributed Stream-and-Sort: Map, Shuffle-Sort, Reduce
Distributed Shuffle-Sort
Combiners in Hadoop
Some of this is wasteful
• Remember: moving data around and writing to/reading from disk are very expensive operations
• No reducer can start until:
  – all mappers are done
  – data for its partition has been received
  – data in its partition has been sorted
How much does buffering help?

BUFFER_SIZE   Time   Message Size
none                 1.7M words
100           47s    1.2M
1,000         42s    1.0M
10,000        30s    0.7M
100,000       16s    0.24M
1,000,000     13s    0.16M
limit                0.05M

Recall the idea here: in stream-and-sort, use a buffer to accumulate counts for common words into fewer messages before the sort, so the sort input is smaller.
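The buffering idea can be sketched in a few lines of Python. This is a minimal illustration, not the course code: the names `buffered_count_messages`, `total_counts`, and the parameter `buffer_size` are assumptions, and the buffer is flushed whenever it holds `buffer_size` distinct words.

```python
from collections import Counter

def buffered_count_messages(words, buffer_size):
    """Yield (word, partial_count) messages, flushing the buffer
    whenever it holds buffer_size distinct words."""
    buffer = Counter()
    for word in words:
        buffer[word] += 1
        if len(buffer) >= buffer_size:
            yield from buffer.items()
            buffer.clear()
    yield from buffer.items()  # flush whatever is left at the end

def total_counts(messages):
    """What the downstream sort-and-sum step would compute."""
    totals = Counter()
    for word, n in messages:
        totals[word] += n
    return totals

words = "a b a a c a b a".split()
unbuffered = list(buffered_count_messages(words, buffer_size=1))  # one message per word
buffered = list(buffered_count_messages(words, buffer_size=3))    # fewer, larger messages
```

Both message streams sum to the same totals; the buffered one simply hands fewer messages to the sort.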
Combiners
• Sits between the map and the shuffle
  – Do some of the reducing while you’re waiting for other stuff to happen
  – Avoid moving all of that data over the network
  – E.g., for wordcount: instead of sending (word,1), send (word,n) where n is a partial count (over data seen by that mapper)
    • Reducer still just sums the counts
• Only applicable when
  – order of reduce operations doesn’t matter (since order is undetermined)
  – effect is cumulative
job.setCombinerClass(Reduce.class);
Deja vu: Combiner = Reducer
• Often the combiner is the reducer.
  – like for word count
  – but not always
  – remember you have no control over when/whether the combiner is applied
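One standard case where the combiner cannot simply be the reducer is averaging, because a mean of partial means is not the overall mean. The sketch below is plain Python rather than the Hadoop API, and the function names are hypothetical. The combiner emits (sum, count) pairs, which can be merged in any order and any number of times, and the division is left to the reducer.

```python
def combine(pairs):
    """Combiner: merge (sum, count) pairs into a single (sum, count) pair."""
    total, count = 0.0, 0
    for s, c in pairs:
        total += s
        count += c
    return (total, count)

def reduce_mean(pairs):
    """Reducer: merge whatever pairs arrive, then divide once at the end."""
    total, count = combine(pairs)
    return total / count

# Values for one key, as seen by two different mappers.
mapper1 = [(4.0, 1), (6.0, 1)]
mapper2 = [(20.0, 1)]

with_combiner = reduce_mean([combine(mapper1), combine(mapper2)])
without_combiner = reduce_mean(mapper1 + mapper2)
# What reusing a "mean" reducer as the combiner would give instead:
mean_of_means = (5.0 + 20.0) / 2
```

Because the reducer gets the same answer whether or not the combiner ran, it does not matter that the framework decides when and whether to apply it.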
Algorithms in Map-Reduce
(Workflows)
Scalable out-of-core classification (of large test sets)
Can we do better than the current approach?
Testing Large-vocab Naïve Bayes
• For each example id, y, x1,….,xd in train:
• Sort the event-counter update “messages”
• Scan and add the sorted messages and output the final counter values
• Initialize a HashSet NEEDED and a hashtable C
• For each example id, y, x1,….,xd in test:
  – Add x1,….,xd to NEEDED
• For each event, C(event) in the summed counters
  – If event involves a NEEDED term x read it into C
• For each example id, y, x1,….,xd in test:
  – For each y’ in dom(Y):
    • Compute log Pr(y’,x1,….,xd) = ….
[For assignment]
Can we do better?
• test is small
• collection of event counts is big

Test data:
id1 w1,1 w1,2 w1,3 …. w1,k1
id2 w2,1 w2,2 w2,3 ….
id3 w3,1 w3,2 ….
id4 w4,1 w4,2 …
id5 w5,1 w5,2 …
...

Event counts:
X=w1^Y=sports       5245
X=w1^Y=worldNews    1054
X=..                2120
X=w2^Y=…            373
X=…                 …

What we’d like:
id1 w1,1 w1,2 w1,3 …. w1,k1    C[X=w1,1^Y=sports]=5245, C[X=w1,1^Y=..],C[X=w1,2^…]
id2 w2,1 w2,2 w2,3 ….          C[X=w2,1^Y=….]=1054,…, C[X=w2,k2^…]
id3 w3,1 w3,2 ….               C[X=w3,1^Y=….]=…
id4 w4,1 w4,2 …                …
Can we do better?

Event counts:
X=w1^Y=sports       5245
X=w1^Y=worldNews    1054
X=..                2120
X=w2^Y=…            373
X=…                 …

Step 1: group counters by word w
How:
• Stream and sort:
  – for each C[X=w^Y=y]=n
    • print “w C[Y=y]=n”
  – sort and build a list of values associated with each key w
Like an inverted index

w          Counts associated with w
aardvark   C[w^Y=sports]=2
agent      C[w^Y=sports]=1027,C[w^Y=worldNews]=564
…          …
zynga      C[w^Y=sports]=21,C[w^Y=worldNews]=4464
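Step 1 can be sketched in Python as a map step that re-keys each counter by its word, followed by a sort and a streaming pass that collects the values for each key. The function names and the regular expression for parsing `X=w^Y=y` event names are assumptions made for this illustration; the counts are the ones shown on the slide.

```python
import re
from itertools import groupby

# Assumed event-name format: X=<word>^Y=<label>
EVENT = re.compile(r"X=(?P<word>[^^]+)\^Y=(?P<label>\S+)")

def counts_by_word(event_counts):
    """Map step: turn ('X=w^Y=y', n) into (w, 'C[Y=y]=n') messages."""
    for event, n in event_counts:
        m = EVENT.fullmatch(event)
        if m:
            yield m.group("word"), f"C[Y={m.group('label')}]={n}"

def collect_records(sorted_messages):
    """Streaming step: build the list of values for each key w."""
    for word, group in groupby(sorted_messages, key=lambda kv: kv[0]):
        yield word, [value for _, value in group]

event_counts = [
    ("X=zynga^Y=sports", 21),
    ("X=agent^Y=worldNews", 564),
    ("X=aardvark^Y=sports", 2),
    ("X=agent^Y=sports", 1027),
    ("X=zynga^Y=worldNews", 4464),
]
records = list(collect_records(sorted(counts_by_word(event_counts))))
```

The in-memory `sorted` call stands in for the external sort; only `collect_records` needs to see the data in key order.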
If these records were in a key-value DB we would know what to do….

Test data:
id1 w1,1 w1,2 w1,3 …. w1,k1
id2 w2,1 w2,2 w2,3 ….
id3 w3,1 w3,2 ….
id4 w4,1 w4,2 …
id5 w5,1 w5,2 …
...

Record of all event counts for each word:
w          Counts associated with w
aardvark   C[w^Y=sports]=2
agent      C[w^Y=sports]=1027,C[w^Y=worldNews]=564
…          …
zynga      C[w^Y=sports]=21,C[w^Y=worldNews]=4464

Step 2: stream through and for each test case
idi wi,1 wi,2 wi,3 …. wi,ki
request the event counters needed to classify idi from the event-count DB, then classify using the answers (classification logic)
Is there a stream-and-sort analog of this request-and-answer pattern?
Recall: Stream and Sort Counting: sort messages so the recipient can stream through them
• Machine A: example 1, example 2, example 3, …. → Counting logic → “C[x] += D” messages
• Machine B: Sort
• Machine C: C[x1] += D1, C[x1] += D2, …. → Logic to combine counter updates
Is there a stream-and-sort analog of this request-and-answer pattern?

Test data:
id1 found an aardvark in zynga’s farmville today!
id2 …
id3 ….
id4 …
id5 …..

Record of all event counts for each word (counter records):
w          Counts associated with w
aardvark   C[w^Y=sports]=2
agent      C[w^Y=sports]=1027,C[w^Y=worldNews]=564
…          …
zynga      C[w^Y=sports]=21,C[w^Y=worldNews]=4464

The classification logic emits one request per word wi,j of each test case idi (“wi,j counters to idi”):
found ~ctr to id1
aardvark ~ctr to id1
…
today ~ctr to id1
…

• ~ is the last printable ASCII character
• % export LC_COLLATE=C
  means that it will sort after anything else with unix sort
• Combine the counter records with the requests and sort
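The effect of the ~ prefix can be checked with a small Python sketch. Python compares strings by code point, which for ASCII text gives the same order as unix sort under LC_COLLATE=C; the tab-separated line layout is an assumption made for this illustration.

```python
import string

counter_records = [
    ("aardvark", "C[w^Y=sports]=2"),
    ("zynga", "C[w^Y=sports]=21,C[w^Y=worldNews]=4464"),
]
requests = [
    ("zynga", "~ctr to id1"),
    ("aardvark", "~ctr to id1"),
]

# Tab-separated lines, as they might be fed to unix sort (assumed layout).
lines = [f"{key}\t{value}" for key, value in requests + counter_records]
lines.sort()  # code-point order, matching LC_COLLATE=C for ASCII text

# The largest printable ASCII character.
last_printable = max(string.printable)
```

After sorting, each word's counter record sits directly in front of the requests for that word, which is what the request-handling step relies on.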
A stream-and-sort analog of the request-and-answer pattern…

Counter records (record of all event counts for each word):
w          Counts
aardvark   C[w^Y=sports]=2
agent      …
…
zynga      …

Requests:
found ~ctr to id1
aardvark ~ctr to id1
…
today ~ctr to id1
…

Combine and sort, then feed the result to the request-handling logic:
w          Counts
aardvark   C[w^Y=sports]=2
aardvark   ~ctr to id1
agent      C[w^Y=sports]=…
agent      ~ctr to id345
agent      ~ctr to id9854
…          ~ctr to id345
agent      ~ctr to id34742
…
zynga      C[…]
zynga      ~ctr to id1
Request-handling logic:
• previousKey = somethingImpossible
• For each (key,val) in input:
  – If key==previousKey
    • Answer(recordForPrevKey,val)
  – Else
    • previousKey = key
    • recordForPrevKey = val

define Answer(record,request):
• find id where “request = ~ctr to id”
• print “id ~ctr for request is record”

Output:
id1 ~ctr for aardvark is C[w^Y=sports]=2
…
id1 ~ctr for zynga is ….
…
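The request-handling logic translates almost line for line into Python. The sketch below assumes (key, val) pairs that are already sorted as described. Unlike the pseudocode, it checks for the `~ctr to ` prefix, so that a word with requests but no counter record (such as "found" here) is not mistaken for a record; that check and the empty-string answer are additions for this illustration, not part of the slide.

```python
REQUEST_PREFIX = "~ctr to "

def answer_requests(sorted_pairs):
    """Stream through sorted (key, val) pairs and answer each request
    with the counter record that precedes it for the same key."""
    previous_key = None          # plays the role of somethingImpossible
    record_for_prev_key = None
    for key, val in sorted_pairs:
        if val.startswith(REQUEST_PREFIX):
            request_id = val[len(REQUEST_PREFIX):]
            # Empty answer if no counter record was seen for this key.
            record = record_for_prev_key if key == previous_key else ""
            yield f"{request_id} ~ctr for {key} is {record}"
        else:
            previous_key = key
            record_for_prev_key = val

pairs = sorted([
    ("aardvark", "C[w^Y=sports]=2"),
    ("zynga", "C[w^Y=sports]=21,C[w^Y=worldNews]=4464"),
    ("found", "~ctr to id1"),
    ("aardvark", "~ctr to id1"),
    ("zynga", "~ctr to id1"),
    ("zynga", "~ctr to id2"),
])
answers = list(answer_requests(pairs))
```

Only one record is held in memory at a time, so the step streams regardless of how many requests a word receives.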
A stream-and-sort analog of the request-and-answer pattern…

Output of the request-handling logic:
id1 ~ctr for aardvark is C[w^Y=sports]=2
…
id1 ~ctr for zynga is ….
…

Test data:
id1 found an aardvark in zynga’s farmville today!
id2 …
id3 ….
id4 …
id5 …..

Combine and sort ????
What we’d wanted:
id1 w1,1 w1,2 w1,3 …. w1,k1    C[X=w1,1^Y=sports]=5245, C[X=w1,1^Y=..],C[X=w1,2^…]
id2 w2,1 w2,2 w2,3 ….          C[X=w2,1^Y=….]=1054,…, C[X=w2,k2^…]
id3 w3,1 w3,2 ….               C[X=w3,1^Y=….]=…
id4 w4,1 w4,2 …                …

What we ended up with … and it’s good enough!
Key   Value
id1   found aardvark zynga farmville today
      ~ctr for aardvark is C[w^Y=sports]=2
      ~ctr for found is C[w^Y=sports]=1027,C[w^Y=worldNews]=564
      …
id2   w2,1 w2,2 w2,3 ….
      ~ctr for w2,1 is …
…     …
Implementation summary

java CountForNB train.dat … > eventCounts.dat
java CountsByWord eventCounts.dat | sort | java CollectRecords > words.dat
java requestWordCounts test.dat | cat - words.dat | sort | java answerWordCountRequests | cat - test.dat | sort | testNBUsingRequests

train.dat:
id1 w1,1 w1,2 w1,3 …. w1,k1
id2 w2,1 w2,2 w2,3 ….
id3 w3,1 w3,2 ….
id4 w4,1 w4,2 …
id5 w5,1 w5,2 …
...

counts.dat:
X=w1^Y=sports       5245
X=w1^Y=worldNews    1054
X=..                2120
X=w2^Y=…            373
X=…                 …

words.dat:
w          Counts associated with w
aardvark   C[w^Y=sports]=2
agent      C[w^Y=sports]=1027,C[w^Y=worldNews]=564
…          …
zynga      C[w^Y=sports]=21,C[w^Y=worldNews]=4464

Input to the first sort (the requests concatenated with words.dat) looks like this:
found ~ctr to id1
aardvark ~ctr to id2
…
today ~ctr to idi
…
w          Counts
aardvark   C[w^Y=sports]=2
agent      …
…
zynga      …

Output of the first sort looks like this:
w          Counts
aardvark   C[w^Y=sports]=2
aardvark   ~ctr to id1
agent      C[w^Y=sports]=…
agent      ~ctr to id345
agent      ~ctr to id9854
…          ~ctr to id345
agent      ~ctr to id34742

Output of answerWordCountRequests looks like this:
id1 ~ctr for aardvark is C[w^Y=sports]=2
…
id1 ~ctr for zynga is ….
…

test.dat:
id1 found an aardvark in zynga’s farmville today!
id2 …
id3 ….
id4 …
id5 …..

Input to testNBUsingRequests looks like this:
Key   Value
id1   found aardvark zynga farmville today
      ~ctr for aardvark is C[w^Y=sports]=2
      ~ctr for found is C[w^Y=sports]=1027,C[w^Y=worldNews]=564
      …
id2   w2,1 w2,2 w2,3 ….
      ~ctr for w2,1 is …
…     …
ABSTRACTIONS FOR MAP-REDUCE
Abstractions On Top Of Map-Reduce
• Some obvious streaming processes:
  – for each row in a table
    • Transform it and output the result
    • Decide if you want to keep it with some boolean test, and copy out only the ones that pass the test

Example: stem words in a stream of word-count pairs:
(“aardvarks”,1) → (“aardvark”,1)
Proposed syntax:
table2 = MAP table1 TO λ row : f(row)
f(row) → row’

Example: apply stop words
(“aardvark”,1) → (“aardvark”,1)
(“the”,1) → deleted
Proposed syntax:
table2 = FILTER table1 BY λ row : f(row)
f(row) → {true,false}
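As a rough illustration, MAP and FILTER correspond to simple streaming generators in Python. The helper names `map_table` and `filter_table`, the toy stemmer, and the stop-word list are all assumptions made for this sketch.

```python
def map_table(table, f):
    """table2 = MAP table1 TO f: transform every row."""
    return (f(row) for row in table)

def filter_table(table, keep):
    """table2 = FILTER table1 BY keep: copy out only rows that pass the test."""
    return (row for row in table if keep(row))

STOPWORDS = {"the", "an", "a"}  # illustrative stop-word list

def stem(row):
    """Toy stemmer (an assumption for illustration): strip one trailing 's'."""
    word, count = row
    return (word[:-1] if word.endswith("s") else word, count)

table1 = [("aardvarks", 1), ("the", 1), ("aardvark", 1)]
table2 = list(map_table(table1, stem))
table3 = list(filter_table(table2, lambda row: row[0] not in STOPWORDS))
```

Both operations look at one row at a time, so neither needs a sort or a reduce step.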
Abstractions On Top Of Map-Reduce
• A non-obvious? streaming process:
  – for each row in a table
    • Transform it to a list of items
    • Splice all the lists together to get the output table (flatten)

Example: tokenizing a line
“I found an aardvark” → [“i”,“found”,”an”,”aardvark”]
“We love zymurgy” → [“we”,”love”,”zymurgy”]
..but final table is one word per row
“i”
“found”
“an”
“aardvark”
“we”
“love”
…

Proposed syntax:
table2 = FLATMAP table1 TO λ row : f(row)
f(row) → list of rows
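FLATMAP can be sketched the same way: apply a function that returns a list for each row, then splice the lists together. The helper name `flatmap_table` and the whitespace tokenizer are assumptions for this illustration.

```python
from itertools import chain

def flatmap_table(table, f):
    """table2 = FLATMAP table1 TO f: f returns a list per row; splice the lists."""
    return chain.from_iterable(f(row) for row in table)

def tokenize(line):
    """Illustrative tokenizer: lowercase and split on whitespace."""
    return line.lower().split()

table1 = ["I found an aardvark", "We love zymurgy"]
table2 = list(flatmap_table(table1, tokenize))
```

The output has one word per row, with no trace of which line each word came from.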
NB Test Step

Event counts:
X=w1^Y=sports       5245
X=w1^Y=worldNews    1054
X=..                2120
X=w2^Y=…            373
X=…                 …

How:
• Stream and sort:
  – for each C[X=w^Y=y]=n
    • print “w C[Y=y]=n”
  – sort and build a list of values associated with each key w
Like an inverted index

w          Counts associated with w
aardvark   C[w^Y=sports]=2
agent      C[w^Y=sports]=1027,C[w^Y=worldNews]=564
…          …
zynga      C[w^Y=sports]=21,C[w^Y=worldNews]=4464

The general case:
We’re taking rows from a table
• In a particular format (event,count)
Applying a function to get a new value
• The word for the event
And grouping the rows of the table by this new value
→ Grouping operation
Special case of a map-reduce

Proposed syntax:
GROUP table BY λ row : f(row)
Could define f via: a function, a field of a defined record structure, …
f(row) → field
Aside: you guys know how to implement this, right?
1. Output pairs (f(row),row) with a map/streaming process
2. Sort pairs by key – which is f(row)
3. Reduce and aggregate by appending together all the values associated with the same key
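The three steps above can be sketched in Python as follows. The names `group_by` and `word_of` are hypothetical, and an in-memory sort stands in for the external sort.

```python
from itertools import groupby
from operator import itemgetter

def group_by(table, f):
    """GROUP table BY f, following the three steps."""
    pairs = [(f(row), row) for row in table]              # 1. output (f(row), row) pairs
    pairs.sort(key=itemgetter(0))                         # 2. sort pairs by key
    for key, group in groupby(pairs, key=itemgetter(0)):  # 3. append values per key
        yield key, [row for _, row in group]

def word_of(row):
    """The word for an (event, count) row, assuming events look like X=w^Y=y."""
    event, _count = row
    return event[len("X="):].split("^")[0]

event_counts = [
    ("X=agent^Y=sports", 1027),
    ("X=aardvark^Y=sports", 2),
    ("X=agent^Y=worldNews", 564),
]
grouped = list(group_by(event_counts, word_of))
```

Sorting on the key alone keeps rows with the same key in their original order, because Python's sort is stable.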
Abstractions On Top Of Map-Reduce
• And another example from the Naïve Bayes test program…
Request-and-answer
• Step 2: stream through and for each test case idi wi,1 wi,2 wi,3 …. wi,ki, request the event counters needed to classify idi from the event-count DB, then classify using the answers
• Break down into stages
  – Generate the data being requested (indexed by key, here a word)
    • E.g. with group … by
  – Generate the requests as (key, requestor) pairs
    • E.g. with flatmap … to
  – Join these two tables by key
    • Join: conceptually defined as (1) cross-product and (2) filter out pairs with different values for keys
    • Join: implemented by concatenating two different tables of key-value pairs, and reducing them together
  – Postprocess the joined result
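To make the two descriptions of a join concrete, the Python sketch below implements both and checks that they agree on a small example. The function names and the 0/1 source tags are assumptions for this illustration.

```python
from itertools import groupby, product

def join_by_definition(table1, table2):
    """(1) cross-product, (2) keep only pairs whose keys agree."""
    return [(k1, v1, v2)
            for (k1, v1), (k2, v2) in product(table1, table2) if k1 == k2]

def join_by_sorting(table1, table2):
    """Concatenate the two tables (tagged by source), sort, and reduce per key."""
    tagged = [(k, 0, v) for k, v in table1] + [(k, 1, v) for k, v in table2]
    tagged.sort()
    out = []
    for key, group in groupby(tagged, key=lambda t: t[0]):
        rows = list(group)  # holds one key's rows in memory
        left = [v for _, tag, v in rows if tag == 0]
        right = [v for _, tag, v in rows if tag == 1]
        out.extend((key, v1, v2) for v1 in left for v2 in right)
    return out

counters = [("aardvark", "C[w^Y=sports]=2"), ("zynga", "C[w^Y=sports]=21")]
requests = [("zynga", "id1"), ("found", "id1"), ("aardvark", "id1"), ("zynga", "id2")]

by_definition = join_by_definition(counters, requests)
by_sorting = join_by_sorting(counters, requests)
```

The cross-product version does work proportional to the product of the table sizes, while the sorted version only pairs up rows that share a key.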
w          Counters
aardvark   C[w^Y=sports]=2
agent      C[w^Y=sports]=1027,C[w^Y=worldNews]=564
…          …
zynga      C[w^Y=sports]=21,C[w^Y=worldNews]=4464

w          Request
found      ~ctr to id1
aardvark   ~ctr to id1
…
zynga      ~ctr to id1
…          ~ctr to id2

Joined by w:
w          Counters           Requests
aardvark   C[w^Y=sports]=2    ~ctr to id1
agent      C[w^Y=sports]=…    ~ctr to id345
agent      C[w^Y=sports]=…    ~ctr to id9854
agent      C[w^Y=sports]=…    ~ctr to id345
…          C[w^Y=sports]=…    ~ctr to id34742
zynga      C[…]               ~ctr to id1
zynga      C[…]               …

The request column can also hold just the requesting id (id1, id345, id9854, …) in place of “~ctr to id”.

Proposed syntax:
JOIN table1 BY λ row : f(row), table2 BY λ row : g(row)

Examples:
JOIN wordInDoc BY word, wordCounters BY word   --- if word(row) defined correctly
JOIN wordInDoc BY lambda (word,docid): word, wordCounters BY lambda (word,counters): word   – using python syntax for functions
Abstract Implementation: [TF]IDF (1/2)

data = pairs (docid, term) where term is a word that appears in the document with id docid
operators:
• DISTINCT, MAP, JOIN
• GROUP BY …. [RETAINING …] REDUCING TO a reduce step

docFreq = DISTINCT data
  | GROUP BY λ(docid,term):term REDUCING TO count   /* (term,df) */

docIds = MAP data BY λ(docid,term):docid | DISTINCT
numDocs = GROUP docIds BY λdocid:1 REDUCING TO count   /* (1,numDocs) */

dataPlusDF = JOIN data BY λ(docid,term):term, docFreq BY λ(term,df):term
  | MAP λ((docid,term),(term,df)):(docId,term,df)   /* (docId,term,document-freq) */

unnormalizedDocVecs = JOIN dataPlusDF BY λrow:1, numDocs BY λrow:1
  | MAP λ((docId,term,df),(dummy,numDocs)): (docId,term,log(numDocs/df))   /* (docId, term, weight-before-normalizing) : u */

Example tables:

docId   term
d123    found
d123    aardvark

key        value
found      (d123,found),(d134,found),…
aardvark   (d123,aardvark),…

key   value
1     12451

key        value
found      (d123,found),(d134,found),…   2456
aardvark   (d123,aardvark),…             7
45
question – how many reducers should I use here?
Abstract Implementation: TFIDF (2/2)

normalizers = GROUP unnormalizedDocVecs BY λ(docId,term,w):docid
  RETAINING λ(docId,term,w): w^2
  REDUCING TO sum   /* (docid,sum-of-square-weights) */

docVec = JOIN unnormalizedDocVecs BY λ(docId,term,w):docid, normalizers BY λ(docId,norm):docid
  | MAP λ((docId,term,w),(docId,norm)): (docId,term,w/sqrt(norm))   /* (docId, term, weight) */

Example tables:

docId   term       w
d1234   found      1.542
d1234   aardvark   13.23

key
d1234   (d1234,found,1.542), (d1234,aardvark,13.23),…   37.234
d3214   ….                                              29.654

docId   w
d1234   37.234
d1234   37.234
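For comparison, here is a compact in-memory Python sketch of the same dataflow, with comments naming the abstract operator each line stands in for. It is an illustration under stated assumptions rather than a map-reduce implementation: duplicate (docid, term) pairs are collapsed up front, and documents whose weights are all zero are dropped to avoid dividing by zero. Neither choice comes from the slides, and the function name `tfidf` is hypothetical.

```python
import math
from collections import Counter, defaultdict

def tfidf(data):
    """data: iterable of (doc_id, term) pairs.
    Returns {(doc_id, term): normalized IDF weight}."""
    distinct = set(data)                                  # DISTINCT data
    doc_freq = Counter(term for _, term in distinct)      # GROUP BY term REDUCING TO count
    num_docs = len({doc_id for doc_id, _ in distinct})    # docIds | DISTINCT, then count
    # JOIN with docFreq and numDocs, then MAP to the unnormalized weight u
    u = {(d, t): math.log(num_docs / doc_freq[t]) for d, t in distinct}
    # GROUP BY docId RETAINING w^2 REDUCING TO sum
    norm = defaultdict(float)
    for (d, _), w in u.items():
        norm[d] += w * w
    # JOIN with normalizers, then MAP to w / sqrt(norm)
    return {(d, t): w / math.sqrt(norm[d])
            for (d, t), w in u.items() if norm[d] > 0}

data = [("d1", "found"), ("d1", "aardvark"), ("d2", "found"),
        ("d3", "found"), ("d3", "zynga")]
weights = tfidf(data)
```

In this toy corpus "found" appears in every document, so its weight is log(3/3) = 0, and document d2 (which contains only "found") drops out.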
Two ways to join
• Reduce-side join
• Map-side join
Two ways to join
• Reduce-side join for A,B

A:
term       df
found      2456
aardvark   7
…

B:
term       docId
aardvark   d15
…          ...
found      d7
found      d23
found      …
…

Concat and sort, then do the join:
A               B
(aardvark, 7)   (aardvark, d15)
…               ...
(found,2456)    (found,d7)
(found,2456)    (found,d23)
…               …

Before the concat and sort, tag each tuple with the relation it came from:
term       tag   df
found      A     2456
aardvark   A     7
…

term       tag   docId
aardvark   B     d15
…                ...
found      B     d7
found      B     d23
found      B     …
…

• tricky bit: need to sort by the first two values (aardvark, AB) – we want the DF’s to come first
• but all tuples with key “aardvark” should go to the same worker
• custom sort (secondary sort key): Writable with your own Comparator
• custom Partitioner (specified for job like the Mapper, Reducer, ..)
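A small Python simulation can illustrate why both pieces are needed. This is a sketch, not Hadoop code: `partition` is a toy stand-in for a custom Partitioner that looks at the term only, and the sort key (term, tag) plays the role of the secondary sort. All names are hypothetical.

```python
from itertools import groupby

def partition(term, num_reducers):
    """Toy partitioner: route on the term only, ignoring the A/B tag."""
    return sum(term.encode()) % num_reducers  # deterministic toy hash

def reduce_side_join(a_rows, b_rows, num_reducers=2):
    # Map: tag every tuple with the relation it came from.
    tagged = ([(term, "A", df) for term, df in a_rows] +
              [(term, "B", doc_id) for term, doc_id in b_rows])
    # Shuffle: all tuples for a term go to the same reducer.
    shards = [[] for _ in range(num_reducers)]
    for row in tagged:
        shards[partition(row[0], num_reducers)].append(row)
    joined = []
    for shard in shards:
        # Secondary sort on (term, tag): the A tuple precedes the B tuples.
        shard.sort(key=lambda r: (r[0], r[1]))
        for term, group in groupby(shard, key=lambda r: r[0]):
            df = None
            for _, tag, value in group:
                if tag == "A":
                    df = value                         # remember the DF
                elif df is not None:
                    joined.append((term, df, value))   # stream out the join
    return joined

A = [("found", 2456), ("aardvark", 7)]
B = [("aardvark", "d15"), ("found", "d7"), ("found", "d23"), ("zymurgy", "d9")]
joined = reduce_side_join(A, B)
```

Because the DF arrives first, each reducer holds only one value per term in memory while it streams through the B tuples. A term with no A tuple ("zymurgy" here) produces no output, which makes this an inner join.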
Two ways to join
• Map-side join
  – write the smaller relation out to disk
  – send it to each Map worker
    • DistributedCache
  – when you initialize each Mapper, load in the small relation
    • Configure(…) is called at initialization time
  – map through the larger relation and do the join
  – faster but requires one relation to go in memory
Two ways to join
• Map-side join for A (small) and B (large)

A (duplicate and load into every mapper):
term       df
found      2456
aardvark   7
…

B, processed in pieces B1, B2:
B1:
term       docId
aardvark   d15
…          ...
B2:
found      d7
found      d23
found      …
…

Join as you map:
A               B
(aardvark, 7)   (aardvark, d15)
…               ...
(found,2456)    (found,d7)
(found,2456)    (found,d23)
…               …
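A map-side join can be sketched in Python as a dictionary lookup inside the map function. The names `load_small_relation` and `map_side_join` are hypothetical, and the sketch assumes each term appears at most once in the small relation A.

```python
def load_small_relation(rows):
    """Run once per mapper at initialization: build an in-memory lookup table.
    Assumes each term appears at most once in the small relation."""
    return dict(rows)

def map_side_join(small, large_split):
    """Map over one split of the large relation, joining as we go."""
    for term, doc_id in large_split:
        if term in small:
            yield (term, small[term], doc_id)

A = [("found", 2456), ("aardvark", 7)]
B1 = [("aardvark", "d15")]
B2 = [("found", "d7"), ("found", "d23")]

small = load_small_relation(A)  # every mapper loads its own copy of A
joined = [row for split in (B1, B2) for row in map_side_join(small, split)]
```

No sort or shuffle is involved, which is why this is faster, but the whole of A must fit in each mapper's memory.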
PIG: A WORKFLOW LANGUAGE
PIG: word count example
PIG program is a bunch of assignments where every LHS is a relation. No loops, conditionals, etc. allowed.
Tokenize – built-in function
Flatten – special keyword, which applies to the next step in the process – so output is a stream of words w/o document boundaries
Built-in regex matching ….
Group by … foreach generate count(…) will be optimized into a single map-reduce
Group produces a stream of bags of identical words… bags, tuples, dictionaries are primitive types