Page 1: HJ-Hadoop: An Optimized MapReduce Runtime for Multi-core Systems

Yunming Zhang
Advised by: Prof. Alan Cox and Vivek Sarkar
Rice University
ACM Student Research Competition, SPLASH '13

Page 2: Hadoop MapReduce Runtime

[Figure 1. MapReduce programming model: Job Starts, followed by Map tasks (Map, Map, ...), followed by Reduce tasks, followed by Job Ends.]

Page 3: Hadoop MapReduce

•  Open source implementation of the MapReduce runtime system
   – Scalable
   – Reliable
   – Available

•  Popular platform for big data analytics

Page 4: Kmeans

[Figure. Labels: "To be Classified Documents", "Cluster Centroids", "Topics".]

[Figure residue, Join Table. Labels: input slices 1 to n of <Key, Val> pairs; "Map: Look up a Key in Lookup Table"; Machine 1, Machine 2, further machines, each split into computation and memory; each map task in a JVM holds its own 1x copy of the full lookup table ("Duplicated Tables"); Reducer.]

Kmeans is an application that takes a large number of documents as input and tries to classify them into different topics.
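The classification step described above can be sketched as a nearest-centroid lookup. This is a minimal illustration, not the benchmark's code: it assumes documents and topics are already numeric vectors and that Euclidean distance is the similarity measure, neither of which the slides specify. The names `nearest_topic` and `classify` are hypothetical.

```python
import math


def nearest_topic(document, centroids):
    """Return the index of the centroid closest to the document vector."""
    best_index, best_distance = -1, math.inf
    for index, centroid in enumerate(centroids):
        # Assumption: Euclidean distance; the slides do not name a metric.
        distance = math.dist(document, centroid)
        if distance < best_distance:
            best_index, best_distance = index, distance
    return best_index


def classify(documents, centroids):
    """Assign every document to its nearest topic (cluster centroid)."""
    return [nearest_topic(doc, centroids) for doc in documents]


# Toy data: two topics and three documents in a 2-dimensional feature space.
centroids = [(0.0, 0.0), (10.0, 10.0)]
documents = [(1.0, 2.0), (9.0, 8.5), (0.5, 0.1)]
labels = classify(documents, centroids)  # [0, 1, 0]
```

Every document is compared against every centroid, which is why each worker needs the whole centroid table in memory.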

Page 5-9: Kmeans using Hadoop

[Figure, built up over five slides. Machine 1 is split into computation and memory; the documents to be classified go to map tasks, each running in a JVM. Memory shows four copies of the cluster centroids (topics), each labeled 1x, under the caption "Duplicated In-memory Cluster Centroids". Further machines ("Machines ...") repeat the layout.]
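The cost of the duplication in the diagram can be put into numbers with a small calculation. This is a sketch under stated assumptions: the function name `centroid_memory_mb` and the 80 MB example size are illustrative, and it counts only the centroid tables, ignoring all other JVM overhead.

```python
def centroid_memory_mb(centroid_mb, map_tasks_per_machine, shared=False):
    """Memory one machine spends on centroid tables, in MB.

    With one JVM per map task, every task holds its own copy.
    With a shared table, the machine holds a single copy.
    """
    copies = 1 if shared else map_tasks_per_machine
    return centroid_mb * copies


# Hypothetical example: an 80 MB centroid table and 4 map tasks per machine.
duplicated = centroid_memory_mb(80, 4)            # 320 MB across four JVMs
single = centroid_memory_mb(80, 4, shared=True)   # 80 MB in one shared copy
```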

Page 10: Memory Wall

[Figure: KMeans Throughput Benchmark. x-axis: topics data size (MB) with 4 KB/topic, 0 to 400; y-axis: number of topics / time in min, 0 to 200; series: Hadoop.]

We used 8 mappers for 30-80 MB, 4 mappers for 100-150 MB, and 2 mappers for 180-380 MB for sequential Hadoop.

Cluster size (MB)    Hadoop full GC calls
30                   3,542
50                   4,390
70                   5,186
80                   1,108,888
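The x-axis label gives 4 KB per topic, so each data size on the plot corresponds to a topic count. The conversion below is a worked example only; it assumes 1 MB = 1024 KB, which the slides do not state, and `topics_for_size` is a hypothetical helper name.

```python
TOPIC_SIZE_KB = 4  # from the axis label "4KB/topic"


def topics_for_size(size_mb):
    """Number of topics that fit in size_mb, assuming 1 MB = 1024 KB."""
    return size_mb * 1024 // TOPIC_SIZE_KB


# Sizes taken from the GC table and the upper end of the mapper note.
topic_counts = {size: topics_for_size(size) for size in (30, 50, 70, 80, 380)}
# 30 MB is 7,680 topics; 80 MB is 20,480 topics; 380 MB is 97,280 topics.
```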

Page 11: Memory Wall

•  Hadoop's approach to the problem
   – Increase the memory available to each map task JVM by reducing the number of map tasks assigned to each machine.
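The trade-off in this approach can be sketched as a simple capacity calculation: the more heap each map task needs, the fewer tasks fit on a machine, and cores go idle. The memory budget, heap sizes, and core count below are hypothetical values chosen for illustration, not measurements from the talk, and `max_map_tasks` is an illustrative name.

```python
def max_map_tasks(memory_budget_mb, per_task_heap_mb, cores):
    """Largest number of map task JVMs that fit in memory, capped by core count."""
    if per_task_heap_mb <= 0:
        raise ValueError("per_task_heap_mb must be positive")
    return min(cores, memory_budget_mb // per_task_heap_mb)


# Hypothetical machine: 8 cores and 8192 MB available for map task heaps.
small_heap = max_map_tasks(8192, 1024, cores=8)   # 8 tasks, all cores busy
medium_heap = max_map_tasks(8192, 2048, cores=8)  # 4 tasks, half the cores idle
large_heap = max_map_tasks(8192, 4096, cores=8)   # 2 tasks, six cores idle
```

The three results follow the same 8, 4, 2 progression as the mapper counts reported for the sequential Hadoop runs, though the actual heap settings used there are not given.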

Page 12: Kmeans using Hadoop

[Figure: the "Kmeans using Hadoop" diagram (Machine 1 split into computation and memory, documents to be classified, map task in a JVM, further machines) with two in-memory copies labeled "Topics 2x" in place of the four copies labeled 1x.]

Page 13: Memory Wall

[Figure: KMeans Throughput Benchmark. x-axis: topics data size (MB) with 4 KB/topic, 0 to 400; y-axis: number of topics / time in min, 0 to 200; series: Hadoop. Annotation: decreased throughput due to the reduced number of map tasks per machine.]

We used 8 mappers for 30-80 MB, 4 mappers for 100-150 MB, and 2 mappers for 180-380 MB for sequential Hadoop.

Page 14-15: HJ-Hadoop

[Figure, built up over two slides. Machine 1 is split into computation and memory; the documents to be classified go to a map task in a JVM, with the label "Dynamic chunking". Memory shows a single copy of the cluster centroids (topics), labeled 4x, under the caption "No Duplicated In-memory Cluster Centroids". Further machines ("Machines ...") repeat the layout.]
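The idea in the diagram, one centroid table shared by several workers inside a single map task, can be sketched with a thread pool. This is an illustration in Python rather than Habanero Java, and it makes two assumptions the slides do not spell out: that "dynamic chunking" means splitting a task's input into chunks that idle workers pick up, and that the centroid table is read-only during classification. All function names are hypothetical.

```python
import math
from concurrent.futures import ThreadPoolExecutor


def nearest_topic(document, centroids):
    """Index of the centroid closest to the document (Euclidean distance assumed)."""
    return min(range(len(centroids)),
               key=lambda i: math.dist(document, centroids[i]))


def chunks(items, chunk_size):
    """Yield consecutive slices of at most chunk_size items."""
    for start in range(0, len(items), chunk_size):
        yield items[start:start + chunk_size]


def classify_shared(documents, centroids, workers=4, chunk_size=2):
    """Classify documents with worker threads that all read one centroid table."""
    def classify_chunk(chunk):
        # Every worker reads the same `centroids` object; nothing is copied.
        return [nearest_topic(doc, centroids) for doc in chunk]

    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Chunks are queued up front and each idle worker takes the next one;
        # map() returns the per-chunk results in input order.
        results = pool.map(classify_chunk, chunks(documents, chunk_size))
        return [label for chunk_labels in results for label in chunk_labels]


centroids = [(0.0, 0.0), (10.0, 10.0)]
documents = [(1.0, 2.0), (9.0, 8.5), (0.5, 0.1), (8.0, 9.0), (2.0, 1.0)]
labels = classify_shared(documents, centroids, workers=2, chunk_size=2)
```

In CPython the global interpreter lock prevents these threads from running the distance computation in parallel, so the sketch shows only the sharing structure; the speedups reported in the talk come from a JVM runtime with real multi-core threads.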

Page 16: Habanero Java (HJ)

•  Programming language and runtime developed at Rice University

•  Optimized for multi-core systems
   – Lightweight async tasks
   – Work-sharing runtime
   – Dynamic task parallelism
   – http://habanero.rice.edu
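A work-sharing runtime can be pictured as one shared task queue that all worker threads pull from. The sketch below is a generic Python illustration of that pattern, not the Habanero Java runtime or its API: the class `WorkSharingPool` and its methods `async_` and `finish` are names invented here, chosen only to echo the "async task" wording on the slide.

```python
import queue
import threading


class WorkSharingPool:
    """Workers share a single queue; any idle worker takes the next task."""

    def __init__(self, workers):
        self._tasks = queue.Queue()
        for _ in range(workers):
            # Daemon threads so the sketch does not block interpreter exit.
            threading.Thread(target=self._run, daemon=True).start()

    def _run(self):
        while True:
            task = self._tasks.get()
            try:
                task()
            finally:
                # Mark the task finished even if it raised.
                self._tasks.task_done()

    def async_(self, fn, *args):
        """Spawn a lightweight task; it may itself spawn further tasks."""
        self._tasks.put(lambda: fn(*args))

    def finish(self):
        """Block until every task spawned so far has completed."""
        self._tasks.join()


results = []
results_lock = threading.Lock()


def record_square(n):
    with results_lock:
        results.append(n * n)


pool = WorkSharingPool(workers=4)
for n in range(5):
    pool.async_(record_square, n)
pool.finish()
squares = sorted(results)  # [0, 1, 4, 9, 16]
```

Because tasks are created dynamically and claimed by whichever worker is free, the amount of parallelism adapts to the work instead of being fixed by a static partition.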

Page 17-19: Results

[Figure, built up over three slides: KMeans Throughput Benchmark. x-axis: topics data size (MB) with 4 KB/topic, 0 to 400; y-axis: number of topics / time in min, 0 to 200; series: HJ-Hadoop, Hadoop. Annotations: "Process 5x topics efficiently" (page 18); "4x throughput improvement" (page 19).]

We used 2 mappers for HJ-Hadoop.

Page 20: K Nearest Neighbor Join

[Figure: K Nearest Neighbor Join. x-axis: input document size (MB), 0 to 250; y-axis: documents processed / min, 0 to 250; series: Hadoop, HJ-Hadoop.]

Page 21: Conclusions

•  Our goal is to tackle the memory inefficiency of MapReduce applications on multi-core systems by integrating a shared-memory parallel model into the Hadoop MapReduce runtime.
   – HJ-Hadoop can solve larger problems more efficiently than Hadoop: it can process 5x more data at the full throughput of the system.
   – HJ-Hadoop can deliver 4x the throughput of Hadoop when mappers process large in-memory data sets.

