
MapReduce on the Cell Broadband Engine Architecture

Marc de Kruijf

Overview

- Motivation
- MapReduce
- Cell BE Architecture
- Design
- Performance Analysis
- Implementation Status
- Future Work

What is MapReduce?

- A parallel programming model for large-scale data processing
- Simple, abstract interface
  - The runtime handles all synchronization, communication, and scheduling.
- Implementations exist for:
  - Large distributed clusters, as in Google's original MapReduce
  - Shared-memory multiprocessors, as in Stanford's Phoenix
- The Cell processor is neither of these…

What is the Cell then?

- The Cell is:
  - "A single-chip multiprocessor with nine processors operating on a shared, coherent memory"
- So what's the difference?
  - From a programming perspective, the Cell is much more like a "cluster-on-a-chip"
- That sounds hard…
  - It's not easy.

Motivation

- Programming the Cell is hard… yet…
- Distributed (shared?) memory single-chip multiprocessors are the way of the future. (Opinion.)
- Corollary: shared memory is out; message passing is in. (More opinion.)
- What's missing are the right runtime and abstraction layers to enable the scalability potential of these and future systems.
- MapReduce is just such an abstraction.


MapReduce Refresher

MapReduce on Cell: Design

- Execution variants:
  - MapReduce, sorted
    - Phases 1-5
  - MapReduce, no sort
    - Phases 1-4
  - Map only, sorted
    - Phases 1, 2, and 5
  - Map only, no sort
    - Phase 1 (no hash)
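The variant-to-phase mapping above can be written down as a small lookup. The enum names and the bitmask encoding (bit k set means phase k runs) are assumptions for illustration; the slides do not name the individual phases:

```c
/* The four execution variants, encoded as bitmasks over phases 1-5.
 * Enum names and the encoding are illustrative assumptions. */
enum variant {
    MAPREDUCE_SORTED,   /* phases 1-5 */
    MAPREDUCE_NO_SORT,  /* phases 1-4 */
    MAP_ONLY_SORTED,    /* phases 1, 2, and 5 */
    MAP_ONLY_NO_SORT    /* phase 1 only (no hash) */
};

static unsigned phases_for(enum variant v)
{
    switch (v) {
    case MAPREDUCE_SORTED:  return (1u<<1)|(1u<<2)|(1u<<3)|(1u<<4)|(1u<<5);
    case MAPREDUCE_NO_SORT: return (1u<<1)|(1u<<2)|(1u<<3)|(1u<<4);
    case MAP_ONLY_SORTED:   return (1u<<1)|(1u<<2)|(1u<<5);
    case MAP_ONLY_NO_SORT:  return (1u<<1);
    }
    return 0;
}
```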


Design Highlights

- Output is pre-allocated.
  - Enhances performance as well as output locality.
- A work queue allows dynamic scheduling of tasks among SPEs for load balancing.
  - Adding priorities allows pipelining of computation to maximize resource utilization.
- Outside of DMA transfers, there is no data copying.
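The prioritized work queue described above can be sketched as follows. All names and the fixed-size array are assumptions for illustration; a real Cell runtime would additionally need atomic or mutex protection, since multiple SPEs pull from the queue concurrently:

```c
/* Minimal single-threaded sketch of a priority work queue: the PPE
 * pushes tasks, and each pop returns the highest-priority pending
 * task. QCAP, struct names, and the linear-scan pop are all
 * illustrative choices, not the runtime's actual design. */
#define QCAP 64

struct task { int priority; void *arg; };

struct work_queue { struct task items[QCAP]; int count; };

static int wq_push(struct work_queue *q, struct task t)
{
    if (q->count == QCAP)
        return -1;                       /* queue full */
    q->items[q->count++] = t;
    return 0;
}

static int wq_pop(struct work_queue *q, struct task *out)
{
    if (q->count == 0)
        return -1;                       /* queue empty */
    int best = 0;                        /* find highest priority */
    for (int i = 1; i < q->count; i++)
        if (q->items[i].priority > q->items[best].priority)
            best = i;
    *out = q->items[best];
    q->items[best] = q->items[--q->count];  /* swap-remove */
    return 0;
}
```

Giving later pipeline stages higher priority lets them drain ahead of new map work, which is one way to realize the pipelining mentioned above.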


Performance Analysis

- Assumptions:
  - While there is scheduled work for the SPEs, the SPEs are always the bottleneck.
- Execution time:
  - Fixed runtime startup cost = unknown (can be amortized)
  - + Map execution time = (# map keys * map execution time)
  - + Sort time = n log(n)/2, where n = (# intermediate keys / hash range)
  - + Reduce execution time = (# reduce keys * reduce execution time)
  - + Sort time = n log(n)/2, where n = (# reduce keys / output buffer size)
- Observations:
  - Buffer management and partitioning is a necessary part of programming the SPEs and is not considered overhead.
  - Sorting is the dominant overhead… let's examine it further.

Performance Analysis

- Sorting
  - Q: I hear the SPEs are not good at control tasks, but sorting is a control task, isn't it?
  - A: You are right; let's analyze the sorting performance of the SPEs vs. a typical superscalar.
- Assumptions
  - Assume our key comparison function is string compare.
  - Furthermore, assume that our input strings are uniformly distributed.

    /* strcmp-style comparator */
    int keyCompare(const void *one, const void *two)
    {
        const char *a1 = (const char *)one;
        const char *a2 = (const char *)two;
        int i;

        /* scan while the characters match and the string has not
           ended; the hint tells the compiler the loop usually
           continues, which matters on the branch-hint-driven SPE */
        for (i = 0;
             __builtin_expect(a1[i] == a2[i] && a1[i] != 0, 1);
             i++)
            ;

        if (__builtin_expect(a1[i] == a2[i], 0))
            return 0;           /* both hit '\0' together: equal */
        else if (a1[i] > a2[i])
            return 1;
        else
            return -1;
    }
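Since keyCompare has a qsort-compatible signature, it can drive the standard library sort directly. Note that it compares the element bytes themselves as a string, so it pairs with an array of inline fixed-width key buffers, not an array of char* pointers. For a self-contained sketch the comparator is restated here via library strcmp; the 8-byte key width is an arbitrary choice for illustration:

```c
#include <stdlib.h>
#include <string.h>

/* Comparator restated with library strcmp; each element is an
 * inline character buffer, so the void* arguments point straight
 * at the key bytes. */
static int key_compare(const void *one, const void *two)
{
    return strcmp((const char *)one, (const char *)two);
}

/* sort n fixed-width (8-byte) keys in place */
static void sort_keys(char keys[][8], size_t n)
{
    qsort(keys, n, sizeof keys[0], key_compare);
}
```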

- SPE strcmp() average execution time per comparison: 36 cycles
  - Assuming 3.2 GHz: 11.25 ns / comparison
- x86 strcmp() average instructions per comparison: 28 instructions
  - Assuming an IPC of 1.5 @ 3.2 GHz: 5.8 ns / comparison
- Bottom line: we have 8 SPEs against 2 (maybe 4) x86 cores:
  - 11.25 / 8 = 1.406 ns/comparison (Cell)
  - 5.8 / 4 = 1.45 ns/comparison (Kentsfield)
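The back-of-envelope arithmetic above is easy to reproduce. The 36-cycle SPE figure and the 28-instruction / 1.5-IPC x86 figures come from the slide; the function names, and the assumption of perfect per-core scaling, are ours:

```c
/* Reproduce the slide's per-comparison throughput estimates.
 * At f GHz, one cycle takes 1/f ns, so cycles / GHz gives ns. */
static double ns_per_comparison(double cycles, double ghz)
{
    return cycles / ghz;
}

/* Divide by core count, assuming perfectly parallel sorting work. */
static double chip_ns_per_comparison(double per_core_ns, int cores)
{
    return per_core_ns / cores;
}
```

Per core the x86 wins by roughly 2x, but eight SPEs against four x86 cores brings the two chips to near parity on this workload.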


Implementation Status

- A very simple test application runs to completion.
  - A larger test program does not return. Debugging with the simulator is erratic and frequently ends with a timeout.
  - We overestimated the capabilities of the simulator.
  - Cell Blade time is needed, but came too late to bear fruit.
- ~2500 lines of code vs. ~1200 for Phoenix (the shared-memory implementation).
  - 21 threads total, though synchronization is not too difficult.
  - Buffer management is the hard part…

Future Work

- Wrap up the implementation.
- Add performance counters to quantify overhead.
- Perform application-agnostic runtime analysis.
  - Combine with static analysis to determine performance bottlenecks.
  - Determine whether the hash should run on the PPE or the SPEs.
- Perform application-specific runtime analysis.
- What happened to Marching Cubes?
  - Actually mostly done, but no opportunity to test.