Improving Node-level MapReduce Performance using Processing-in-Memory Technologies
Mahzabeen Islam, Marko Scrbak and Krishna M. Kavi Computer Systems Research Laboratory
Department of Computer Science & Engineering University of North Texas, USA
Mike Ignatowski and Nuwan Jayasena
AMD Research - Advanced Micro Devices, Inc., USA
Overview
• Introduction
• Motivation
• Proposed Model
Ø Server Architecture
Ø Programming Framework
• Experiments
• Results
• Conclusion and Future Work
• Related Work
• References
3 Pugsley, S. H., Jestes, J., Zhang, H.: NDC: Analyzing the Impact of 3D-Stacked Memory+Logic Devices on MapReduce Workloads. In: ISPASS (2014)
4 Zhang, D., Jayasena, N., Lyashevsky, A., et al.: TOP-PIM: throughput-oriented programmable processing in memory. In: HPDC (2014)
Introduction
• 3D-stacked DRAM consists of DRAM dies stacked on top of a logic die, and
• provides higher memory bandwidth,
• lower access latency, and
• lower energy consumption than existing DRAM technologies
Ø Hybrid Memory Cube (HMC): capacity 2-4 GB, bandwidth 160 GB/s (15× DDR3), 70% less energy per bit 1
• The bottom logic die contains peripheral circuitry (row decoders, sense amplifiers, etc.), but there is still enough silicon for other logic
• 3D-DRAM can be used as a large last-level cache, as main memory, or as a buffer to PCM
• SRAM can be integrated in the logic layer to aid address translation (hardware page tables)
• A recent trend is to put processing capabilities in the logic layer
Processing in Memory
• Processing-In-Memory (PIM) is the concept of moving computation closer to memory
• Advantages:
Ø Low access latency, high memory bandwidth, and a high degree of parallelism can be achieved by adding simple processing cores in memory
Ø Cache pollution is minimized by not transferring some data to the main cores
Ø Data-intensive/memory-bound applications, which do not benefit from conventional cache hierarchies, could benefit from PIM
• Concerns:
Ø Designing an appropriate system architecture
§ Too many design choices: main processor, PIM processors, memory hierarchy, communication channels, interfaces
Ø Requires changes to the operating system (memory management), the programming framework (e.g., the MapReduce library), and programming models (synchronization, coherence)
Our Work
• 3D-stacked DRAM has generated renewed interest in PIM
• We can use several low-power cores in the logic layer of a 3D-DRAM to execute memory-bound functions closer to memory
• Our current research focuses on Big Data analyses based on the MapReduce programming model
Ø Map functions are good candidates for executing on PIM processors
Ø We propose and evaluate a server architecture here
Ø MapReduce is modified for shared-memory processors
§ We plan to investigate using PIM for other parts of MapReduce applications
§ And for other classes of applications (scale-out applications)
§ Contemporary research shows that emerging scale-out applications do not benefit from conventional processor architectures and cache hierarchies 2
Proposed Server Architecture
Fig.: Proposed server architecture. The host connects to each 3DMU through an abstract load/store interface; PIM and DRAM controllers in the logic die access the memory dies over a timing-specific DRAM interface.
• Host processor connected to multiple 3D Memory Units (3DMUs)
• PIM cores reside in the logic layer of each 3DMU
• Simple, in-order, single-issue, energy-efficient PIM cores with only L1 caches
• Processes running on the host control the execution of PIM threads
• Unified memory view, as proposed by the Heterogeneous System Architecture (HSA) Foundation
• A number of such nodes will make up a cluster
Proposed MapReduce Framework
• Adapt MapReduce frameworks for shared-memory systems that exhibit NUMA
Ø We chose Phoenix++, which works with CMP and SMP systems
Ø We needed to modify Phoenix++ for our purpose
• Map phase: overlapped with input reading (the host reads from files while map tasks run on PIM cores)
• Reduce phase: special data structures (2D hash tables) allow local reduction in the 3DMUs, minimizing the amount of data transferred during the final reduction
• Merge phase: the initial stages can be performed by PIM cores, and the rest by the host processor
• Here we emphasize single-node (intra-node) MapReduce operation, and assume a global (inter-node) level of MapReduce operation will take place if we need a cluster of such nodes
Fig.: Proposed framework: a master process feeds input to four manager processes (0-3), all running on the host processor; each manager process controls the PIM threads running in its 3DMU.
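The manager/PIM-thread organization above can be illustrated with a minimal Python simulation. This is not the authors' implementation: the function names, split granularity, and the use of host threads to stand in for PIM threads are all illustrative. It shows the key idea of the framework, i.e. the host distributes input splits round-robin across the 3DMUs while map tasks and local reduction run concurrently, and the host performs the final merge.

```python
# Hypothetical sketch of the proposed node-level MapReduce flow.
# Host threads stand in for PIM threads; word count is the example workload.
import queue
import threading
from collections import Counter

NUM_3DMUS = 4
PIM_THREADS_PER_3DMU = 2  # 16 in the proposed design; 2 keeps the sketch small

def map_split(split):
    """Map task: emit (word, count) pairs for one input split."""
    return Counter(split.split())

def run_node_mapreduce(splits):
    queues = [queue.Queue() for _ in range(NUM_3DMUS)]          # per-3DMU work queues
    local_results = [Counter() for _ in range(NUM_3DMUS)]       # per-3DMU local reduction
    locks = [threading.Lock() for _ in range(NUM_3DMUS)]

    def pim_worker(mu):
        while True:
            split = queues[mu].get()
            if split is None:            # sentinel: no more splits for this 3DMU
                return
            partial = map_split(split)
            with locks[mu]:              # local (intra-3DMU) reduction
                local_results[mu].update(partial)

    workers = [threading.Thread(target=pim_worker, args=(mu,))
               for mu in range(NUM_3DMUS) for _ in range(PIM_THREADS_PER_3DMU)]
    for w in workers:
        w.start()

    # Host: "read" input splits and feed the 3DMUs round-robin,
    # overlapping input reading with map execution.
    for i, split in enumerate(splits):
        queues[i % NUM_3DMUS].put(split)
    for mu in range(NUM_3DMUS):
        for _ in range(PIM_THREADS_PER_3DMU):
            queues[mu].put(None)
    for w in workers:
        w.join()

    # Host: final merge of the per-3DMU local reductions.
    final = Counter()
    for partial in local_results:
        final.update(partial)
    return final
```

For example, `run_node_mapreduce(["a b a", "b c"])` yields the merged counts `{'a': 2, 'b': 2, 'c': 1}`; the local per-3DMU reduction is what keeps the data moved during the final merge small.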
Experiment Setup
Fig.: Baseline system: two Xeon E5 processors (P0, P1) connected by QPI, each with local memory (MEM0, MEM1).
Table 1: Baseline System Configuration
CPU: 2 × Xeon E5-2640; 6 cores per processor, 2 threads/core; out-of-order, 4-wide issue
Clock speed: 2.5 GHz
L3 cache: 15 MB per processor
Power: TDP = 95 W per processor; low-power = 15 W per processor
Memory BW: 42.6 GB/s per processor
Memory: 32 GB (8 × 4 GiB DDR3 DIMMs), NUMA enabled
Table 2: New System Configuration
Processing unit: host = 1 × Xeon E5-2640 (6 cores, 2 threads/core, out-of-order, 4-wide issue); PIM = 64 (4 × 16) ARM Cortex-A5 (in-order, single-issue)
Clock speed: host = 2.5 GHz; PIM = 1 GHz
Last-level cache: host = 15 MB; PIM = 32 KB I and 32 KB D per core
Power: host TDP = 95 W; PIM = 80 mW/core (5.12 W for 64 cores)
Memory BW: host = 42.6 GB/s; PIM = 1.33 GB/s per core
Memory: 32 GB (4 × 8 GiB 3DMUs)
• Baseline vs. New System Configuration
Experiments and Analysis
• Our assumption is that we can overlap the reading of input data with the execution of map tasks
• Input reading is performed by the host CPU, and the map tasks by the PIM cores
Ø We do not want the PIM cores to sit idle
Ø We estimate the number of PIM cores needed
Fig.: Timeline (in ms) of the host reading input splits into the 3DMUs and of PIM core activity in 3DMU0-3DMU3: (a) PIM cores are mostly idle; (b) PIM core utilization is high.
Experiments and Analysis
How many PIM cores per 3DMU do we need?
• Let t_read be the time taken by the host to read one input split, and t_map the time taken by the host to complete the map function on one input split
• s is the factor that indicates the relative slowdown caused by the simple PIM cores compared to the host, so a PIM core takes s · t_map to process one split
• There are 4 3DMUs, and each contains n PIM cores
• The time taken by a PIM core to process an input split should be smaller than the time taken by the host to read one input split for each of the 4n cores: s · t_map ≤ 4n · t_read
Experiments and Analysis
• We ran different workloads on the baseline system with Phoenix++ and measured t_map for each (Table 3)
• We used two different storage technologies, HDD and SSD, on the baseline system to measure t_read (Table 4)
Table 3: Measured t_map per input split
word count: 25 ms
histogram: 7 ms
string match: 12 ms
linear regression: 7 ms
Table 4: Measured t_read per input split
HDD: 10.42 ms
SSD: 2.17 ms
Experiments and Analysis
• To achieve full utilization of the PIM cores, the following must hold: 4n · t_read ≥ s · t_map, i.e., n ≥ (s · t_map) / (4 · t_read)
• We estimated how many cores (n) we need to overlap map tasks with input reading, for different storage technologies and slowdown factors
• For an estimated slowdown factor of 4 for the PIM cores, we need fewer than 16 cores per 3DMU, but we use 16 cores to handle stragglers
Fig.: Required number of PIM cores for different slowdown factors
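The core-count estimate can be reproduced with a few lines, plugging the measured t_map (Table 3) and t_read (Table 4) values into n ≥ s · t_map / (4 · t_read):

```python
# Back-of-the-envelope estimate of PIM cores needed per 3DMU so that
# map execution (slowed down by factor s) fully overlaps input reading.
import math

T_MAP_MS = {"word count": 25, "histogram": 7, "string match": 12, "linear regression": 7}
T_READ_MS = {"HDD": 10.42, "SSD": 2.17}
NUM_3DMUS = 4

def cores_needed(t_map_ms, t_read_ms, slowdown):
    """Smallest n per 3DMU satisfying s * t_map <= 4n * t_read."""
    return math.ceil(slowdown * t_map_ms / (NUM_3DMUS * t_read_ms))

for storage, t_read in T_READ_MS.items():
    worst = max(cores_needed(t, t_read, slowdown=4) for t in T_MAP_MS.values())
    print(f"{storage}: up to {worst} cores per 3DMU at slowdown 4")
# HDD: up to 3 cores per 3DMU at slowdown 4
# SSD: up to 12 cores per 3DMU at slowdown 4
```

The worst case (word count over SSD) needs 12 cores per 3DMU, consistent with the "fewer than 16" estimate; provisioning 16 leaves headroom for stragglers.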
Feasibility Analysis
• Embedding 16 PIM cores in the logic layer of a 3DMU is possible
Ø We use the ARM Cortex-A5 as the PIM core to estimate the silicon area needed
§ 40 nm process technology
§ 1 GHz clock
§ 32 KB D and 32 KB I L1 caches
• Area?
Ø Each PIM core needs an area of 0.80 mm²
Ø The area overhead for 16 PIM cores is only ~12% of the logic die 3
• Power budget?
Ø Each PIM core has an average power consumption of 80 mW
Ø 16 PIM cores will consume 1.28 W, which is only ~13% of the 10 W TDP budget of the logic layer 4
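The area and power arithmetic can be checked directly from the per-core figures quoted above (0.80 mm² and 80 mW per Cortex-A5, against the cited 10 W logic-layer TDP):

```python
# Sanity check of the feasibility numbers for 16 PIM cores per 3DMU.
CORES_PER_3DMU = 16
AREA_PER_CORE_MM2 = 0.80      # ARM Cortex-A5 at 40 nm, as quoted above
POWER_PER_CORE_W = 0.080      # average power per core
LOGIC_DIE_TDP_W = 10.0        # logic-layer TDP budget, as cited

total_area = CORES_PER_3DMU * AREA_PER_CORE_MM2    # 12.8 mm^2
total_power = CORES_PER_3DMU * POWER_PER_CORE_W    # 1.28 W
power_fraction = total_power / LOGIC_DIE_TDP_W     # ~13% of the TDP budget
print(f"{total_area:.1f} mm^2, {total_power:.2f} W ({power_fraction:.0%} of TDP)")
# 12.8 mm^2, 1.28 W (13% of TDP)
```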
Performance Analysis
• The performance gain using 16 PIM cores per 3DMU, assuming a slowdown factor of 4, is shown here
• By overlapping map execution with input reading, we reduced the execution time by t_map
• The average reduction is 8%
• We expect that overlapping some of the merge and reduce functions would lead to higher gains
• Gains are higher if the input does not fit in memory
Energy Analysis
• Data is for 16 PIM cores in each 3DMU (64 in total)
• Relative energy savings range from 10% to 23%, with absolute energy savings ranging from 80 J to 2045 J
• The energy reduction is due to the low-power cores
Fig.: Energy consumption by processing units
Bandwidth Utilization
• On the baseline system, peak bandwidth consumption is less than 15 GB/s
• Assuming 64 PIM cores, peak bandwidth consumption is 60 GB/s (15 GB/s at each 3DMU)
• Each SerDes link provides 40 GB/s at 5 W of power consumption 5
• 64 ARM Cortex-A5 cores (if placed on the host chip) can consume at most 88 GB/s 6, requiring at least 3 SerDes links to transfer data between the host and the 3D-DRAM
• But when we use PIM cores, these high-bandwidth links are not required
Fig.: Bandwidth consumption when running wordcount on the two systems.
Summary and conclusions
• We propose to use simple energy efficient cores embedded in the logic layer of 3D-DRAMs
• We show how our architecture can be used for MapReduce workloads
• We estimated the number of PIM cores needed to achieve a good balance of work between host and PIMs
• We have shown that we achieve both energy savings and performance gains
• We have found that most applications do not need the high bandwidth offered (or proposed) by current prototypes of 3D-stacked DRAMs
Future Work
• Conduct experiments using a variety of scale-out applications
• Investigate the impact of reduced memory bus traffic on the memory hierarchy Ø Bandwidth utilization
§ Use low-energy buses with smaller bandwidth instead of high-speed SerDes links
Ø Alternative cache organization § What are the savings?
• ARM Cortex-A5
Ø Do we need this much processor complexity?
Ø Simpler RISC cores? Even lower energy consumption
Ø GPGPUs, FPGAs?
• Estimate the performance of a cluster comprising the proposed nodes
Related Study
• Several studies have been conducted on using 3D-DRAM as an LLC or as main memory [1, 2]
• Researchers worked on the PIM idea a decade ago [3, 4, 5, 6], but integrating DRAM and logic on the same die was not very successful at that time
• The Phoenix++ MapReduce framework [7] works on conventional large-scale shared-memory CMP and SMP systems. We use Phoenix++ as our basis and propose changes to adapt it to our PIM architecture
• The Near Data Computing (NDC) architecture [8] is similar in spirit to our study and assumes 3D-DRAMs embedded with processing cores
Ø The NDC study works only with in-memory MapReduce workloads
• A recent study of scale-out cloud applications shows that they do not benefit from common server-class processors with complex microarchitectures and deep cache hierarchies; they also do not need high-bandwidth on- and off-chip links [9]
References
1. Black, B., Annavaram, M., Brekelbaum, N., DeVale, et al.: Die stacking (3D) microarchitecture. In: Micro, pp. 469-479. IEEE, (2006)
2. Zhang, D. P., Jayasena, N., Lyashevsky, A., et al.: A new perspective on processing-in-memory architecture design. In: Proceedings of the ACM SIGPLAN Workshop
3. Patterson, D., Anderson, T., Cardwell, N., et al.: A case for intelligent RAM. In: Micro, 17(2), 34-44. IEEE, (1997)
4. Torrellas, J.: FlexRAM: Toward an advanced Intelligent Memory system: A retrospective paper. In: Intl. Conference on Computer Design, pp. 3-4. IEEE, (2012)
5. Draper, J., Chame, J., Hall, M., et al.: The architecture of the DIVA processing-in-memory chip. In: Proceedings of Supercomputing, pp. 14-25. ACM, (2002)
6. Rezaei, M., Kavi, K. M.: Intelligent memory manager: Reducing cache pollution due to memory management functions. In: Journal of Systems Architecture, 52(1), 41-55. (2006)
7. Talbot, J., Yoo, R. M., Kozyrakis, C.: Phoenix++: modular MapReduce for shared-memory systems. In: Proceedings of the International Workshop on MapReduce and its Applications, pp. 9-16. ACM, (2011)
8. Pugsley, S. H., Jestes, J., Zhang, H.: NDC: Analyzing the Impact of 3D-Stacked Memory+Logic Devices on MapReduce Workloads. In: International Symposium on Performance Analysis of Systems and Software. (2014)
9. Ferdman, M., Adileh, A., Kocberber, O., et al.: A Case for Specialized Processors for Scale-Out Workloads. In: Micro, pp. 31-42. IEEE, (2014)
Thank You! Questions?