Heterogeneous Architecture Design with Emerging 3D and Non-Volatile Memory Technologies

Qiaosha Zou∗, Matthew Poremba∗, Rui He†, Wei Yang†, Junfeng Zhao†, Yuan Xie‡
∗ Computer Science and Engineering, The Pennsylvania State University, USA
† Huawei Shannon Lab, China
‡ Electrical and Computer Engineering, University of California, Santa Barbara, USA
Email: ∗{qszou, mrp5060}@cse.psu.edu, †{ray.herui, william.yangwei, junfeng.zhao}@huawei.com
Abstract—Energy has become the primary concern in modern multi-core architecture design. Moore's law predicts that an exponentially growing number of cores can be packed into a single chip every two years; however, the rising power density is an obstacle to continuous performance gains. Recent studies show that heterogeneous multi-core design is a competitive, promising way to optimize performance per watt. In this paper, different types of heterogeneous architecture are discussed. For each type, current challenges and the latest solutions are briefly introduced. Preliminary analyses illustrate the scalability of heterogeneous systems and their potential benefits for future application requirements. Moreover, we demonstrate the advantages of leveraging three-dimensional (3D) integration in heterogeneous architectures. With 3D die stacking, disparate technologies, such as CMOS logic and emerging non-volatile memory, can be integrated on the same chip, enabling a new paradigm of architecture design.1
I. INTRODUCTION
As Moore's law predicted, the number of transistors doubles roughly every 18 months. Performance, however, has not scaled exponentially, because it is limited by the mismatched scaling of power consumption and memory bandwidth. Furthermore, in today's computing systems, energy efficiency has become the primary design concern. The traditional scale-out strategy of packing more cores into a single chip is no longer sustainable in terms of power. The dark silicon concept captures this: a 2x power shortfall arises when a chip is powered at its native frequency [1], so the fraction of cores that can be active in the same area drops exponentially across generations. Fortunately, the single-chip heterogeneous multi-core is one potential way to balance power consumption against performance enhancement. A heterogeneous multi-core combines conventional processors with emerging computing fabrics and memory technologies, such as GPGPUs and NVMs.
Meanwhile, the slow growth of pin counts poses another challenge for both homogeneous and heterogeneous systems. Compared to the exponential growth of transistor counts, pin counts increase only linearly, resulting in scarce bandwidth resources. Moreover, with the growing variety of applications and the blooming of media and streaming processing, bandwidth has become a crucial factor for performance and throughput. Extending chips into the third dimension with three-dimensional integrated circuits (3D ICs) can alleviate the pin count limitation through low-latency, high-bandwidth vertical connections. Because each die in a stack is processed in isolation, 3D ICs further accelerate the adoption of heterogeneous systems in a cost-efficient fashion. Figure 1 sketches a future 3D computing system with layers of CPUs, GPUs, and accelerators; off-chip hybrid memories are placed beside the computing fabrics on an interposer [2, 3]. Through-silicon vias (TSVs) serve as the vertical connections between tiers, providing high-bandwidth, low-latency interconnects.

1 Zou, Poremba, and Xie were supported in part by NSF grants 1218867, 1213052, 1409798, and 1017277, and by SRC grants.
[Figure 1 appears here.]
Fig. 1. Overview of a future 3D computing system combining CPUs, GPUs, accelerators, on-chip memory, and off-chip hybrid memory.
In a 3D heterogeneous system, digital and analog fabrics can be integrated in a single chip while each component is optimized separately, with its own optimal clock frequency, supply voltage, and even technology node. On the digital side, several layers of CPUs contain multiple cores with diverse computing capabilities and pipeline depths for power efficiency. Tiers of GPUs are stacked to provide massive numbers of processing elements for highly parallel workloads. Accelerators dedicated to target applications are designed on demand and integrated at low cost. Moreover, on-chip memory is stacked to satisfy the increasing memory bandwidth demand through low-latency TSV arrays. Off-chip hybrid memory moves into closer proximity to the processing chip thanks to the interposer. The CPUs and GPUs connect to off-chip memory through dedicated memory interfaces and high-bandwidth on-chip interconnects.
In this paper, three major heterogeneous integration strategies for multi-core systems are summarized. The first is the single-ISA heterogeneous multi-core architecture in
Section II; specifically, we focus on integrating cores built in different technology nodes using 3D stacking. This kind of heterogeneity can maximize performance within a given cost and power budget. The second is the popular heterogeneous-ISA architecture that integrates conventional CPUs with unconventional processing elements (GPUs, FPGAs, and accelerators); we mainly focus on the integration of CPUs and GPUs in Section III. Both latency-sensitive and throughput-oriented applications can benefit from this integration, which is advocated by numerous industrial products such as AMD Fusion, Intel Sandy Bridge, and NVIDIA Tegra [4, 5, 6]. The last, in Section IV, is the combination of memories built from different materials, namely the integration of traditional DRAM/SRAM technology with emerging non-volatile memory (NVM). By leveraging the low standby power of NVMs and the short access latency of traditional memories, memory bandwidth can be provided at an affordable power cost.
II. SINGLE-ISA HETEROGENEOUS MULTI-CORE ARCHITECTURES
Today's applications show enormous diversity in their demands for computing resources. Furthermore, the requirements vary across execution phases even within the same program. Providing a uniform multi-core architecture with general-purpose computing capability is therefore over-provisioned and stresses power management. Instead of packing in ever more powerful cores, designers are seeking solutions that use cores with varied computing capabilities to provide just enough service. The most efficient design that avoids major re-design effort is the single-ISA (instruction set architecture) multi-core architecture, which keeps the same ISA across all cores and varies only their computing resources.
The prevalent single-ISA design combines high-performance cores (big cores) with low-power cores (small cores). Big cores deliver higher performance for compute-intensive and latency-sensitive applications at the cost of higher power consumption. Small cores are simple, low-power processors (e.g., in-order processors) with lower throughput, yet they are more energy-efficient for memory-intensive applications. Performance and power modeling of this multi-core architecture was first conducted in an exploration of Alpha cores [7]. The researchers examined the energy savings and performance degradation of SPEC benchmarks on an architecture containing four cores: two in-order small cores and two out-of-order big cores. The results show that a 39% average energy reduction is achieved with only 3% performance degradation; the energy saving even exceeds that of dynamic voltage/frequency scaling.
The idea of integrating big and small cores has been promoted by industry since ARM announced its heterogeneous multi-core design, big.LITTLE [8]. In this design, the Cortex-A15, a triple-issue out-of-order processor, serves as the big core, while the Cortex-A7, an in-order, non-symmetric dual-issue processor, is the little core. In general, the Cortex-A15 delivers 2-3x higher performance than the A7, while the A7 is about 3-4x more energy efficient, largely due to the different pipeline lengths.
Because cores in a heterogeneous system have such different energy and performance characteristics, scheduling each application onto the appropriate core is crucial. Moreover, since application requirements change across execution phases, dynamic application mapping and migration are necessary. Recently, substantial research efforts have been made to accurately model application performance on different cores [9] and to conduct application mapping/scheduling and migration [10, 11, 12]. An analytical performance model has been used to study the fundamental design tradeoffs of single-ISA heterogeneous multi-cores [9]. In addition to whole-system throughput, the focus of earlier work, per-program performance is also considered as a design metric. By determining the frontier of Pareto-optimal architectures, the study finds that no single heterogeneous configuration can balance per-program performance and system throughput: fundamentally, per-program performance is traded for system throughput. Moreover, the effectiveness of heterogeneity depends heavily on job mapping.
In general, task scheduling can be performed statically or dynamically. Application characteristics can be extracted before execution and used to determine the suitable core statically [13]. For dynamic application mapping, characteristics of tasks and processors, such as power, utilization, and bandwidth requirements, are monitored, and the appropriate core is then selected for better speedup or power efficiency. For instance, a power-performance balance based on price theory can guide task-to-core mapping [10]. Because application characteristics vary over time, a predictive trace-based controller can anticipate the upcoming phase and migrate execution accordingly [11]. Processor utilization affects system performance and energy; a utilization-based load balancer implemented in Linux has been shown to reduce energy by up to 11.35% [12]. A minimal sketch of such utilization-driven core selection follows.
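As an illustration only, the Python sketch below shows how a utilization-driven big/little mapping policy might look; the thresholds and the sampled metrics are hypothetical stand-ins, not taken from any of the cited schedulers.

```python
# Illustrative sketch (not from the paper): utilization-driven mapping of
# task phases onto big/little cores. Thresholds are assumed values.

BIG_THRESHOLD = 0.80  # assumed utilization level that justifies a big core

def pick_core(task_util: float, mem_intensity: float) -> str:
    """Map a task phase to a core type from sampled runtime characteristics.

    task_util:      fraction of cycles the task keeps the pipeline busy (0..1)
    mem_intensity:  fraction of stall cycles spent waiting on memory (0..1)
    """
    # Memory-bound phases gain little from a wide out-of-order core,
    # so they are steered to the energy-efficient little core.
    if mem_intensity > 0.5:
        return "little"
    # Compute-intensive, latency-sensitive phases earn the big core.
    if task_util > BIG_THRESHOLD:
        return "big"
    return "little"

# Re-evaluate periodically so execution migrates as program phases change.
for util, mem in [(0.9, 0.1), (0.4, 0.7), (0.85, 0.2)]:
    print(pick_core(util, mem))
```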
In addition to the basic integration of big and small cores, heterogeneous multi-core design across different technology nodes is another route to power-efficient architecture. This kind of design can also be cost-efficient, since components fabricated in earlier, mature processes are cheaper to integrate. As shown in Figure 2, at the start of a new technology node the cost is extremely high compared to mature technologies. The new node nevertheless offers higher integration density and transistor speed, driving industry to adopt it. On the other hand, beyond cost, the ever-growing power density and leakage power at smaller feature sizes make previous technologies appealing. Therefore, instead of building all cores in the same technology, some cores can use older technologies, exploiting the price gap to integrate more cores. For example, if the transistor cost in technology node N is four times that in node N-1, we can integrate up to four cores in node N-1 at the same cost as a single core in node N, as the sketch below illustrates.
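A back-of-the-envelope rendering of this cost arithmetic; the 4x ratio comes from the example above, and the budget value is an arbitrary assumption.

```python
# Cost-constrained core counts across two technology nodes (illustrative).
cost_node_n = 4.0    # relative transistor cost of a core in node N
cost_node_n1 = 1.0   # relative cost of the same core in node N-1
budget = 4.0         # assumed budget equal to one node-N core

cores_in_n = int(budget // cost_node_n)    # 1 core in node N
cores_in_n1 = int(budget // cost_node_n1)  # 4 cores in node N-1
print(cores_in_n, cores_in_n1)  # -> 1 4
```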
For cores in different technology generations, the clock
frequency and power supply become the major challenges,
[Figure 2 appears here: normalized transistor cost (0 to 1) over time, 1Q10 through 3Q14, for the 40nm, 28nm, and 20nm nodes.]
Fig. 2. The normalized transistor cost of different technology nodes [14].
since the intrinsic threshold voltage of a technology determines its minimum supply voltage and the corresponding maximum clock frequency. Therefore, building cores of different technology nodes on a single die requires sophisticated power and clock designs. Fortunately, emerging 3D integration can solve this problem by placing the cores on different dies [15]. Each die can then have its own power domain and clock network without interfering with the others, with the help of cross-power-domain interfaces. Moreover, the fabrication process is simplified because each die is processed separately; designers can focus on their own dies, and different companies can cooperate easily.
III. HETEROGENEOUS ARCHITECTURES CONTAINING CPUS AND GPUS
The push for energy efficiency is forcing microprocessor designers to explore new architectures with sustainable performance per watt. The previous section described single-ISA multi-core architectures. However, single-ISA heterogeneity has a limitation: all the cores are only good at handling latency-sensitive applications. Recent research has found that combining the advantages of CPUs and GPUs can sustain performance improvement for both latency- and throughput-oriented applications in an energy-efficient way [16]. The CPU is strong at sequential computing with low latency, while the GPU provides high throughput through massively parallel data processing.
The general-purpose GPU was first introduced for throughput-oriented graphics applications because it contains a large number of processing elements that operate in parallel, applying simple instructions to massive amounts of data. However, the GPU cannot substitute for the CPU, because GPUs operate at much lower frequencies than CPUs. CPUs handle complex instructions at higher frequency, and latency is their highest priority. For example, an Intel Core i7 CPU runs at about 3GHz, while a recent GPU runs at 1300MHz [17, 18]. Moreover, even though the GPU provides higher throughput, applications with little opportunity for parallelism see no significant performance gain.
Due to their distinct bandwidth requirements, CPUs and GPUs usually use different memories. When CPUs and GPUs are combined to cooperate, data sharing and movement become the bottleneck for performance improvement. For a single application with both serial and parallel sections, data exchanges between CPUs and GPUs are necessary: data needed by the GPU must be copied from the CPU side over an interconnect such as PCIe. However, the bandwidth of PCIe (about 20GB/s [19]) is dramatically lower than that of GPU memory (more than 200GB/s), resulting in large data-sharing latency, as the rough calculation below illustrates.
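A rough back-of-the-envelope comparison, using the bandwidth figures above and an assumed 1GB working set:

```python
# Time to move a working set at the bandwidths quoted in the text.
# The 1 GB working-set size is an illustrative assumption.
working_set_gb = 1.0
pcie_bw_gbs = 20.0      # PCIe bandwidth cited above
gpu_mem_bw_gbs = 200.0  # GPU memory bandwidth cited above

print(f"PCIe copy:  {working_set_gb / pcie_bw_gbs * 1e3:.0f} ms")   # 50 ms
print(f"GPU memory: {working_set_gb / gpu_mem_bw_gbs * 1e3:.0f} ms")  # 5 ms
# The 10x gap is why a unified address space that avoids the copy helps.
```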
Fortunately, integrated CPUs and GPUs with shared memory have been proposed. Extensive data duplication between CPU and GPU is eliminated thanks to a unified memory address space. Nevertheless, the unified memory space burdens the shared memory bandwidth and complicates memory request scheduling. Latency-sensitive CPUs have low tolerance for memory latency. GPUs, in contrast, place few demands on memory latency, since they are designed to hide long latencies through thread-level parallelism, but they occupy high memory bandwidth for relatively long periods. The integration therefore costs each side bandwidth, degrading performance.
There are two directions for solving the shared-bandwidth problem. The first is to devise a high-bandwidth memory that provides adequate resources for both CPUs and GPUs. 3D stacked memory is one competitive solution, as demonstrated by the Hybrid Memory Cube [20] and High Bandwidth Memory [21]. According to their specifications, HMC can provide up to 320GB/s of bandwidth, while HBM is dedicated to graphics applications with 256GB/s of bandwidth.
The other direction is to intelligently schedule memory requests for fair and efficient bandwidth sharing [22, 23, 24]. Bandwidth pressure can be reduced at the main memory level or at the cache level. Staged memory scheduling decouples the memory controller's task into three stages: the first stage groups requests by row-buffer locality, the second stage schedules across applications, and the last stage handles DRAM commands and timing to perform the final data operations [22]. A minimal sketch of the first two stages follows.
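This sketch is an assumed simplification in Python: the request fields, the batching key, and the CPU-over-GPU preference are illustrative stand-ins for the actual policy in [22], not its implementation.

```python
# Minimal sketch of staged memory scheduling: stage 1 batches requests that
# hit the same open DRAM row; stage 2 picks which application's batch to
# service next. Field names and the priority rule are hypothetical.

from collections import defaultdict, deque

def stage1_form_batches(requests):
    """Group requests by (app, bank, row) so row-buffer hits are serviced together."""
    batches = defaultdict(deque)
    for req in requests:  # req = (app_id, bank, row, addr)
        app_id, bank, row, _ = req
        batches[(app_id, bank, row)].append(req)
    return batches

def stage2_schedule(batches, latency_sensitive_apps):
    """Prefer batches from latency-sensitive (CPU) apps, then larger batches."""
    ordered = sorted(
        batches.items(),
        key=lambda kv: (kv[0][0] not in latency_sensitive_apps, -len(kv[1])),
    )
    for _, batch in ordered:
        for req in batch:
            yield req  # stage 3 (DRAM command/timing) would issue these

reqs = [("cpu0", 0, 7, 0x100), ("gpu0", 0, 7, 0x140), ("cpu0", 0, 7, 0x108)]
for r in stage2_schedule(stage1_form_batches(reqs), {"cpu0"}):
    print(r)
```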
Unlike prior studies that focus on multi-application scenarios, memory scheduling for a single parallel application is dramatically different; various optimization techniques that enhance row-buffer locality improve overall throughput by up to 8% [24]. The last-level cache is effective at hiding memory latency only when GPU multi-threading cannot, so increasing the cache hit rate for GPGPUs is not always necessary. Consequently, a core-sampling mechanism has been proposed to exploit the symmetric behavior of GPGPU applications, along with two new thread-level-parallelism-aware cache management strategies [23]. The bandwidth available to CPUs can be squeezed further when coherence requests compete for it, and traditional directory-based coherence is hard to apply directly in the heterogeneous scenario. One design replaces the traditional directory with a region directory and adds region buffers to both the CPU and GPU L2 caches to track permission regions; by moving coherence requests onto the incoherent direct-access bus, bandwidth to the directory is reduced by an average of 94% [25].
Besides bandwidth allocation, another challenging task is executing work on the system efficiently, spanning the programming model and task scheduling in both software and hardware [26, 27, 28, 29]. In a heterogeneous platform, the ISA and functionality of GPUs differ fundamentally from those of general-purpose CPUs. Existing and new applications must therefore be tailored to the new architecture with great effort, given that developers may not be familiar with its features. OpenCL [26] is emerging as the first open standard for cross-platform parallel programming of modern architectures. Beyond the open standard, several studies adapt the basic programming model to their own target systems: an integrated C/C++ programming environment supporting specialized cores has been proposed for heterogeneous-ISA MIMD architectures [27], and an OpenCL-based framework, SnuCL, makes OpenCL applications portable across compute devices in heterogeneous clusters [29].
Performance modeling is also an interesting topic in heterogeneous systems, as it can predict scalability at an early design stage. One simple analytical model that can be applied is Amdahl's Law, and previous studies have extended it into the heterogeneous computing era [30, 31]. Measurements and predictions suggest that future computers will integrate various unconventional computing units (GPGPUs, FPGAs, and ASICs) [30]. The same study also shows that sufficient parallelism is the prerequisite for significant performance gains from heterogeneous computing, and that bandwidth is a first-order concern in developing an efficient system.
Two processing modes are available for CPU and GPU integration: asymmetric and simultaneous asymmetric. The first mode divides a program into three segments: serial execution on one CPU, parallel execution on the multi-core CPUs, and parallel execution on the GPUs. The second mode schedules different programs onto CPUs and GPUs, which compute simultaneously. Amdahl's Law has been revised to capture the system configuration leading to the optimal speedup [31]. That study assumes a structure in which CPU and GPU share the same memory space; we extend the work to take data-sharing delay into consideration and compare the speedups of unified and separate memory spaces.
Following the definitions in previous work [31], the total numbers of CPUs and GPUs are denoted by c and g. Under a power constraint measured in units of a single CPU's power consumption, the number of GPUs is limited to (PB − c)/wg, where PB is the power budget and wg is the power ratio of a GPU to a CPU. Assume the parallelizable portion of the program is f and the portion of the parallel work running on the CPU cores is α. Let β be the execution time ratio of GPU to CPU, and let γ be the data-sharing latency normalized to the program's computation time on a single CPU. For memory-intensive applications, γ can exceed 1. Note that we assume the latency of data sharing among CPUs is negligible. The theoretical asymmetric speedup is then:
Speedup = 1 / ( (1 − f) + αf/c + γ + (1 − α)f/(gβ) )    (1)
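For concreteness, here is a small Python rendering of Equation (1) under the power constraint. The budget PB = 100 is an assumption consistent with sweeping c from 1 to 99; the remaining parameter values mirror the Figure 3 settings below.

```python
# Direct implementation of Equation (1). Defaults mirror the Figure 3
# experiment (f=0.7, alpha=0.5, beta=0.5, wg=0.25); P_B=100 is assumed.

def speedup(c, g, f=0.7, alpha=0.5, beta=0.5, gamma=0.0):
    """Asymmetric CPU+GPU speedup with normalized data-sharing delay gamma."""
    serial = 1.0 - f                           # sequential fraction on one CPU
    cpu_parallel = alpha * f / c               # parallel work on c CPU cores
    gpu_parallel = (1.0 - alpha) * f / (g * beta)  # parallel work on g GPUs
    return 1.0 / (serial + cpu_parallel + gamma + gpu_parallel)

P_B, w_g = 100.0, 0.25  # power budget in CPU units, GPU/CPU power ratio
for c in (1, 10, 50, 99):
    g = (P_B - c) / w_g  # GPUs allowed by the remaining power budget
    for gamma in (0.0, 0.5, 2.5):
        print(f"c={c:3d} gamma={gamma:.1f} speedup={speedup(c, g, gamma=gamma):.2f}")
```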
Figure 3 shows the speedups for a program running on a CPU+GPU system. We vary the CPU count from 1 to 99 and examine the speedup when γ equals 0, 0.1, 0.5, 1.0, and 2.5. The curves indicate that, under the power constraint, speedup first increases with more CPUs; after a turning point, integrating more CPUs degrades performance because the parallel resources shrink (one CPU consumes as much power as four GPUs). Furthermore, when the data-sharing delay is sufficiently large, CPU+GPU integration yields no performance benefit at all. This result highlights the significant benefit of a unified memory space.
[Figure 3 appears here: speedup (0 to 4) versus CPU count (1 to 100) for γ = 0, 0.1, 0.5, 1.0, and 2.5.]
Fig. 3. Speedups considering the data sharing delay. The x-axis is in logarithmic scale. f = 0.7, α = 0.5, β = 0.5, and wg = 0.25.
IV. HETEROGENEOUS MEMORY ARCHITECTURES
The memory and storage system is a critical component of computer systems, and more and more applications are shifting from compute-bound to data-bound. A hierarchy of memory and storage components is used to efficiently store and manipulate large amounts of data. However, the performance and energy scaling of mainstream memory technologies cannot keep up with the requirements of current computing systems. Commodity memory technologies such as SRAM and DRAM face scalability challenges due to limits on device cell size and power dissipation. In particular, the leakage power of SRAM and DRAM and the refresh power of DRAM are increasing, and they contribute a significant portion of overall system energy [32]. An energy-efficient memory subsystem is therefore urgently needed to sustain system improvement.
Recently, emerging byte-addressable non-volatile memory technologies have been studied as replacements for traditional memories due to their promising characteristics: higher density, lower leakage power, and non-volatility. Representative technologies include spin-transfer torque memory (STT-RAM), phase-change memory (PCM), and resistive memory (ReRAM) [33, 34, 35]. Nevertheless, several shortcomings of NVMs, such as high write latency/power and low endurance, impede their direct adoption as full replacements.
Heterogeneous integration of SRAM/DRAM with NVMs is one potential solution to these design challenges.
Most NVM technologies are not compatible with the CMOS process traditionally used to implement digital logic. Consequently, for most types of NVM, silicon interposers or 3D stacking are leveraged for integration [36]: layers of different NVM technologies and traditional SRAM/DRAM can be stacked together. Moreover, 3D stacked memory can increase memory capacity and bandwidth in a cost- and energy-efficient fashion. For instance, Sun et al. [33] proposed a 3D cache architecture with STT-RAM as the L2 cache; thanks to STT-RAM's non-volatility, the study demonstrates a 70% reduction in system power with moderately improved performance. CMPs are vulnerable to soft errors, which can be eliminated by building all levels of the memory hierarchy from stacked STT-RAM, improving system performance by 14.5% with a 13.44% power reduction [37]. The non-volatility and soft-error immunity of NVMs also make them attractive for FPGAs: a 3D PCM-based FPGA exhibits advantages over 3D FPGAs built with traditional memory technologies in power consumption, wirelength, and critical path delay [38].
Main memory needs to be sufficiently large to hold most of an application's data. Commodity computer systems use DRAM as main memory, which may not be sustainable given DRAM's inherent scalability limits. Among the various NVMs, PCM is believed to be the best candidate for main memory. However, using PCM alone has endurance and power problems, as indicated by Lee et al. [34]: their study shows that a pure PCM-based main memory can be 1.6x slower and consume 2.2x more energy than DRAM-based memory, due to PCM's high write latency and energy. In a DRAM-PCM hybrid memory, DRAM can serve as a buffer that absorbs memory writes at low latency, or it can be placed in parallel with PCM to share a portion of the memory requests. Qureshi et al. [39] propose a main memory design with a PCM region and a small DRAM buffer; their study shows a 3x speedup, benefiting from both the short latency of DRAM and the high capacity of PCM. Ramos et al. [40] study the energy-delay² (ED²) of a hybrid system in which pages migrate between DRAM and PCM based on monitored access patterns; across simulations of 27 workloads (SPEC, SPEC2006, and Stream suites), their system is more robust and achieves lower ED².
Lately, big data applications have emerged as a mainstream workload in datacenters and warehouse-scale computing. Their enormous memory footprints and energy consumption make DRAM+PCM hybrid memory even more attractive. In this section, we preliminarily explore the latency and energy of the two hybrid memory styles (DRAM as a buffer or as part of main memory) under big data applications. We extract the memory traces of six applications (aggregation, join, kmeans, pagerank, select, terasort) from a big data benchmark suite [41] and two traditional applications (volrend, radiosity) from SPLASH-2 [42]. The traces are then replayed in the non-volatile memory simulator NVMain [43] for latency and power estimation. The total off-chip memory capacity is kept constant at 4GB, and the memory frequency is 800MHz. When the system contains both DRAM and PCM as main memory, the DRAM capacity is 1GB and the PCM capacity is 3GB; when DRAM is used as a cache, its capacity is 32MB.
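To make the parallel organization concrete, the toy replay below scores a trace against a flat 1GB-DRAM + 3GB-PCM address split. This is an illustrative model only, not NVMain's actual interface [43], and all latency constants are placeholder assumptions rather than measured values.

```python
# Toy replay of a memory trace against a 1GB-DRAM + 3GB-PCM split.
# Latencies are assumed placeholders; PCM writes are the slowest operation.

DRAM_LAT, PCM_READ_LAT, PCM_WRITE_LAT = 50, 150, 450  # clock cycles, assumed

def replay(trace, dram_bytes=1 << 30):
    """Average latency when low addresses map to DRAM and the rest to PCM."""
    total = 0
    for op, addr in trace:  # op is 'R' or 'W'
        if addr < dram_bytes:
            total += DRAM_LAT
        elif op == "R":
            total += PCM_READ_LAT
        else:
            total += PCM_WRITE_LAT
    return total / len(trace)

trace = [("R", 0x1000), ("W", 0x5000_0000), ("R", 0xC000_0000)]
print(f"avg latency: {replay(trace):.1f} cycles")
```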
Figure 4 shows the average latency and power consumption of four memory configurations: pure DRAM, pure PCM, DRAM+PCM, and DRAM as cache. Due to PCM's long write latency, pure PCM has the highest latency of the four, while pure DRAM unsurprisingly has the lowest. The DRAM cache achieves better latency in three applications (pagerank, volrend, radiosity), especially the traditional ones, owing to their relatively higher cache hit rates. The highest latency for volrend occurs in the DRAM+PCM case because most of its memory accesses go to PCM, which the power results also suggest. On the power side, the low DRAM-cache hit rate of the big data benchmarks leads to high power consumption; moreover, because the cache metadata reside in DRAM, the DRAM cache consumes almost as much power as the pure DRAM configuration. The power consumption of DRAM+PCM is very close to that of pure PCM because of the relatively small DRAM capacity.
[Figure 4 appears here: bar charts over the benchmarks aggregation, join, kmeans, pagerank, select, terasort, volrend, and radiosity, comparing DRAM, PCM, Parallel (DRAM+PCM), and DRAM Cache. (a) Average latency in clock cycles (0 to 1200). (b) Power consumption in W (0 to 1.4).]
Fig. 4. The comparison of the average latency and power consumption for four memory configurations: pure DRAM, pure PCM, DRAM+PCM, and DRAM cache.
As the results illustrate, there is no single winner. However, some optimizations can mitigate the long latency of PCM and the high energy of DRAM in a hybrid design. For example, hierarchical metadata placement can reduce the number of DRAM accesses for better energy efficiency [44], and intelligent data placement can balance the workload between DRAM and PCM [45] and increase the cache hit rate.
V. CONCLUSION
The heterogeneous multi-core architecture promises power-sustainable performance scaling for future computer systems. In addition to traditional CPUs and memories, emerging computing fabrics and non-volatile memories are enlisted to improve performance per watt. In this paper, three typical heterogeneous systems were introduced: single-ISA multi-cores, integrated CPU-GPU architectures, and hybrid memory systems. Moreover, heterogeneous systems can leverage 3D integration to further expand the design space. Despite the promising features of heterogeneity, several key challenges remain to be tackled, such as task mapping/scheduling and application-specific design optimization.
REFERENCES
[1] M. Taylor, "Is dark silicon useful? Harnessing the four horsemen of the coming dark silicon apocalypse," in Design Automation Conference, 2012.
[2] Y. Koizumi, N. Miura, E. Sasaki, Y. Take, H. Matsutani, T. Kuroda, H. Amano, R. Sakamoto, M. Namiki, K. Usami, M. Kondo, and H. Nakamura, "A scalable 3D heterogeneous multi-core processor with inductive-coupling ThruChip interface," IEEE Micro, vol. 33, pp. 6–15, 2013.
[3] S. Borkar, "3D integration for energy efficient system design," in Design Automation Conference, 2011.
[4] A. Branover, D. Foley, and M. Steinman, "AMD Fusion APU: Llano," IEEE Micro, vol. 32, pp. 28–37, 2012.
[5] M. Yuffe, E. Knoll, M. Mehalel, J. Shor, and T. Kurts, "A fully integrated multi-CPU, GPU and memory controller 32nm processor," in IEEE International Solid-State Circuits Conference, 2011.
[6] "NVIDIA Tegra," http://www.nvidia.com/object/white-papers.html.
[7] R. Kumar, D. M. Tullsen, P. Ranganathan, N. P. Jouppi, and K. I. Farkas, "Single-ISA heterogeneous multi-core architectures for multithreaded workload performance," in International Symposium on Computer Architecture, 2004.
[8] "ARM big.LITTLE technology," http://www.arm.com/products/processors/technologies/biglittleprocessing.php.
[9] K. Van Craeynest and L. Eeckhout, "Understanding fundamental design choices in single-ISA heterogeneous multicore architectures," ACM Trans. Archit. Code Optim., vol. 9, pp. 1–32, 2013.
[10] T. Somu Muthukaruppan, A. Pathania, and T. Mitra, "Price theory based power management for heterogeneous multi-cores," in International Conference on Architectural Support for Programming Languages and Operating Systems, 2014.
[11] S. Padmanabha, A. Lukefahr, R. Das, and S. Mahlke, "Trace based phase prediction for tightly-coupled heterogeneous cores," in International Symposium on Microarchitecture, 2013.
[12] M. Kim, K. Kim, J. R. Geraci, and S. Hong, "Utilization-aware load balancing for the energy efficient operation of the big.LITTLE processor," in Design, Automation and Test in Europe, 2014.
[13] J. Chen and L. John, "Efficient program scheduling for heterogeneous multi-core processors," in Design Automation Conference, 2009.
[14] "Cost scaling trend," http://www.extremetech.com/computing/123529-nvidia-deeply-unhappy-with-tsmc-claims-22nm-essentially-worthless.
[15] Y. Xie, G. H. Loh, B. Black, and K. Bernstein, "Design space exploration for 3D architectures," Journal of Emerging Technology Computing Systems, vol. 2, pp. 65–103, 2006.
[16] "AMD heterogeneous system architecture," http://developer.amd.com/resources/heterogeneous-computing/what-is-heterogeneous-system-architecture-hsa/.
[17] "Intel Core i7," http://ark.intel.com/products/37148/.
[18] "NVIDIA GeForce GTX 980," http://www.geforce.com/hardware/desktop-gpus/geforce-gtx-980/specifications.
[19] "PCI Express 3.0," https://www.pcisig.com/specifications/pciexpress/base3/.
[20] J. Jeddeloh and B. Keeth, "Hybrid memory cube new DRAM architecture increases density and performance," in Symposium on VLSI Technology, 2012.
[21] JEDEC, "High Bandwidth Memory," http://www.jedec.org/category/technology-focus-area/3d-ics-0.
[22] R. Ausavarungnirun, K.-W. Chang, L. Subramanian, G. Loh, and O. Mutlu, "Staged memory scheduling: Achieving high performance and scalability in heterogeneous systems," in International Symposium on Computer Architecture, 2012.
[23] J. Lee and H. Kim, "TAP: A TLP-aware cache management policy for a CPU-GPU heterogeneous architecture," in International Symposium on High Performance Computer Architecture, 2012.
[24] H. Wang, R. Singh, M. J. Schulte, and N. S. Kim, "Memory scheduling towards high-throughput cooperative heterogeneous computing," in International Conference on Parallel Architectures and Compilation, 2014.
[25] J. Power, A. Basu, J. Gu, S. Puthoor, B. M. Beckmann, M. D. Hill, S. K. Reinhardt, and D. A. Wood, "Heterogeneous system coherence for integrated CPU-GPU systems," in IEEE/ACM International Symposium on Microarchitecture, 2013.
[26] "OpenCL API," https://www.khronos.org/opencl/.
[27] P. H. Wang, J. D. Collins, G. N. Chinya, H. Jiang, X. Tian, M. Girkar, N. Y. Yang, G.-Y. Lueh, and H. Wang, "EXOCHI: Architecture and programming environment for a heterogeneous multi-core multithreaded system," in Conference on Programming Language Design and Implementation, 2007.
[28] A. Kerr, G. Diamos, and S. Yalamanchili, "Modeling GPU-CPU workloads and systems," in Workshop on General-Purpose Computation on Graphics Processing Units, 2010.
[29] J. Kim, S. Seo, J. Lee, J. Nah, G. Jo, and J. Lee, "SnuCL: An OpenCL framework for heterogeneous CPU/GPU clusters," in International Conference on Supercomputing, 2012.
[30] E. Chung, P. Milder, J. Hoe, and K. Mai, "Single-chip heterogeneous computing: Does the future include custom logic, FPGAs, and GPGPUs?" in International Symposium on Microarchitecture, 2010.
[31] A. Marowka, "Extending Amdahl's law for heterogeneous computing," in International Symposium on Parallel and Distributed Processing with Applications, 2012.
[32] C. Lefurgy, K. Rajamani, F. Rawson, W. Felter, M. Kistler, and T. Keller, "Energy management for commercial servers," Computer, vol. 36, no. 12, pp. 39–48, Dec 2003.
[33] G. Sun, X. Dong, Y. Xie, J. Li, and Y. Chen, "A novel architecture of the 3D stacked MRAM L2 cache for CMPs," in International Symposium on High Performance Computer Architecture, 2009, pp. 239–249.
[34] B. C. Lee, E. Ipek, O. Mutlu, and D. Burger, "Architecting phase change memory as a scalable DRAM alternative," in International Symposium on Computer Architecture, 2009, pp. 2–13.
[35] C. Xu, X. Dong, N. Jouppi, and Y. Xie, "Design implications of memristor-based RRAM cross-point structures," in Design, Automation Test in Europe Conference, 2011.
[36] Y. Xie, "Future memory and interconnect technologies," in Design, Automation Test in Europe Conference Exhibition, 2013.
[37] G. Sun, E. Kursun, J. Rivers, and Y. Xie, "Exploring the vulnerability of CMPs to soft errors with 3D stacked non-volatile memory," in International Conference on Computer Design, 2011.
[38] Y. Chen, J. Zhao, and Y. Xie, "3D-NonFAR: Three-dimensional non-volatile FPGA architecture using phase change memory," in International Symposium on Low-Power Electronics and Design, 2010.
[39] M. K. Qureshi, V. Srinivasan, and J. A. Rivers, "Scalable high performance main memory system using phase-change memory technology," in International Symposium on Computer Architecture, 2009, pp. 24–33.
[40] L. E. Ramos, E. Gorbatov, and R. Bianchini, "Page placement in hybrid memory systems," in International Conference on Supercomputing, 2011.
[41] L. Wang, J. Zhan, C. Luo, Y. Zhu, Q. Yang, Y. He, W. Gao, Z. Jia, Y. Shi, S. Zhang, C. Zheng, G. Lu, K. Zhan, X. Li, and B. Qiu, "BigDataBench: A big data benchmark suite from internet services," in International Symposium on High Performance Computer Architecture, 2014.
[42] S. Woo, M. Ohara, E. Torrie, J. Singh, and A. Gupta, "The SPLASH-2 programs: Characterization and methodological considerations," in International Symposium on Computer Architecture, 1995.
[43] M. Poremba and Y. Xie, "NVMain: An architectural-level main memory simulator for emerging non-volatile memories," in IEEE Computer Society Annual Symposium on VLSI, 2012.
[44] J. Meza, J. Chang, H. Yoon, O. Mutlu, and P. Ranganathan, "Enabling efficient and scalable hybrid memories using fine-granularity DRAM cache management," Computer Architecture Letters, vol. 11, pp. 61–64, 2012.
[45] M. Pavlovic, N. Puzovic, and A. Ramirez, "Data placement in HPC architectures with heterogeneous off-chip memory," in International Conference on Computer Design, 2013.