Micro-Pages: Increasing DRAM Efficiency with Locality-Aware Data
Placement
Kshitij Sudan, Niladrish Chatterjee, David Nellans, Manu Awasthi, Rajeev Balasubramonian, Al Davis
School of Computing, University of Utah
ASPLOS-2010
DRAM Memory Constraints
• Modern machines spend nearly 25%-40% of total system power on memory.
• Some commercial servers already have larger power budgets for memory than for the CPU.
• Main memory access is one of the largest performance bottlenecks.
We address both performance and power concerns for DRAM memory accesses.
DRAM Access Mechanism
[Figure: DRAM organization. The CPU connects through the Memory Controller and a memory bus (channel) to DIMMs; each DIMM holds ranks of DRAM chips (devices), and each device contains banks of arrays.]
The CPU makes a memory request, and the memory controller converts it into the appropriate DRAM commands. Accesses within a device begin with selecting a bank, then a row; each device supplies 1/8th of the row buffer and outputs one word of data.
A few column bits are then selected from the row-buffer; these bits are the output from the device (the address decoding is sketched below). Many bits are read from the DRAM cells to service a single CPU request!
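To make this concrete, here is a minimal sketch of how a controller might decode a physical address into DRAM coordinates. The bit layout below (8 B columns, 8 banks per device, 16K rows per bank) is an illustrative assumption, not the mapping used in the paper.

```c
#include <stdint.h>
#include <stdio.h>

/* Assumed address layout (low to high bits): byte-in-word,
 * column, bank, row. Real controllers use many different mappings. */
typedef struct {
    uint32_t column;  /* picks a few bits out of the open row buffer */
    uint32_t bank;    /* the bank is selected before the row */
    uint32_t row;     /* row read into the row buffer (8 KB per DIMM) */
} dram_coord;

static dram_coord decode(uint64_t paddr) {
    dram_coord c;
    c.column = (paddr >> 3)  & 0x3FF;   /* 1024 columns of 8 B each */
    c.bank   = (paddr >> 13) & 0x7;     /* 8 banks per device */
    c.row    = (paddr >> 16) & 0x3FFF;  /* 16K rows per bank */
    return c;
}

int main(void) {
    dram_coord c = decode(0x12345678ULL);
    printf("row %u, bank %u, column %u\n", c.row, c.bank, c.column);
    return 0;
}
```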
DRAM Access Inefficiencies - I
• Overfetch due to large row-buffers.
  • 8 KB is read into the row buffer to service a 64-byte cache line.
  • Row-buffer utilization for a single request is < 1% (64 B / 8 KB ≈ 0.8%).
• Why are row buffers so large?
  • Large arrays minimize cost-per-bit.
  • Striping a cache line across multiple chips (arrays) improves data transfer bandwidth.
DRAM Access Inefficiencies - II
• Open-page policy.
  • Row buffers are kept open in the hope that subsequent requests will be row-buffer hits.
• FR-FCFS (First-Ready FCFS) request scheduling.
  • The memory controller schedules requests to open row-buffers first (see the sketch below).
• Diminishing locality in multi-cores.

                    Access Latency   Access Energy
  Row-buffer Hit    ~75 cycles       ~18 nJ
  Row-buffer Miss   ~225 cycles      ~38 nJ
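A minimal sketch of FR-FCFS, assuming a simple array-based request queue and per-bank open-row bookkeeping (both illustrative): a request that hits an already-open row is scheduled first; otherwise the oldest request wins.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

#define NBANKS 8

typedef struct {
    unsigned bank, row;
    unsigned long arrival;  /* smaller = older */
    bool valid;
} request;

/* Row currently latched in each bank's row buffer; -1 = closed. */
static int open_row[NBANKS] = { -1, -1, -1, -1, -1, -1, -1, -1 };

/* FR-FCFS: oldest row-buffer hit first, then oldest request overall. */
static int fr_fcfs_pick(const request q[], size_t n) {
    int hit = -1, oldest = -1;
    for (size_t i = 0; i < n; i++) {
        if (!q[i].valid) continue;
        if (open_row[q[i].bank] == (int)q[i].row &&
            (hit < 0 || q[i].arrival < q[hit].arrival))
            hit = (int)i;
        if (oldest < 0 || q[i].arrival < q[oldest].arrival)
            oldest = (int)i;
    }
    return hit >= 0 ? hit : oldest;  /* -1 if the queue is empty */
}

int main(void) {
    request q[3] = {
        { 0, 5, 1, true },  /* oldest, but bank 0's row is closed   */
        { 1, 9, 2, true },  /* hits bank 1's open row: picked first */
        { 0, 5, 3, true },
    };
    open_row[1] = 9;
    printf("scheduled request %d\n", fr_fcfs_pick(q, 3));  /* prints 1 */
    return 0;
}
```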
Key Observation: Cache Block Access Pattern Within OS Pages
For heavily accessed pages in a given time interval, accesses are usually to a few cache blocks.
Outline
• DRAM Basics.
• Motivation.
• Basic Idea.
• Software Only Implementation (ROPS).
• Hardware Implementation (HAM).
• Results.
Basic Idea
Gather all heavily accessed chunks of independent OS pages and map them to the same DRAM row.
[Figure: 4 KB OS pages are split into 1 KB micro-pages; the hottest micro-pages from different pages are gathered into a reserved DRAM region, while the coldest micro-pages remain in regular DRAM memory.]
Basic Idea
• Identifying "hot" micro-pages.
  • Memory controller counters and an OS daemon (see the sketch below).
• Reserved rows in DRAM for hot micro-pages.
  • Simplifies book-keeping overheads; 4 MB capacity loss from a 4 GB system (< 0.1%).
• Epoch-based schemes.
  • Expose the epoch length to the OS for flexibility.
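The counting step might look like the sketch below: the memory controller bumps a per-micro-page counter on every access, and once per epoch an OS daemon ranks the counters and hands back the hottest micro-pages for migration. The table size, linear lookup, and 1 KB granularity arithmetic are illustrative assumptions.

```c
#include <stdint.h>
#include <stdlib.h>

#define MICRO_PAGE_SHIFT 10   /* 1 KB micro-pages */
#define MAX_TRACKED      4096 /* assumed counter-table capacity */
#define RESERVED_SLOTS   4096 /* 4 MB reserved region / 1 KB micro-pages */

typedef struct { uint64_t mpage; uint32_t count; } hot_entry;

static hot_entry counters[MAX_TRACKED];
static size_t nentries;

/* Memory-controller side: count an access to a micro-page.
 * (A real design would use a small hardware structure, not a scan.) */
void count_access(uint64_t paddr) {
    uint64_t mp = paddr >> MICRO_PAGE_SHIFT;
    for (size_t i = 0; i < nentries; i++)
        if (counters[i].mpage == mp) { counters[i].count++; return; }
    if (nentries < MAX_TRACKED)
        counters[nentries++] = (hot_entry){ mp, 1 };
}

static int by_count_desc(const void *a, const void *b) {
    const hot_entry *x = a, *y = b;
    return (x->count < y->count) - (x->count > y->count);
}

/* OS-daemon side, once per epoch: pick the hottest micro-pages
 * (up to the reserved-region capacity) and reset the counters. */
size_t end_of_epoch(uint64_t hot_out[]) {
    qsort(counters, nentries, sizeof counters[0], by_count_desc);
    size_t n = nentries < RESERVED_SLOTS ? nentries : RESERVED_SLOTS;
    for (size_t i = 0; i < n; i++)
        hot_out[i] = counters[i].mpage;
    nentries = 0;  /* start the next epoch fresh */
    return n;
}
```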
Software Only Implementation (ROPS)
[Figure: Baseline vs. ROPS address translation. In the baseline, a CPU memory request with virtual address X is translated by the TLB to physical address Y in the 4 GB main memory. Under Reduced OS Page size (ROPS), the TLB maps X to physical address Z in the 4 MB reserved DRAM region when the micro-page is hot; cold micro-pages stay in regular memory.]
• Shrink the OS page size to 1 KB.
• Every epoch:
  1. Migrate hot micro-pages (TLB shoot-down and page table update).
  2. Promote cold micro-pages to a superpage (page table/TLB updated).
Software Only Implementation (ROPS)
• Reduced OS Page Size (ROPS).
• Throughout the system, reduce the page size to 1 KB.
• Migrate hot micro-pages via DRAM-copy (see the epoch-handler sketch below).
  • Hot micro-pages live in the same row-buffer in the reserved DRAM region.
• Mitigate the reduction in TLB reach by promoting cold micro-pages to 4 KB superpages.
  • Superpage creation is facilitated by "reservation-based" page allocation.
  • Allocate four 1 KB micro-pages to contiguous DRAM frames.
  • This places contiguous virtual addresses in contiguous physical addresses → makes superpage creation easy.
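A sketch of what the per-epoch ROPS maintenance might look like. The kernel helpers (`vaddr_to_paddr`, `dram_copy`, `remap_page`, `tlb_shootdown`, `promote_to_superpage`) are hypothetical names for illustration, not the paper's actual interfaces.

```c
#include <stdint.h>
#include <stddef.h>

#define MICRO_PAGE_SIZE 1024u  /* 1 KB micro-pages */

/* Hypothetical kernel helpers, named for illustration only. */
extern uint64_t vaddr_to_paddr(uint64_t vaddr);
extern void dram_copy(uint64_t src_paddr, uint64_t dst_paddr, size_t len);
extern void remap_page(uint64_t vaddr, uint64_t new_paddr);  /* page-table update */
extern void tlb_shootdown(uint64_t vaddr);                   /* drop stale TLB entries */
extern void promote_to_superpage(uint64_t vaddr);            /* merge four contiguous
                                                                1 KB micro-pages */

/* Per-epoch ROPS maintenance:
 * 1. migrate hot micro-pages into the reserved DRAM rows, and
 * 2. promote runs of cold micro-pages back to 4 KB superpages
 *    to recover TLB reach. */
void rops_epoch(const uint64_t hot_vaddr[], const uint64_t hot_dst_paddr[],
                size_t nhot, const uint64_t cold_super_vaddr[], size_t ncold) {
    for (size_t i = 0; i < nhot; i++) {
        dram_copy(vaddr_to_paddr(hot_vaddr[i]), hot_dst_paddr[i],
                  MICRO_PAGE_SIZE);
        remap_page(hot_vaddr[i], hot_dst_paddr[i]);
        tlb_shootdown(hot_vaddr[i]);
    }
    for (size_t i = 0; i < ncold; i++)
        promote_to_superpage(cold_super_vaddr[i]);  /* updates page table/TLB */
}
```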
Hardware Implementation (HAM)
PhysicalAddress
X
New addr . Y
4 GB Main MemoryCPU Memory Request
4 MB ReservedDRAM region
Y
X Page A
Mapping Table
X Y
Old Address New Address
BaselineHardware Assisted Migration (HAM)
Hardware Implementation (HAM)
• Hardware Assisted Migration (HAM): a new level of address indirection.
  − Place data wherever you want in DRAM.
• Maintain a Mapping Table (MT); see the lookup sketch below.
  − Preserves the old physical addresses of migrated micro-pages.
• DRAM-copy hot micro-pages to the reserved rows.
• Populate/update the MT every epoch.
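The indirection might look like the sketch below: on every request the controller consults the Mapping Table; a hit redirects the access to the micro-page's new slot in the reserved region, a miss falls through to the original address. The table size and the linear-scan lookup are illustrative assumptions (a real controller would use a CAM or set-associative structure).

```c
#include <stdint.h>
#include <stdbool.h>
#include <stddef.h>

#define MICRO_PAGE_SHIFT 10   /* 1 KB micro-pages */
#define MT_ENTRIES       4096 /* assumed: one entry per reserved 1 KB slot */

typedef struct {
    uint64_t old_mpage;  /* original physical micro-page number */
    uint64_t new_mpage;  /* slot in the 4 MB reserved DRAM region */
    bool valid;
} mt_entry;

static mt_entry mapping_table[MT_ENTRIES];

/* Controller-side lookup on every memory request: redirect migrated
 * micro-pages, pass everything else through unchanged. */
uint64_t ham_translate(uint64_t paddr) {
    uint64_t mp     = paddr >> MICRO_PAGE_SHIFT;
    uint64_t offset = paddr & ((1u << MICRO_PAGE_SHIFT) - 1);
    for (size_t i = 0; i < MT_ENTRIES; i++)
        if (mapping_table[i].valid && mapping_table[i].old_mpage == mp)
            return (mapping_table[i].new_mpage << MICRO_PAGE_SHIFT) | offset;
    return paddr;  /* not migrated: the old address is still valid */
}

/* Epoch update: install an old-to-new mapping for a migrated micro-page. */
void mt_install(size_t slot, uint64_t old_mpage, uint64_t new_mpage) {
    mapping_table[slot] = (mt_entry){ old_mpage, new_mpage, true };
}
```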
Results: Schemes Evaluated
• Baseline.
• Oracle/Profiled: best-effort estimate of the expected benefit in the next epoch, based on a prior profiling run.
• Epoch-based ROPS and HAM: epoch lengths of 5M, 10M, 50M, and 100M cycles were evaluated. Trends are similar; best performance with 5M and 10M.
• Simics simulation platform.
• DRAMSim-based DRAM timing.
• DRAM timing and energy figures from Micron datasheets.
Simulation Parameters
  CPU                           4-core out-of-order CMP, 2 GHz
  L1 Inst. and Data Cache       Private, 32 KB/2-way, 1-cycle access
  L2 Unified Cache              Shared, 128 KB/8-way, 10-cycle access
  Total DRAM Capacity           4 GB
  DIMM Configuration            8 DIMMs, 1 rank/DIMM, 64-bit channel, 8 devices/DIMM
  Active Row-Buffers per DIMM   4
  DIMM-Level Row-Buffer Size    8 KB
Results: Accesses to Micro-Pages in Reserved Rows in an Epoch
[Figure: per-application bars showing the % of total accesses that go to micro-pages in the reserved rows, and the total number of 4 KB pages touched in an epoch.]
Results: 5M-cycle Epoch, ROPS, HAM, and ORACLE
[Figure: percent change in performance per application under ROPS, HAM, and ORACLE.]
• Hardware assisted migration offers better returns due to lower TLB management overheads.
• Applications with room for improvement show an average performance improvement of 9%.
• Apart from the 9% performance gains, our schemes also save energy.
Conclusions
• On average, for applications with room for improvement and with our best-performing scheme:
  • Performance ↑ 9% (max. 18%).
  • Memory energy consumption ↓ 18% (max. 62%).
  • Row-buffer utilization ↑ 38%.
• Hardware assisted migration offers better returns due to fewer overheads from TLB shoot-downs and misses.
• Future work: co-locate hot micro-pages that are accessed around the same time.