Micro-Pages: Increasing DRAM Efficiency with Locality-Aware Data
Placement
Kshitij Sudan, Niladrish Chatterjee, David Nellans, Manu Awasthi, Rajeev Balasubramonian, Al Davis
School of Computing, University of Utah
ASPLOS-2010
DRAM Memory Constraints
• Modern machines spend nearly 25%-40% of total system power on memory.
• Some commercial servers already have larger power budgets for memory than for the CPU.
• Main memory access is one of the largest performance bottlenecks.
We address both performance and power concerns for DRAM memory accesses.
DRAM Access Mechanism
[Figure: DRAM organization. The CPU connects through the Memory Controller and a memory bus (channel) to DIMMs; each DIMM holds ranks of DRAM chips (devices), and each device contains banks of arrays.]
The CPU makes a memory request, and the memory controller converts it into the appropriate DRAM commands. Accesses within a device begin with selecting a bank, then a row; each device supplies 1/8th of the row buffer and outputs one word of data.
A few column bits are then selected from the row-buffer; these bits are the output from the device (the address decoding is sketched below). Many bits are read from the DRAM cells to service a single CPU request!
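To make this concrete, here is a minimal sketch of how a controller might decode a physical address into DRAM coordinates. The bit layout below (8 B columns, 8 banks per device, 16K rows per bank) is an illustrative assumption, not the mapping used in the paper.

```c
#include <stdint.h>
#include <stdio.h>

/* Assumed address layout (low to high bits): byte-in-word,
 * column, bank, row. Real controllers use many different mappings. */
typedef struct {
    uint32_t column;  /* picks a few bits out of the open row buffer */
    uint32_t bank;    /* the bank is selected before the row */
    uint32_t row;     /* row read into the row buffer (8 KB per DIMM) */
} dram_coord;

static dram_coord decode(uint64_t paddr) {
    dram_coord c;
    c.column = (paddr >> 3)  & 0x3FF;   /* 1024 columns of 8 B each */
    c.bank   = (paddr >> 13) & 0x7;     /* 8 banks per device */
    c.row    = (paddr >> 16) & 0x3FFF;  /* 16K rows per bank */
    return c;
}

int main(void) {
    dram_coord c = decode(0x12345678ULL);
    printf("row %u, bank %u, column %u\n", c.row, c.bank, c.column);
    return 0;
}
```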
DRAM Access Inefficiencies - I
• Overfetch due to large row-buffers.
  • 8 KB is read into the row buffer to service a 64-byte cache line.
  • Row-buffer utilization for a single request is < 1% (64 B / 8 KB ≈ 0.8%).
• Why are row buffers so large?
  • Large arrays minimize cost-per-bit.
  • Striping a cache line across multiple chips (arrays) improves data transfer bandwidth.
DRAM Access Inefficiencies - II
• Open-page policy.
  • Row buffers are kept open in the hope that subsequent requests will be row-buffer hits.
• FR-FCFS (First-Ready FCFS) request scheduling.
  • The memory controller schedules requests to open row-buffers first (see the sketch below).
• Diminishing locality in multi-cores.

                    Access Latency   Access Energy
  Row-buffer Hit    ~75 cycles       ~18 nJ
  Row-buffer Miss   ~225 cycles      ~38 nJ
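A minimal sketch of FR-FCFS, assuming a simple array-based request queue and per-bank open-row bookkeeping (both illustrative): a request that hits an already-open row is scheduled first; otherwise the oldest request wins.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

#define NBANKS 8

typedef struct {
    unsigned bank, row;
    unsigned long arrival;  /* smaller = older */
    bool valid;
} request;

/* Row currently latched in each bank's row buffer; -1 = closed. */
static int open_row[NBANKS] = { -1, -1, -1, -1, -1, -1, -1, -1 };

/* FR-FCFS: oldest row-buffer hit first, then oldest request overall. */
static int fr_fcfs_pick(const request q[], size_t n) {
    int hit = -1, oldest = -1;
    for (size_t i = 0; i < n; i++) {
        if (!q[i].valid) continue;
        if (open_row[q[i].bank] == (int)q[i].row &&
            (hit < 0 || q[i].arrival < q[hit].arrival))
            hit = (int)i;
        if (oldest < 0 || q[i].arrival < q[oldest].arrival)
            oldest = (int)i;
    }
    return hit >= 0 ? hit : oldest;  /* -1 if the queue is empty */
}

int main(void) {
    request q[3] = {
        { 0, 5, 1, true },  /* oldest, but bank 0's row is closed   */
        { 1, 9, 2, true },  /* hits bank 1's open row: picked first */
        { 0, 5, 3, true },
    };
    open_row[1] = 9;
    printf("scheduled request %d\n", fr_fcfs_pick(q, 3));  /* prints 1 */
    return 0;
}
```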
Key Observation: Cache Block Access Pattern Within OS Pages
For heavily accessed pages in a given time interval, accesses are usually to a few cache blocks.
Outline
• DRAM Basics.
• Motivation.
• Basic Idea.
• Software Only Implementation (ROPS).
• Hardware Implementation (HAM).
• Results.
Basic Idea
Gather all heavily accessed chunks of independent OS pages and map them to the same DRAM row.
[Figure: 4 KB OS pages are split into 1 KB micro-pages; the hottest micro-pages from different pages are gathered into a reserved DRAM region, while the coldest micro-pages remain in regular DRAM memory.]
Basic Idea
• Identifying "hot" micro-pages.
  • Memory controller counters and an OS daemon (see the sketch below).
• Reserved rows in DRAM for hot micro-pages.
  • Simplifies book-keeping overheads; 4 MB capacity loss from a 4 GB system (< 0.1%).
• Epoch-based schemes.
  • Expose the epoch length to the OS for flexibility.
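The counting step might look like the sketch below: the memory controller bumps a per-micro-page counter on every access, and once per epoch an OS daemon ranks the counters and hands back the hottest micro-pages for migration. The table size, linear lookup, and 1 KB granularity arithmetic are illustrative assumptions.

```c
#include <stdint.h>
#include <stdlib.h>

#define MICRO_PAGE_SHIFT 10   /* 1 KB micro-pages */
#define MAX_TRACKED      4096 /* assumed counter-table capacity */
#define RESERVED_SLOTS   4096 /* 4 MB reserved region / 1 KB micro-pages */

typedef struct { uint64_t mpage; uint32_t count; } hot_entry;

static hot_entry counters[MAX_TRACKED];
static size_t nentries;

/* Memory-controller side: count an access to a micro-page.
 * (A real design would use a small hardware structure, not a scan.) */
void count_access(uint64_t paddr) {
    uint64_t mp = paddr >> MICRO_PAGE_SHIFT;
    for (size_t i = 0; i < nentries; i++)
        if (counters[i].mpage == mp) { counters[i].count++; return; }
    if (nentries < MAX_TRACKED)
        counters[nentries++] = (hot_entry){ mp, 1 };
}

static int by_count_desc(const void *a, const void *b) {
    const hot_entry *x = a, *y = b;
    return (x->count < y->count) - (x->count > y->count);
}

/* OS-daemon side, once per epoch: pick the hottest micro-pages
 * (up to the reserved-region capacity) and reset the counters. */
size_t end_of_epoch(uint64_t hot_out[]) {
    qsort(counters, nentries, sizeof counters[0], by_count_desc);
    size_t n = nentries < RESERVED_SLOTS ? nentries : RESERVED_SLOTS;
    for (size_t i = 0; i < n; i++)
        hot_out[i] = counters[i].mpage;
    nentries = 0;  /* start the next epoch fresh */
    return n;
}
```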
Software Only Implementation (ROPS)
[Figure: Baseline vs. ROPS address translation. In the baseline, a CPU memory request with virtual address X is translated by the TLB to physical address Y in the 4 GB main memory. Under Reduced OS Page size (ROPS), the TLB maps X to physical address Z in the 4 MB reserved DRAM region when the micro-page is hot; cold micro-pages stay in regular memory.]
• Shrink the OS page size to 1 KB.
• Every epoch:
  1. Migrate hot micro-pages (TLB shoot-down and page table update).
  2. Promote cold micro-pages to a superpage (page table/TLB updated).
Software Only Implementation (ROPS)
• Reduced OS Page Size (ROPS).
• Throughout the system, reduce the page size to 1 KB.
• Migrate hot micro-pages via DRAM-copy (see the epoch-handler sketch below).
  • Hot micro-pages live in the same row-buffer in the reserved DRAM region.
• Mitigate the reduction in TLB reach by promoting cold micro-pages to 4 KB superpages.
  • Superpage creation is facilitated by "reservation-based" page allocation.
  • Allocate four 1 KB micro-pages to contiguous DRAM frames.
  • This places contiguous virtual addresses in contiguous physical addresses → makes superpage creation easy.
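A sketch of what the per-epoch ROPS maintenance might look like. The kernel helpers (`vaddr_to_paddr`, `dram_copy`, `remap_page`, `tlb_shootdown`, `promote_to_superpage`) are hypothetical names for illustration, not the paper's actual interfaces.

```c
#include <stdint.h>
#include <stddef.h>

#define MICRO_PAGE_SIZE 1024u  /* 1 KB micro-pages */

/* Hypothetical kernel helpers, named for illustration only. */
extern uint64_t vaddr_to_paddr(uint64_t vaddr);
extern void dram_copy(uint64_t src_paddr, uint64_t dst_paddr, size_t len);
extern void remap_page(uint64_t vaddr, uint64_t new_paddr);  /* page-table update */
extern void tlb_shootdown(uint64_t vaddr);                   /* drop stale TLB entries */
extern void promote_to_superpage(uint64_t vaddr);            /* merge four contiguous
                                                                1 KB micro-pages */

/* Per-epoch ROPS maintenance:
 * 1. migrate hot micro-pages into the reserved DRAM rows, and
 * 2. promote runs of cold micro-pages back to 4 KB superpages
 *    to recover TLB reach. */
void rops_epoch(const uint64_t hot_vaddr[], const uint64_t hot_dst_paddr[],
                size_t nhot, const uint64_t cold_super_vaddr[], size_t ncold) {
    for (size_t i = 0; i < nhot; i++) {
        dram_copy(vaddr_to_paddr(hot_vaddr[i]), hot_dst_paddr[i],
                  MICRO_PAGE_SIZE);
        remap_page(hot_vaddr[i], hot_dst_paddr[i]);
        tlb_shootdown(hot_vaddr[i]);
    }
    for (size_t i = 0; i < ncold; i++)
        promote_to_superpage(cold_super_vaddr[i]);  /* updates page table/TLB */
}
```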
Hardware Implementation (HAM)
PhysicalAddress
X
New addr . Y
4 GB Main MemoryCPU Memory Request
4 MB ReservedDRAM region
Y
X Page A
Mapping Table
X Y
Old Address New Address
BaselineHardware Assisted Migration (HAM)
Hardware Implementation (HAM)
• Hardware Assisted Migration (HAM): a new level of address indirection.
  − Place data wherever you want in DRAM.
• Maintain a Mapping Table (MT); see the lookup sketch below.
  − Preserves the old physical addresses of migrated micro-pages.
• DRAM-copy hot micro-pages to the reserved rows.
• Populate/update the MT every epoch.
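The indirection might look like the sketch below: on every request the controller consults the Mapping Table; a hit redirects the access to the micro-page's new slot in the reserved region, a miss falls through to the original address. The table size and the linear-scan lookup are illustrative assumptions (a real controller would use a CAM or set-associative structure).

```c
#include <stdint.h>
#include <stdbool.h>
#include <stddef.h>

#define MICRO_PAGE_SHIFT 10   /* 1 KB micro-pages */
#define MT_ENTRIES       4096 /* assumed: one entry per reserved 1 KB slot */

typedef struct {
    uint64_t old_mpage;  /* original physical micro-page number */
    uint64_t new_mpage;  /* slot in the 4 MB reserved DRAM region */
    bool valid;
} mt_entry;

static mt_entry mapping_table[MT_ENTRIES];

/* Controller-side lookup on every memory request: redirect migrated
 * micro-pages, pass everything else through unchanged. */
uint64_t ham_translate(uint64_t paddr) {
    uint64_t mp     = paddr >> MICRO_PAGE_SHIFT;
    uint64_t offset = paddr & ((1u << MICRO_PAGE_SHIFT) - 1);
    for (size_t i = 0; i < MT_ENTRIES; i++)
        if (mapping_table[i].valid && mapping_table[i].old_mpage == mp)
            return (mapping_table[i].new_mpage << MICRO_PAGE_SHIFT) | offset;
    return paddr;  /* not migrated: the old address is still valid */
}

/* Epoch update: install an old-to-new mapping for a migrated micro-page. */
void mt_install(size_t slot, uint64_t old_mpage, uint64_t new_mpage) {
    mapping_table[slot] = (mt_entry){ old_mpage, new_mpage, true };
}
```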
Results: Schemes Evaluated
• Baseline.
• Oracle/Profiled: best-effort estimate of the expected benefit in the next epoch, based on a prior profiling run.
• Epoch-based ROPS and HAM: epoch lengths of 5M, 10M, 50M, and 100M cycles were evaluated. Trends are similar; best performance with 5M and 10M.
• Simics simulation platform.
• DRAMSim-based DRAM timing.
• DRAM timing and energy figures from Micron datasheets.
Simulation Parameters
  CPU                           4-core out-of-order CMP, 2 GHz
  L1 Inst. and Data Cache       Private, 32 KB/2-way, 1-cycle access
  L2 Unified Cache              Shared, 128 KB/8-way, 10-cycle access
  Total DRAM Capacity           4 GB
  DIMM Configuration            8 DIMMs, 1 rank/DIMM, 64-bit channel, 8 devices/DIMM
  Active Row-Buffers per DIMM   4
  DIMM-Level Row-Buffer Size    8 KB
Results: Accesses to Micro-Pages in Reserved Rows in an Epoch
[Figure: per-application bars showing the % of total accesses that go to micro-pages in the reserved rows, and the total number of 4 KB pages touched in an epoch.]
Results: 5M-cycle Epoch, ROPS, HAM, and ORACLE
[Figure: percent change in performance per application under ROPS, HAM, and ORACLE.]
• Hardware assisted migration offers better returns due to lower TLB management overheads.
• Applications with room for improvement show an average performance improvement of 9%.
• Apart from the 9% performance gains, our schemes also save energy.
Conclusions
• On average, for applications with room for improvement and with our best-performing scheme:
  • Performance ↑ 9% (max. 18%).
  • Memory energy consumption ↓ 18% (max. 62%).
  • Row-buffer utilization ↑ 38%.
• Hardware assisted migration offers better returns due to fewer overheads from TLB shoot-downs and misses.
• Future work: co-locate hot micro-pages that are accessed around the same time.