
DMA Cache Architecturally Separate I/O Data from CPU Data for Improving I/O Performance

Page 1: DMA Cache  Architecturally Separate I/O Data from CPU Data for Improving I/O Performance

INSTITUTE OF COMPUTING TECHNOLOGY

DMA Cache Architecturally Separate I/O Data from CPU Data for Improving I/O Performance

Dang Tang, Yungang Bao, Weiwu Hu, Mingyu Chen

2010.1

Institute of Computing Technology (ICT) Chinese Academy of Sciences (CAS)

Page 2: DMA Cache  Architecturally Separate I/O Data from CPU Data for Improving I/O Performance

The role of I/O

• I/O is ubiquitous
  • Load binary files: Disk → Memory
  • Browse the web, media streams: Network → Memory
  • …
• I/O is significant
  • Many commercial applications are I/O intensive: databases, etc.

Page 3: DMA Cache  Architecturally Separate I/O Data from CPU Data for Improving I/O Performance

State-of-the-Art I/O Technologies

• I/O Bus: 20GB/s
  • PCI-Express 2.0
  • HyperTransport 3.0
  • QuickPath Interconnect
• I/O Devices
  • SSD RAID: 1.2GB/s
  • 10GE: 1.25GB/s
  • Fusion-io: 8GB/s, 1M IOPS (2KB random 70/30 read/write mix)

Page 4: DMA Cache  Architecturally Separate I/O Data from CPU Data for Improving I/O Performance

Direct Memory Access (DMA)

• DMA is used for I/O operations in all modern computers
• DMA allows I/O subsystems to access system memory independently of the CPU
• Many I/O devices have DMA engines, including disk drive controllers, graphics cards, network cards, sound cards and GPUs

Page 5: DMA Cache  Architecturally Separate I/O Data from CPU Data for Improving I/O Performance


Outline

Revisiting I/O

DMA Cache Design

Evaluations

Conclusions

Page 6: DMA Cache  Architecturally Separate I/O Data from CPU Data for Improving I/O Performance

An Example of Disk Read: DMA Receiving Operation

[Figure: CPU, DMA Engine and Memory; the memory holds the Driver Buffer Descriptor and the Kernel Buffer; numbered steps (②, ③, ④) mark the operation.]

• Cache access latency: ~20 cycles
• Memory access latency: ~200 cycles
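To make the gap between the two latencies concrete, the sketch below computes the expected cycles per access for a given hit rate. Only the ~20 and ~200 cycle figures come from the slide; the function name and the simple two-level model (no overlap, no queuing) are assumptions made for illustration.

```python
CACHE_LATENCY_CYCLES = 20    # approximate figure quoted on the slide
MEMORY_LATENCY_CYCLES = 200  # approximate figure quoted on the slide


def average_latency(hit_rate: float) -> float:
    """Expected cycles per access when a fraction `hit_rate` hits in cache.

    Hypothetical model: a hit costs the cache latency and a miss costs
    the memory latency, with nothing overlapped.
    """
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be in [0, 1]")
    return (hit_rate * CACHE_LATENCY_CYCLES
            + (1.0 - hit_rate) * MEMORY_LATENCY_CYCLES)


if __name__ == "__main__":
    for h in (0.0, 0.5, 0.9, 1.0):
        print(f"hit rate {h:.0%}: {average_latency(h):.0f} cycles")
```

Under this model, every I/O buffer access that is served from a cache instead of memory saves roughly 180 cycles.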

Page 7: DMA Cache  Architecturally Separate I/O Data from CPU Data for Improving I/O Performance

Direct Cache Access [Ram-ISCA05]

[Figure: CPU, DMA Engine and Memory; the memory holds the Driver Buffer Descriptor and the Kernel Buffer; numbered steps (②, ③, ④) mark the operation.]

• This is a typical Shared-Cache Scheme
• Prefetch-Hint Approach [Kumar-Micro07]

Page 8: DMA Cache  Architecturally Separate I/O Data from CPU Data for Improving I/O Performance

Problems of Shared-Cache Scheme

• Cache pollution
• Cache thrashing
• Not suitable for other I/O
• Degrades performance when DMA requests are large (>100KB), as for the “Oracle + TPC-H” application

To address this problem in depth, we need to investigate the characteristics of I/O data.

Page 9: DMA Cache  Architecturally Separate I/O Data from CPU Data for Improving I/O Performance

I/O Data vs. CPU Data

[Figure: HMTT attached at the memory controller (MemCtrl) observes the combined I/O Data + CPU Data reference stream and separates it into I/O Data and CPU Data.]

Page 10: DMA Cache  Architecturally Separate I/O Data from CPU Data for Improving I/O Performance

A short ad for HMTT [Bao-Sigmetrics08]

A Hardware/Software Hybrid Memory Trace Tool
• Can support the DDR2 DIMM interface on multiple platforms
• Can collect full-system off-chip memory traces
• Can provide traces with semantic information, e.g.:
  • virtual address
  • process ID
  • I/O operation
• Can collect traces of commercial applications, e.g.:
  • Oracle
  • Web server

[Figure: The HMTT System]

Page 11: DMA Cache  Architecturally Separate I/O Data from CPU Data for Improving I/O Performance

Characteristics of I/O Data (1)

[Figure: % of memory references to I/O data]

[Figure: % of references of various I/O types]

Page 12: DMA Cache  Architecturally Separate I/O Data from CPU Data for Improving I/O Performance

Characteristics of I/O Data (2)

[Figure: I/O request size distribution]

Page 13: DMA Cache  Architecturally Separate I/O Data from CPU Data for Improving I/O Performance

Characteristics of I/O Data (3)

• Sequential access in I/O data
• Compared with CPU data, I/O data is very regular
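One simple way to quantify "regular" is the fraction of references that touch the block immediately following the previous one. The sketch below is a hypothetical metric, not the measurement used in the talk; the function name, the 64-byte block size and the sample traces are all assumptions.

```python
def sequential_fraction(addresses, block_size=64):
    """Fraction of references that access the block right after the previous one.

    Hypothetical metric: `block_size` (bytes) is an assumed cache block size.
    """
    blocks = [addr // block_size for addr in addresses]
    if len(blocks) < 2:
        return 0.0
    sequential = sum(1 for prev, cur in zip(blocks, blocks[1:])
                     if cur == prev + 1)
    return sequential / (len(blocks) - 1)


if __name__ == "__main__":
    # Made-up traces: a DMA-style streaming buffer vs. scattered CPU accesses.
    dma_like = [0x1000 + 64 * i for i in range(8)]
    cpu_like = [0, 640, 64, 4096]
    print(sequential_fraction(dma_like))
    print(sequential_fraction(cpu_like))
```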

Page 14: DMA Cache  Architecturally Separate I/O Data from CPU Data for Improving I/O Performance

Characteristics of I/O Data (4)

• Reuse Distance (RD)
  • LRU Stack Distance

[Figure: worked example of LRU stack distance on a reference stream over blocks 1 to 4, and a CDF plot of reuse distance (x% of references have RD <= n).]
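The reuse-distance measurement can be sketched in a few lines. This is a minimal illustration, not the tooling used for the talk: it assumes the common convention that the distance is the 1-based depth of the block in an LRU stack (1 means the block was the most recently used one) and that first-time references have infinite distance. Function names and the sample trace are hypothetical.

```python
import math


def lru_stack_distances(trace):
    """Return the LRU stack distance of every reference in `trace`."""
    stack = []       # least recently used first, most recently used last
    distances = []
    for block in trace:
        if block in stack:
            # Depth counted from the top (most recently used end) of the stack.
            depth = len(stack) - stack.index(block)
            stack.remove(block)
            distances.append(depth)
        else:
            distances.append(math.inf)  # cold reference, never seen before
        stack.append(block)
    return distances


def reuse_cdf(distances, n):
    """Fraction of references whose reuse distance is <= n (one CDF point)."""
    if not distances:
        return 0.0
    return sum(1 for d in distances if d <= n) / len(distances)


if __name__ == "__main__":
    trace = [1, 2, 3, 1, 2, 2]
    d = lru_stack_distances(trace)
    print(d)
    print(reuse_cdf(d, 1), reuse_cdf(d, 3))
```

With this convention, the CDF value at n is the hit rate of a fully associative LRU cache holding n blocks, which is why a smaller reuse distance means a smaller cache suffices.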

Page 15: DMA Cache  Architecturally Separate I/O Data from CPU Data for Improving I/O Performance

Characteristics of I/O Data (5)

[Figure: access patterns on I/O data, labelled DMA-W / CPU-R, CPU-RW / CPU-RW, and CPU-W / DMA-R.]

Page 16: DMA Cache  Architecturally Separate I/O Data from CPU Data for Improving I/O Performance

Rethink I/O & DMA Operation

• 20~40% of memory references are for I/O data in I/O-intensive applications
• Characteristics of I/O data are different from those of CPU data
  • An explicit produce-consume relationship for I/O data
  • Reuse distance of I/O data is smaller than that of CPU data
  • References to I/O data are primarily sequential

Separating I/O data and CPU data

Page 17: DMA Cache  Architecturally Separate I/O Data from CPU Data for Improving I/O Performance

Separating I/O data and CPU data

[Figure: two panels, "Before Separating" and "After Separating".]

Page 18: DMA Cache  Architecturally Separate I/O Data from CPU Data for Improving I/O Performance


Outline

Revisiting I/O

DMA Cache Design

Evaluations

Conclusions

Page 19: DMA Cache  Architecturally Separate I/O Data from CPU Data for Improving I/O Performance

DMA Cache Design Issues

• Write Policy
• Cache Coherence
• Replacement Policy
• Prefetching

Dedicated DMA Cache (DDC)

Page 20: DMA Cache  Architecturally Separate I/O Data from CPU Data for Improving I/O Performance

DMA Cache Design Issues

Write Policy | Cache Coherence | Replacement Policy | Prefetching

Write Policy:
• Adopt a write-allocate policy
• Both write-back and write-through policies are available
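The difference between the two write policies under write-allocate can be illustrated with a toy model. This is a sketch under stated assumptions, not the DMA cache design itself: the class, its fields and the unbounded capacity are hypothetical, and only the write-allocate, write-back and write-through behaviours are modelled.

```python
class WriteAllocateCache:
    """Toy write-allocate cache with a selectable write-back/write-through policy."""

    def __init__(self, write_back=True):
        self.write_back = write_back
        self.lines = {}          # block -> (data, dirty flag)
        self.memory = {}         # backing store
        self.memory_writes = 0   # traffic counter for comparison

    def write(self, block, data):
        # Write-allocate: the line is installed in the cache on a hit or a miss.
        if self.write_back:
            self.lines[block] = (data, True)    # defer the memory update
        else:
            self.lines[block] = (data, False)   # write-through: update memory now
            self._write_memory(block, data)

    def evict(self, block):
        data, dirty = self.lines.pop(block)
        if dirty:                               # only write-back lines can be dirty
            self._write_memory(block, data)

    def _write_memory(self, block, data):
        self.memory[block] = data
        self.memory_writes += 1


if __name__ == "__main__":
    for wb in (True, False):
        cache = WriteAllocateCache(write_back=wb)
        cache.write(0, "a")
        cache.write(0, "b")
        cache.evict(0)
        print("write-back" if wb else "write-through", cache.memory_writes)
```

In this model, repeated writes to the same block reach memory once under write-back and once per write under write-through.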

Page 21: DMA Cache  Architecturally Separate I/O Data from CPU Data for Improving I/O Performance

DMA Cache Design Issues

Write Policy | Cache Coherence | Replacement Policy | Prefetching

Cache Coherence:
• IO-ESI protocol for the WT policy
• IO-MOESI protocol for the WB policy

The only difference between IO-MOESI/IO-ESI and the original protocols is that the local source and the probe source of state transitions are exchanged.

Page 22: DMA Cache  Architecturally Separate I/O Data from CPU Data for Improving I/O Performance


A Big Issue

How to prove the correctness of integrating the heterogeneous cache coherency protocols in a system?

Page 23: DMA Cache  Architecturally Separate I/O Data from CPU Data for Improving I/O Performance

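A Global State Method for Heterogeneous Cache Coherence Protocol [Pong-SPAA93, Pong-JACM98]

[Figure: the states of the DMA cache and the CPU caches are composed into one global state. Per-cache states O, S, I give the global state OS+I+ (√, valid); per-cache states M, I, S give MS+I+ (X, conflict). Transitions among the global states EI+, MI+ and S+I+ are labelled R|E, W|* and R|I.]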
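The composition of per-cache states into a global state, and the check for a conflict, can be sketched as follows. This is an illustration only, not the verification procedure of [Pong-SPAA93, Pong-JACM98]: the function names are hypothetical, "+" is written for any number (one or more) of S or I copies, and the conflict rule is the standard MOESI invariant (an M or E copy excludes every other valid copy, and there is at most one O copy).

```python
from collections import Counter

ORDER = "MOESI"  # order in which state letters are written in a global state


def global_state(cache_states):
    """Compose per-cache states, e.g. ['O', 'S', 'I'], into a string like 'OS+I+'."""
    counts = Counter(cache_states)
    unknown = set(counts) - set(ORDER)
    if unknown:
        raise ValueError(f"unknown cache states: {sorted(unknown)}")
    parts = []
    for letter in ORDER:
        if counts[letter] == 0:
            continue
        if letter in "SI":
            parts.append(letter + "+")              # one or more such caches
        else:
            parts.append(letter * counts[letter])   # M, O, E written per copy
    return "".join(parts)


def is_conflict(cache_states):
    """True if the combination violates the standard MOESI invariants."""
    counts = Counter(cache_states)
    exclusive = counts["M"] + counts["E"]
    if exclusive > 1:
        return True
    if exclusive == 1 and (counts["S"] > 0 or counts["O"] > 0):
        return True
    return counts["O"] > 1


if __name__ == "__main__":
    for states in (["O", "S", "I"], ["M", "I", "S"]):
        verdict = "conflict" if is_conflict(states) else "ok"
        print(global_state(states), verdict)
```

The two cases in the usage example are the ones shown on the slide: OS+I+ is accepted and MS+I+ is rejected.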

Page 24: DMA Cache  Architecturally Separate I/O Data from CPU Data for Improving I/O Performance

Global State Cache Coherence Theorem

Given N (N>1) well-defined cache protocols, they do not conflict if and only if there does not exist any Conflict Global State in the global state transition machine.

[Figure: global state transition machine with the states S+I+, EI+, I+, MI+ and OS+I+; transitions are labelled R|*, W|*, R|I, R|M and R|E.]

5 Global States, all conflict-free (√):
• S+I+
• EI*
• I*
• MI*
• OS*I*

Page 25: DMA Cache  Architecturally Separate I/O Data from CPU Data for Improving I/O Performance

MOESI + ESI

[Figure: global state transition machine for the combined protocols with the states S+I+, ECI+, I+, MCI+, EDI+ and OCS+I+; transitions are labelled R*|*, R*|I, RC|E, RC|I, RC|M, RC|*, RD|I, RD|E, RD|SI, RD|*, WC|I, WC|*, WD|I, WD|SI and WD|*.]

6 Global States, all conflict-free (√):
• S+I+
• ECI*
• I*
• MCI*
• EDI*
• OCS*I*

Page 26: DMA Cache  Architecturally Separate I/O Data from CPU Data for Improving I/O Performance

DMA Cache Design Issues

Write Policy | Cache Coherence | Replacement Policy | Prefetching

Replacement Policy: an LRU-like replacement policy, with victims chosen in this order of state:
1. Invalid
2. Shared
3. Owned
4. Exclusive
5. Modified
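A victim-selection routine for such a policy might look like the sketch below. It reads the numbered list as an eviction priority by coherence state, with LRU as the tie-breaker among lines in the same state; that reading, and the names `Line`, `STATE_PRIORITY` and `choose_victim`, are assumptions rather than details given in the talk.

```python
from dataclasses import dataclass

# Lower number = evicted first, following the order listed on the slide.
STATE_PRIORITY = {"I": 0, "S": 1, "O": 2, "E": 3, "M": 4}


@dataclass
class Line:
    tag: int
    state: str       # one of "I", "S", "O", "E", "M"
    last_used: int   # timestamp of the most recent access


def choose_victim(cache_set):
    """Pick the line to evict: cheapest state first, then least recently used."""
    return min(cache_set,
               key=lambda line: (STATE_PRIORITY[line.state], line.last_used))


if __name__ == "__main__":
    cache_set = [
        Line(tag=1, state="M", last_used=0),
        Line(tag=2, state="S", last_used=5),
        Line(tag=3, state="S", last_used=2),
        Line(tag=4, state="E", last_used=1),
    ]
    print(choose_victim(cache_set).tag)
```

Preferring Invalid and Shared lines keeps Modified lines, which would cost a write-back, in the cache longest.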

Page 27: DMA Cache  Architecturally Separate I/O Data from CPU Data for Improving I/O Performance

DMA Cache Design Issues

Write Policy | Cache Coherence | Replacement Policy | Prefetching

Prefetching:
• Adopt straightforward sequential prefetching
• Prefetching is triggered by a cache miss
• Fetch 4 blocks at a time
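Miss-triggered sequential prefetching can be modelled in a few lines. The sketch below assumes that "4 blocks" means the missing block plus the next three, and it ignores capacity and replacement; the class name and counters are hypothetical.

```python
class SequentialPrefetchCache:
    """Toy cache model: every miss fetches `degree` consecutive blocks."""

    def __init__(self, degree=4):
        self.degree = degree
        self.resident = set()   # unbounded capacity (assumption)
        self.hits = 0
        self.misses = 0

    def access(self, block):
        """Access one block number; return True on a hit."""
        if block in self.resident:
            self.hits += 1
            return True
        self.misses += 1
        # Fetch the missing block and the blocks that follow it.
        self.resident.update(range(block, block + self.degree))
        return False


if __name__ == "__main__":
    cache = SequentialPrefetchCache(degree=4)
    for block in range(16):          # a purely sequential DMA-style stream
        cache.access(block)
    print(cache.hits, cache.misses)
```

On a purely sequential stream only one access in four misses, which is why regular I/O data suits this simple scheme.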

Page 28: DMA Cache  Architecturally Separate I/O Data from CPU Data for Improving I/O Performance

Design Complexity vs. Design Cost

• Dedicated DMA Cache (DDC)
• Partition-Based DMA Cache (PBDC)

Page 29: DMA Cache  Architecturally Separate I/O Data from CPU Data for Improving I/O Performance


Outline

Revisiting I/O

DMA Cache Design

Evaluations

Conclusions

Page 30: DMA Cache  Architecturally Separate I/O Data from CPU Data for Improving I/O Performance


Speedup of Dedicated DMA Cache

Page 31: DMA Cache  Architecturally Separate I/O Data from CPU Data for Improving I/O Performance

% of Valid Prefetched Blocks

• DMA caches can exhibit impressively high prefetching accuracy
• This is because I/O data has a very regular access pattern

Page 32: DMA Cache  Architecturally Separate I/O Data from CPU Data for Improving I/O Performance

Performance Comparisons

Although PBDC does not require additional on-chip storage, it can achieve about 80% of DDC’s performance improvements.

Page 33: DMA Cache  Architecturally Separate I/O Data from CPU Data for Improving I/O Performance


Outline

Revisiting I/O

DMA Cache Design

Evaluations

Conclusions

Page 34: DMA Cache  Architecturally Separate I/O Data from CPU Data for Improving I/O Performance

Conclusions

• We have proposed a DMA cache technique to separate I/O data and CPU data
• We adopt a Global State Method for integrating heterogeneous cache protocols
• Experimental results show that DMA cache schemes are better than the existing approaches that use unified, shared caches for I/O data and CPU data

Still open problems, e.g.:
• Can I/O data go directly to the L1 cache?
• How to design heterogeneous caches for different types of data?
• How to optimize the MC with awareness of I/O?

Page 35: DMA Cache  Architecturally Separate I/O Data from CPU Data for Improving I/O Performance

Thanks! & Questions?

Page 36: DMA Cache  Architecturally Separate I/O Data from CPU Data for Improving I/O Performance

RTL Emulation Platform

• LLC and DMA cache model from Loongson-2F
• DDR2 memory controller from Loongson-2F
• DDR2 DIMM model from Micron Technology

[Figure: block diagram with Memory trace, LL Cache, DMA Cache, MemCtrl and DDR2 DIMM.]

Page 37: DMA Cache  Architecturally Separate I/O Data from CPU Data for Improving I/O Performance


Parameters

DDR2-666

Page 38: DMA Cache  Architecturally Separate I/O Data from CPU Data for Improving I/O Performance

Normalized Speedup for WB

• The baseline is the snoop cache scheme
• DMA cache schemes exhibit better performance than the others

Page 39: DMA Cache  Architecturally Separate I/O Data from CPU Data for Improving I/O Performance

DMA Write & CPU Read Hit Rate

• Both the shared cache and the DMA cache exhibit high hit rates
• Then, where do the cycles go in the shared cache scheme?

Page 40: DMA Cache  Architecturally Separate I/O Data from CPU Data for Improving I/O Performance


Breakdown of Normalized Total Cycles

Page 41: DMA Cache  Architecturally Separate I/O Data from CPU Data for Improving I/O Performance


Design Complexity of PBDC

Page 42: DMA Cache  Architecturally Separate I/O Data from CPU Data for Improving I/O Performance

More References on Cache Coherence Protocol Verification

• Fong Pong, Michel Dubois. Formal verification of complex coherence protocols using symbolic state models. Journal of the ACM (JACM), v.45 n.4, p.557-587, July 1998.
• Fong Pong, Michel Dubois. Verification techniques for cache coherence protocols. ACM Computing Surveys (CSUR), v.29 n.1, p.82-126, March 1997.

