Decoupled Direct Memory Access
Isolating CPU and IO Traffic by Leveraging a Dual-Data-Port DRAM
Donghyuk Lee, Lavanya Subramanian, Rachata Ausavarungnirun, Jongmoo Choi, Onur Mutlu
Logical System Organization
[Figure: processor, main memory, and IO devices, with CPU access and IO access paths]
Main memory connects the processor and IO devices as an intermediate layer.
Physical System Implementation
[Figure: processor, main memory, and IO devices, with one CPU access path and two IO access paths]
• High pin cost in the processor
• High contention in the memory channel
Our Approach
[Figure: processor, main memory, and IO devices, with one CPU access path and two IO access paths]
Enabling an IO channel that is decoupled and isolated from the CPU channel
Executive Summary
• Problem
  – CPU and IO accesses contend for the shared memory channel
• Our Approach: Decoupled Direct Memory Access (DDMA)
  – Design a new DRAM architecture with two independent data ports → Dual-Data-Port DRAM
  – Connect one port to the CPU and the other port to IO devices → Decouple CPU and IO accesses
• Application
  – Communication between compute units (e.g., CPU–GPU)
  – In-memory communication (e.g., bulk in-memory copy/init.)
  – Memory-storage communication (e.g., page fault, IO prefetch)
• Result
  – Significant performance improvement (20% in a 2-channel, 2-rank system)
  – CPU pin count reduction (4.5%)
Outline
1. Problem
2. Our Approach
3. Dual-Data-Port DRAM
4. Applications for DDMA
5. Evaluation
Problem 1: Memory Channel Contention
[Figure: a processor chip (CPU, memory controller, DMA, and an IO interface for graphics, network, storage, and USB) connected to the main-memory DRAM chip; the label "Memory Channel Contention" marks the link between the two chips]
Problem 1: Memory Channel Contention
[Figure: time spent on CPU-GPU communication, as a fraction of execution time (0% to 100%), per benchmark; 33.5% on average]
A large fraction of the execution time is spent on IO accesses.
Problem 2: High Cost for IO Interfaces
[Figure: processor pin count breakdown into power, memory (2ch), IO interface, and others. With power pins: 959 pins in total, IO interface 10.6%. Without power pins: 359 pins in total, IO interface 28.4%.]
Integrating the IO interface on the processor chip leads to high area cost.
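As a quick sanity check on the pin-count slide, the two percentages can be turned back into absolute pin counts. The sketch below is only arithmetic on the quoted figures. It assumes that 10.6% pairs with the 959-pin total and 28.4% with the 359-pin total, and that the two totals differ only by the power pins. The variable names are illustrative.

```python
# Figures quoted on the slide.
PINS_WITH_POWER = 959       # total processor pins, power pins included
PINS_WITHOUT_POWER = 359    # total processor pins, power pins excluded
IO_SHARE_WITH_POWER = 0.106
IO_SHARE_WITHOUT_POWER = 0.284

# Assumption: each percentage applies to the total it is paired with above.
io_pins_a = PINS_WITH_POWER * IO_SHARE_WITH_POWER
io_pins_b = PINS_WITHOUT_POWER * IO_SHARE_WITHOUT_POWER

# Assumption: the two totals differ only by the power pins.
power_pins = PINS_WITH_POWER - PINS_WITHOUT_POWER

print(f"IO interface pins: {io_pins_a:.1f} vs {io_pins_b:.1f}")
print(f"Power pins: {power_pins}")
```

Under these assumptions both percentages describe the same IO interface of roughly 102 pins, which suggests the two charts are consistent views of one pin budget.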
Shared Memory Channel
• Memory channel contention between IO accesses and CPU accesses
• High area cost for integrating IO interfaces on the processor chip
Our Approach
[Figure: the processor chip holds the CPU, memory controller, and DMA control; a separate DMA chip holds the DMA IO interface for graphics, network, storage, and USB, and is driven over a control channel. The main-memory DRAM chip is a Dual-Data-Port DRAM with Port 1 and Port 2.]
Our Approach: Decoupled Direct Memory Access
[Figure: the processor chip (CPU, memory controller, DMA control) and a separate DMA chip (DMA IO interface for graphics, network, storage, and USB) linked by a control channel. The Dual-Data-Port DRAM exposes Port 1 and Port 2; the diagram marks a CPU ACCESS path and an IO ACCESS path.]
Background: DRAM Operation
[Figure: the memory controller at the CPU connects to the DRAM over the memory channel (data channel and control channel). Commands such as activate and read arrive at the control port; a bank becomes READY and its data leaves through the data port.]
DRAM peripheral logic: i) controls banks, and ii) transfers data over the memory channel.
Problem: Single Data Port
[Figure: the memory controller at the CPU issues reads over the control channel; several banks are READY, but all data leaves through one data port onto the data channel]
Many banks, single data port: requests are served serially due to the single data port.
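Problem: Single Data Port
[Timing diagram, single data port: two RD commands on the control port and two DATA bursts on the one data port, over time]
[Timing diagram, two data ports: two RD commands on the control port, with one DATA burst on Data Port 1 and one on Data Port 2, over time]
What about a DRAM with two data ports?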
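The timing argument on this slide can be sketched with a toy model. It is not taken from the talk: the cycle counts (`CMD_CYCLES`, `BURST_CYCLES`) and the function `finish_time` are assumptions chosen only to show how a second data port lets data bursts overlap, while commands still share one control port. Real DRAM timing constraints are ignored.

```python
CMD_CYCLES = 1    # assumed: cycles one command occupies the shared control port
BURST_CYCLES = 4  # assumed: cycles one data burst occupies a data port


def finish_time(port_per_request, num_ports):
    """Return the cycle at which the last data burst completes.

    port_per_request lists, in issue order, the data port each read uses.
    Commands are serialized on the control port; each burst waits until
    its own data port is free.
    """
    control_free = 0
    port_free = [0] * num_ports
    last_done = 0
    for port in port_per_request:
        issue = control_free
        control_free = issue + CMD_CYCLES
        start = max(control_free, port_free[port])
        port_free[port] = start + BURST_CYCLES
        last_done = max(last_done, port_free[port])
    return last_done


single = finish_time([0, 0], num_ports=1)  # both reads share one data port
dual = finish_time([0, 1], num_ports=2)    # one read per data port
print(f"single data port: {single} cycles, two data ports: {dual} cycles")
```

With these assumed parameters, the two reads finish in 9 cycles on a single data port and in 6 cycles when each read uses its own port. The remaining gap comes from the shared control port.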
Dual-Data-Port DRAM
[Figure: in the periphery, the bank data bus feeds two muxes, one to Port 1 (upper) and one to Port 2 (lower). Data port 1 and data port 2 each connect to their own data channel; a port select signal arrives over the control channel at the control port.]
Twice the bandwidth and independent data ports with low overhead
Overhead: area 1.6% ↑, pins 20 ↑
DDP-DRAM Memory System
[Figure: data port 1 connects to the CPU channel and the memory controller at the CPU; data port 2 connects to the IO channel and the DDMA IO interface. The control channel with port select drives the control port and the two muxes in the periphery.]
Three Data Transfer Modes
• CPU Access: access through the CPU channel
  – DRAM read/write with CPU port selection
• IO Access: access through the IO channel
  – DRAM read/write with IO port selection
• Port Bypass: direct transfer between channels
  – DRAM access with port bypass selection
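The three modes can be summarized as a mapping from a port selection to a data path. The sketch below is illustrative only. The `PortSelect` names, the `data_path` helper, and the endpoint strings are hypothetical, and the direction convention for port bypass (a "read" moves data toward the CPU channel) is an assumption rather than something stated on the slide.

```python
from enum import Enum


class PortSelect(Enum):
    """Hypothetical encoding of the port selection carried with a command."""
    CPU = "cpu"
    IO = "io"
    BYPASS = "bypass"


def data_path(select, is_read):
    """Return (source, destination) of the data for one access."""
    if select is PortSelect.CPU:
        path = ("bank", "cpu_channel")
    elif select is PortSelect.IO:
        path = ("bank", "io_channel")
    else:
        # Port bypass: data moves between the two channels.
        # Assumed convention: a read flows toward the CPU channel.
        path = ("io_channel", "cpu_channel")
    # A write reverses the direction of the same path.
    return path if is_read else (path[1], path[0])


for mode in PortSelect:
    print(mode.name, "read:", data_path(mode, True), "write:", data_path(mode, False))
```

The point of the sketch is that CPU Access and IO Access differ only in which port is selected, while Port Bypass connects the two channels to each other.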
1. CPU Access Mode
[Figure: the memory controller at the CPU issues a read over the control channel with the CPU channel selected; the READY bank's data passes through the mux to data port 1 and onto the CPU channel. Data port 2, the IO channel, and the DDMA IO interface are also shown.]
2. IO Access Mode
[Figure: the memory controller at the CPU issues a read over the control channel with the IO channel selected; the READY bank's data passes through the mux to data port 2 and onto the IO channel toward the DDMA IO interface. Data port 1 and the CPU channel are also shown.]
3. Port Bypass Mode
[Figure: the memory controller at the CPU uses the control channel with port bypass selected; data moves directly between data port 1 (CPU channel) and data port 2 (IO channel, DDMA IO interface)]
Three Applications for DDMA
• Communication between compute units
  – CPU-GPU communication
• In-memory communication and initialization
  – Bulk page copy/initialization
• Communication between memory and storage
  – Serving page fault/file read & write
1. Compute Unit ↔ Compute Unit
[Figure: CPU → GPU transfer. The CPU and the GPU each have a memory controller, a DDMA ctrl., a DDP-DRAM, and a DDMA IO interface. Over the ctrl. channel, a read with IO sel. is issued at the source and a write with IO sel. at the destination; data moves from the source DDMA IO interface to the destination DDMA IO interface, and an Ack. is returned.]
Transfer data through DDMA without interfering with CPU/GPU memory accesses.
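The CPU → GPU flow on this slide (read with IO select at the source, write with IO select at the destination, then an acknowledgement) can be sketched as a toy model. Everything here is hypothetical: the `DDPDram` class, the `ddma_transfer` function, the word-granularity addresses, and the transfer counters are illustrative and do not describe the actual hardware interface.

```python
class DDPDram:
    """Toy dual-data-port DRAM: one store, two counted data paths."""

    def __init__(self):
        self.cells = {}
        self.cpu_channel_transfers = 0
        self.io_channel_transfers = 0

    def read(self, addr, via_io):
        self._count(via_io)
        return self.cells.get(addr, 0)

    def write(self, addr, value, via_io):
        self._count(via_io)
        self.cells[addr] = value

    def _count(self, via_io):
        if via_io:
            self.io_channel_transfers += 1
        else:
            self.cpu_channel_transfers += 1


def ddma_transfer(src, src_addr, dst, dst_addr, length):
    """Copy `length` words using only the IO ports, then acknowledge."""
    for offset in range(length):
        word = src.read(src_addr + offset, via_io=True)   # read with IO select
        dst.write(dst_addr + offset, word, via_io=True)   # write with IO select
    return "ack"


cpu_mem, gpu_mem = DDPDram(), DDPDram()
for i, value in enumerate([10, 20, 30, 40]):
    cpu_mem.write(0x100 + i, value, via_io=False)  # CPU fills the source buffer

cpu_before = cpu_mem.cpu_channel_transfers
status = ddma_transfer(cpu_mem, 0x100, gpu_mem, 0x200, length=4)
print(status, [gpu_mem.cells[0x200 + i] for i in range(4)])
print("CPU-channel transfers during copy:", cpu_mem.cpu_channel_transfers - cpu_before)
```

In this model the copy adds nothing to either CPU-channel counter, which is the property the slide claims for DDMA: the transfer does not occupy the channels used for CPU or GPU memory accesses.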
2. In-Memory Communication
[Figure: the CPU has a memory controller, a DDMA ctrl., a DDP-DRAM, and a DDMA IO interface. Over the ctrl. channel, a read with IO sel. is issued at the source and a write with IO sel. at the destination, both in the DDP-DRAM.]
Transfer data in DRAM through DDMA without interfering with CPU memory accesses.
3. Memory ↔ Storage
[Figure: the CPU has a memory controller, a DDMA ctrl., a DDP-DRAM (destination), and a DDMA IO interface; the storage device (source) has its own DDMA IO interface. The diagram shows an Acc. Storage request, an Ack., and a write with IO sel. over the ctrl. channel.]
Transfer data from storage through DDMA without interfering with CPU memory accesses.
Evaluation Methods
• System
  – Processor: 4–16 cores
  – LLC: 16-way associative, 512KB private cache-slice/core
  – Memory: 1–4 ranks and 1–4 channels
• Workloads
  – Memory intensive: SPEC CPU2006, TPC, stream (31 benchmarks)
  – CPU-GPU communication intensive: polybench (8 benchmarks)
  – In-memory communication intensive: apache, bootup, compiler, filecopy, mysql, fork, shell, memcached (8 in total)
Performance (2 Channel, 2 Rank)
[Figure: performance improvement (0% to 25%) for 4-core, 8-core, and 16-core systems, in two panels: CPU-GPU Comm.-Intensive and In-Memory Comm.-Intensive]
High performance improvement; more performance improvement at higher core count.
Performance on Various Systems
[Figure: performance improvement (0% to 40%) in two panels: by channel count (1ch, 2ch, 4ch) and by rank count (1 rank, 2 rank, 4 rank)]
Performance increases with rank count.
DDMA vs. Doubling Channel
[Figure: two panels comparing 1ch, 1ch DDMA, and 2ch systems: performance (0% to 180%) and processor pin count (0 to 1200, with values 959, 915, and 1103)]
DDMA achieves higher performance at lower processor pin count.
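The pin-count comparison can be restated as relative changes against the 1-channel baseline. The sketch below assumes the three values on the slide map, in the order they appear, to 1ch (959), 1ch DDMA (915), and 2ch (1103). That mapping is inferred from the label order, and the helper name is illustrative.

```python
# Assumed mapping of the slide's three pin-count values to configurations.
PIN_COUNT = {"1ch": 959, "1ch DDMA": 915, "2ch": 1103}


def relative_change(config, baseline="1ch"):
    """Percentage change in processor pin count relative to the baseline."""
    base = PIN_COUNT[baseline]
    return 100.0 * (PIN_COUNT[config] - base) / base


for name in ("1ch DDMA", "2ch"):
    print(f"{name}: {relative_change(name):+.1f}% pins vs. 1ch")
```

Under that mapping, DDMA lowers the pin count by about 4.6%, close to the 4.5% reduction quoted in the summary, while doubling the channel count raises it by about 15%.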
Conclusion
• Problem
  – CPU and IO accesses contend for the shared memory channel
• Our Approach: Decoupled Direct Memory Access (DDMA)
  – Design a new DRAM architecture with two independent data ports → Dual-Data-Port DRAM
  – Connect one port to the CPU and the other port to IO devices → Decouple CPU and IO accesses
• Application
  – Communication between compute units (e.g., CPU–GPU)
  – In-memory communication (e.g., bulk in-memory copy/init.)
  – Memory-storage communication (e.g., page fault, IO prefetch)
• Result
  – Significant performance improvement (20% in a 2-channel, 2-rank system)
  – CPU pin count reduction (4.5%)
System Overhead
[Figure: a conventional system (processor, DRAM, IO devices) next to the proposed system (processor, DDP-DRAM, DDMA-IO, IO devices), placed along a cost axis running from Low to High]
DDMA reduces the more expensive on-chip area while increasing the less expensive off-chip area.
Channel Utilization Analysis
[Figure: channel utilization (0% to 100%) of the CPU and IO channels for CPU-GPU communication-intensive workloads, across channel-count and rank-count configurations; bars are split into Both Channels Busy and Single Channel Busy]
Simultaneous channel utilization → performance improvement