Toward Cache-Friendly Hardware Accelerators
Yakun Sophia Shao, Sam Xi, Viji Srinivasan, Gu-Yeon Wei, David Brooks
More accelerators.
Out-of-Core Accelerators
Maltiel Consulting estimates
[Die photo from Chipworks] [Accelerators annotated by Sophia Shao @ Harvard]
Shao (Harvard) estimates
OMAP 4 SoC
Today’s SoC
[Diagram: ARM Cores, GPU, and DSPs on a System Bus with DMA; accelerator blocks (SD, USB, Audio, Video, Face, Imaging), each with its own scratchpad memory (SPM)]
Cache-Friendly Accelerator Interface
• Coherent Accelerator Processor Interface
  – Virtual Addressing & Data Caching
  – Easier, Natural Programming Model
Power 8
PCIe Bus
Not one size fits all.
• Different applications have different memory requirements.
• Need to customize their memory designs.
Infrastructure Building
[Diagram: SoC with Big Cores, Small Cores, GPU, and Accelerators behind Shared Resources and a Memory Interface. Simulator labels: GPGPU-Sim; gem5’s CPU Model; gem5’s Cache Model w/ CACTI; gem5’s DRAM Model; Aladdin (Accelerator-Specific Datapath, Private L1/Scratchpad); Shared Memory/Interconnect Models]
Unmodified C-Code
Accelerator Design Parameters (e.g., # FU, mem. BW)
Power/Area
Performance
“Accelerator Simulator” Design Accelerator-Rich SoC Fabrics and Memory Systems
Programmability
Aladdin: A pre-RTL, Power-Performance Accelerator Simulator
[ISCA’2014] http://vlsiarch.eecs.harvard.edu/accelerators
Cache Customization
• TLB Designs:
  – TLB can be expensive.
    • Performance: TLB miss.
    • Resource/Power: Hardware TLB design.
  – But accelerator’s TLB accesses are very likely to be regular.
• Cache Prefetcher Designs:
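The claim that regular accelerator accesses are easy on the TLB can be illustrated with a toy model. The sketch below is not from the slides: it assumes a fully associative LRU TLB with 8 entries and 4 KiB pages, and all names (`tlb_misses`, `streaming`, `page_strided`) are hypothetical. It compares a streaming scan with a pattern that jumps to a new page on every access.

```python
from collections import OrderedDict

PAGE_SIZE = 4096  # bytes; assumed page size, not taken from the slides


def tlb_misses(addresses, entries=8, page_size=PAGE_SIZE):
    """Count misses in a toy fully associative LRU TLB with `entries` slots."""
    tlb = OrderedDict()  # virtual page number -> None, ordered by recency
    misses = 0
    for addr in addresses:
        vpn = addr // page_size
        if vpn in tlb:
            tlb.move_to_end(vpn)  # refresh recency on a hit
        else:
            misses += 1
            tlb[vpn] = None
            if len(tlb) > entries:
                tlb.popitem(last=False)  # evict the least recently used page
    return misses


def streaming(n_bytes, stride=4):
    """Regular pattern: walk a buffer word by word."""
    return range(0, n_bytes, stride)


def page_strided(n_pages, repeats, page_size=PAGE_SIZE):
    """Irregular-for-the-TLB pattern: one word per page, round robin."""
    return [p * page_size for _ in range(repeats) for p in range(n_pages)]


if __name__ == "__main__":
    stream = streaming(16 * PAGE_SIZE)
    print("streaming accesses:", len(stream), "misses:", tlb_misses(stream))
    hop = page_strided(16, 4)
    print("page-strided accesses:", len(hop), "misses:", tlb_misses(hop))
```

In this model the streaming scan misses once per page no matter how many words it reads, while the page-hopping pattern misses on every access once its footprint exceeds the TLB capacity.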
Inefficient Bulk Data Transfer
• DMA is very efficient in getting data.
• Cache fetches data at cache line granularity.
• Cache prefetcher customization.
Benchmark: kmp
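A rough sketch of why line-granularity fetching is costly for bulk data, and how a prefetcher narrows the gap. This is an illustrative model only: the 64-byte line size, the prefetch-on-miss next-line policy, and the function names are assumptions, not the design evaluated in the talk.

```python
LINE_SIZE = 64  # bytes; assumed cache line size


def dma_requests(n_bytes):
    """A DMA engine moves the whole buffer with one bulk request."""
    return 1 if n_bytes > 0 else 0


def cache_line_fills(start, n_bytes, line_size=LINE_SIZE):
    """Number of distinct cache lines a sequential scan must fetch."""
    if n_bytes <= 0:
        return 0
    first = start // line_size
    last = (start + n_bytes - 1) // line_size
    return last - first + 1


def demand_misses_with_prefetch(start, n_bytes, degree=1, line_size=LINE_SIZE):
    """Demand misses for a sequential scan when every demand miss also
    brings in the next `degree` lines (simple prefetch-on-miss policy)."""
    if n_bytes <= 0:
        return 0
    present = set()  # lines already in the cache
    misses = 0
    first = start // line_size
    last = (start + n_bytes - 1) // line_size
    for line in range(first, last + 1):
        if line not in present:
            misses += 1
            present.add(line)
            for k in range(1, degree + 1):
                present.add(line + k)  # prefetched lines
    return misses


if __name__ == "__main__":
    n = 4096
    print("DMA requests:", dma_requests(n))
    print("cache line fills:", cache_line_fills(0, n))
    for d in (0, 1, 3):
        print("prefetch degree", d, "demand misses:",
              demand_misses_with_prefetch(0, n, degree=d))
```

The model counts requests only and ignores latency and bandwidth; it is meant to show the shape of the trade-off, not to reproduce the kmp results.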