NOC-BASED SUPPORT OF HETEROGENEOUS CACHE-COHERENCE MODELS FOR ACCELERATORS ACM/IEEE NOCS 2018 Torino, Italy Davide Giri Paolo Mantovani Luca P. Carloni Columbia University New York, USA
Page 1: NOC-BASED SUPPORT OF HETEROGENEOUS …davide_giri/pdf/giri_nocs18_slides.pdfSOC TRENDS o Heterogeneity o Custom accelerators o NoC o Shared memory October 4th, 2018 ACM/IEEE NOCS 2018,

NOC-BASED SUPPORT OF HETEROGENEOUS CACHE-COHERENCE MODELS FOR ACCELERATORS

ACM/IEEE NOCS 2018

Torino, Italy

Davide Giri

Paolo Mantovani

Luca P. Carloni

Columbia University

New York, USA


SOC TRENDS

o Heterogeneity

o Custom accelerators

o NoC

o Shared memory

ACM/IEEE NOCS 2018, TORINO, ITALY. October 4th, 2018

NVIDIA Parker, 2016.

Xilinx Everest, 2018.

Mobileye EyeQ5, 2020.

Qualcomm Snapdragon 835, 2017.

Challenges

o Scalability

o Programmability


LOOSELY-COUPLED ACCELERATORS

Major speedups and energy savings:

o Highly parallel and customized datapath

o Aggressively banked private local memory (PLM)

What should the cache coherence model for accelerators be?

o We identified 3 main models in the literature


ACCELERATOR MODELS: FULLY COHERENT

Coherent with entire cache hierarchy

o Same coherence model as the processor

Programming requirements

o Race free accelerator execution

Implementation variants

o Generally bus-based

o Accelerators may own a cache

✓ IBM CAPI, [Y. Shao et al., MICRO ‘16], [M. J. Lyons et al., TACO ‘12]

× ARM ACE-lite


ACCELERATOR MODELS: NON COHERENT

Not coherent with cache hierarchy

o Caches are by-passed

Programming requirements

o Race free accelerator execution

o Flush all caches prior to accelerator execution

Implementation variants

o Generally NoC-based and DMA-based

o [Y. Chen et al., ICCD ‘13], [E. Cota et al., DAC ‘15], [Y. Shao et al., MICRO ‘16]


ACCELERATOR MODELS: LLC COHERENT

Coherent with LLC only

o Processors’ private caches are by-passed

Programming requirements

o Race free accelerator execution

o Flush processors’ private caches prior to accelerator execution

Implementation variants

o No implementation in literature

o First proposed by [E. Cota et al., DAC ‘15]
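
The three models differ mainly in which caches software must flush before an accelerator starts. The lookup below is a minimal Python sketch of those programming requirements. The names `CoherenceModel` and `caches_to_flush`, and the labels "private" and "llc", are illustrative assumptions and not part of any real driver.

```python
from enum import Enum


class CoherenceModel(Enum):
    FULLY_COHERENT = "fully-coherent"
    LLC_COHERENT = "llc-coherent"
    NON_COHERENT = "non-coherent"


def caches_to_flush(model):
    """Return the cache levels to flush before the accelerator executes."""
    if model is CoherenceModel.FULLY_COHERENT:
        # Same coherence model as the processor: nothing to flush.
        return set()
    if model is CoherenceModel.LLC_COHERENT:
        # Processors' private caches are bypassed, so they must be flushed.
        return {"private"}
    if model is CoherenceModel.NON_COHERENT:
        # The whole cache hierarchy is bypassed: flush all caches.
        return {"private", "llc"}
    raise ValueError(f"unknown coherence model: {model!r}")
```

Race-free accelerator execution is required in all three cases and is not modeled here.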


CONTRIBUTIONS

Protocol.

o Variation of MESI to support 3 coherence models for accelerators (NoC-based)

Coherence Models.

o Show how each model can outperform the others in some cases

o Show that the best choice of model varies at runtime

Architecture. Design of a multi-core NoC-based architecture that supports:

o Three models of coherence for accelerators

o Run-time selection of the coherence model for each accelerator

o Coexistence of heterogeneous coherence models for accelerators


OUR SOC PLATFORM

Our design is based on an instance of Embedded Scalable Platforms (ESP) [L. P. Carloni, DAC ‘16]

o Socketed tiles

o NoC

o Easy integration and reuse of heterogeneous components


We added a cache hierarchy to ESP

o Now it can run multi-processor and multi-accelerator applications on Linux SMP


ESP: NOC

o 2D-mesh

o 1 cycle hops

o 6 physical planes to prevent deadlock and to provide sufficient bandwidth

o Point-to-point ordering required to prevent deadlock
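
With 1-cycle hops on a 2D mesh, a lower bound on the traversal time between two tiles follows from their Manhattan distance. The sketch below assumes tiles are addressed by hypothetical (x, y) coordinates. It ignores injection, ejection, contention and packet serialization, none of which are quantified in the slides.

```python
def min_hops(src, dst):
    """Minimum number of hops between two (x, y) tiles on a 2D mesh."""
    (sx, sy), (dx, dy) = src, dst
    return abs(sx - dx) + abs(sy - dy)


def min_traversal_cycles(src, dst, cycles_per_hop=1):
    """Lower bound on traversal cycles, assuming a fixed cost per hop."""
    return min_hops(src, dst) * cycles_per_hop
```

For example, `min_traversal_cycles((0, 0), (2, 3))` counts 5 hops and therefore at least 5 cycles.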


ESP: PROCESSOR TILE

Main components

o Single processor core

o L2 private cache (added for this work)

In this work

o Up to 2 processor tiles

o 64KB private caches

o Off-the-shelf processor with L1 write-through caches


ESP: MEMORY TILE

Main components

o Memory controller

o LLC and directory (added for this work; can be split over multiple tiles)

In this work

o Up to 2 memory tiles

o Up to 2MB aggregate LLC


ESP: ACCELERATOR TILE

Main components

o Any accelerator complying with a simple interface

o A small TLB

o A DMA controller and/or a private cache (added for this work)


Support for run-time selection of coherence model through one I/O write to the configuration registers
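
As a rough illustration of selecting the model with a single I/O write, the toy model below stands in for an accelerator tile's configuration registers. The register offset, the numeric encoding and the names `AcceleratorTile` and `select_coherence` are all hypothetical. The slides do not give the real register map.

```python
# Hypothetical register offset and encoding, chosen only for this sketch.
COHERENCE_REG_OFFSET = 0x00
ENCODING = {"non-coherent": 0, "llc-coherent": 1, "fully-coherent": 2}


class AcceleratorTile:
    """Toy stand-in for a tile's memory-mapped configuration registers."""

    def __init__(self):
        self.regs = {COHERENCE_REG_OFFSET: ENCODING["non-coherent"]}
        self.io_writes = 0  # counts register writes issued to the tile

    def io_write(self, offset, value):
        self.regs[offset] = value
        self.io_writes += 1


def select_coherence(tile, model):
    """Select the coherence model for one accelerator with one I/O write."""
    tile.io_write(COHERENCE_REG_OFFSET, ENCODING[model])
```

Because the setting is per tile, different accelerators can run under different models at the same time.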


OUR PROTOCOL

We modified a classic MESI directory-based cache-coherence protocol

o to make it work over a NoC (atomic operations)

o to support all coherence models for accelerators (recalls, flush, LLC-coherent requests)


Directory controller

o Write-back: add a Valid state and dirty bit

o Recalls

o Flush

o LLC-coherent read/write requests

Private cache controller

o L1 invalidation

o Recalls

o Flush

o Atomic operations


OUR PROTOCOL: DIRECTORY CONTROLLER EXCERPT


State \ Request | LLC-coherent Read                                      | LLC-coherent Write

Invalid         | Read memory; send data to requestor; go to Valid state | Read memory if misaligned; write to LLC; go to Valid state

Valid           | Send data to requestor                                 | Write to LLC

Shared          | -                                                      | -

Exclusive       | -                                                      | -

Modified        | -                                                      | -
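
The Invalid and Valid rows of the excerpt can be sketched as a small state machine. This is a toy Python model under several assumptions: a line holds `LINE_WORDS` words, "misaligned" is read as "the write does not cover the whole line", and a write sets the dirty bit. All names are illustrative. The Shared, Exclusive and Modified rows are not specified in the excerpt, so they are left unimplemented.

```python
LINE_WORDS = 4  # assumed line size in words, for illustration only


class LLCLine:
    """One LLC line as seen by the directory controller (toy model)."""

    def __init__(self):
        self.state = "Invalid"
        self.data = None
        self.dirty = False


def _check_state(line):
    if line.state not in ("Invalid", "Valid"):
        # Shared, Exclusive and Modified are marked "-" in the excerpt.
        raise NotImplementedError(f"{line.state} is not covered by the excerpt")


def llc_coherent_read(line, memory, addr):
    """Handle an LLC-coherent read and return the data sent to the requestor."""
    _check_state(line)
    if line.state == "Invalid":
        line.data = list(memory[addr])  # read memory
        line.state = "Valid"            # go to Valid state
    return list(line.data)              # send data to requestor


def llc_coherent_write(line, memory, addr, offset, words):
    """Handle an LLC-coherent write of `words` starting at word `offset`."""
    _check_state(line)
    if offset < 0 or offset + len(words) > LINE_WORDS:
        raise ValueError("write does not fit in one line")
    if line.state == "Invalid":
        covers_line = offset == 0 and len(words) == LINE_WORDS
        # Assumed meaning of "read memory if misaligned": fetch the line
        # first when the write covers only part of it.
        line.data = [0] * LINE_WORDS if covers_line else list(memory[addr])
        line.state = "Valid"            # go to Valid state
    line.data[offset:offset + len(words)] = words  # write to LLC
    line.dirty = True  # assumption: the write-back LLC marks the line dirty
```

Here `memory` is a plain dictionary mapping a line address to a list of words. Main memory is not updated by the write, which matches a write-back LLC.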


EXPERIMENTAL SETUP

We designed 4 custom accelerators:

o Sort (merge and bitonic sort combined)

o Sparse Matrix-Vector Multiplication

o FFT-1D and FFT-2D

These accelerators represent a good mix of memory access pattern characteristics:

o Varying footprint size (32KB – 20MB)

o Streaming vs. irregular pattern

o Temporal and spatial locality


ESP’s GUI: The CAD flow from GUI to bitstream is fully automated.

We deployed our SoC on FPGA and we executed applications on Linux SMP.


RESULTS: SINGLE ACCELERATOR


[Figure: speedup and DRAM accesses for a single accelerator. Annotations: 0.5x and 20x. Four regions are marked "LLC winning".]

NC = non-coherent

LLC = LLC-coherent


RESULTS: MULTIPLE ACCELERATORS


[Figure: speedup and DRAM accesses with multiple accelerators.]

NC = non-coherent

LLC = LLC-coherent

Dataset size: 256KB to 512KB


RESULTS: FULLY-COHERENT ACCELERATORS


The fully-coherent model can win for workloads whose data structures fit the accelerator’s private cache.

No flush needed.

[Figure: speedup of the three models. Two regions are marked "FC winning".]

NC = non-coherent

LLC = LLC-coherent

FC = fully-coherent


RESULTS: SUMMARY

o The best coherence model varies with the accelerator workload size and with the number of active accelerators in the system.

o LLC-coherent and fully-coherent models can significantly reduce accesses to DRAM.


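RULE OF THUMB

[Figure: best model as a function of the approximate memory footprint of the workload. Fully-coherent model up to about the private cache size, LLC-coherent model between the private cache size and the LLC size, non-coherent model beyond the LLC size.]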
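
The rule of thumb can be written as a simple threshold function. This is only a sketch: the function name is made up, the default sizes are the 64KB private caches and 2MB LLC used in this work, and the thresholds are approximate. The results also show that the number of active accelerators matters, which this function ignores.

```python
KB = 1024
MB = 1024 * KB


def best_model(footprint_bytes, private_cache_bytes=64 * KB, llc_bytes=2 * MB):
    """Pick a coherence model from the workload's memory footprint."""
    if footprint_bytes <= private_cache_bytes:
        return "fully-coherent"   # data fits the accelerator's private cache
    if footprint_bytes <= llc_bytes:
        return "LLC-coherent"     # data fits the LLC
    return "non-coherent"         # data exceeds the LLC
```

For the footprint range used in the experiments, `best_model(32 * KB)` selects the fully-coherent model and `best_model(20 * MB)` the non-coherent one.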


CONCLUSIONS

o There is no absolute winner among the coherence models.

o Workload size, cache sizes and the number of active accelerators influence the best choice → hence, the best choice can vary at runtime.

o We proposed a cache-coherence protocol that supports all three coherence models in a NoC-based SoC:

o Fully-coherent, LLC-coherent, non-coherent.

o We designed a NoC-based SoC architecture enabling:

o Coexistence of heterogeneous coherence models operating simultaneously.

o Run-time selection of the coherence model for each accelerator.


THANK YOU!

NOC-BASED SUPPORT OF HETEROGENEOUS

CACHE-COHERENCE MODELS FOR ACCELERATORS

Any questions?

Davide Giri

Paolo Mantovani

Luca P. Carloni


BACKUP


ESP: PROGRAMMABILITY

o The accelerator driver is invoked by an application to offload a task.

o Accelerator tiles handle virtual memory without interrupting the processor cores

o We use locks to enforce race free execution of the accelerators. Additionally:

o During the execution of non-coherent accelerators, we ensure that there exists only a single copy of the data.

o For LLC-coherent accelerators data can be present both in DRAM and in the LLC.

o The flush phase becomes a negligible overhead for large accelerator workloads
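
The offload sequence described above (take a lock, flush what the chosen model requires, then run) can be sketched as follows. The `offload` function and its `flush` and `run` hooks are hypothetical placeholders supplied by the caller. This is not the actual ESP driver interface.

```python
import threading


def offload(lock, levels_to_flush, flush, run):
    """Run one accelerator invocation under a lock, flushing first.

    lock            -- guards the data shared with the accelerator
    levels_to_flush -- cache levels the chosen coherence model requires flushing
    flush, run      -- caller-supplied callables (hypothetical hooks)
    """
    with lock:
        for level in levels_to_flush:
            flush(level)
        return run()


# Usage: record the order of events for a non-coherent accelerator.
events = []
result = offload(
    threading.Lock(),
    ["private", "llc"],
    flush=lambda level: events.append(f"flush {level}"),
    run=lambda: events.append("run") or "done",
)
```

Holding the lock across both the flush and the run keeps other threads from touching the shared data in between, which is what keeps the accelerator execution race free in this sketch.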


ESP: CACHES

o Designed in SystemC and implemented through HLS.

o Configurable number of sets, ways, sharers and owners.

o The device driver can select which caches to flush.

For this work:

o LLC: 2 MB

o Private caches: 64KB


OUR PROTOCOL: DIRECTORY CONTROLLER EXCERPT

o Put the whole table and list of features: Valid state, Recalls, DMA requests.

o Make an example with a zig-zag timing diagram, basically for a DMA request.

o Slide with list of features for L2.
