  • Partitioning and Analysis of the Network-on-Chip on a COTS Many-Core Platform

    Matthias Becker, Borislav Nikolić, Dakshina Dasari, Benny Åkesson, Vincent Nélis, Moris Behnam, Thomas Nolte

    RTAS, Pittsburgh, 18 April 2017

  • Many-Core processors are developed with large core counts (64, 256, 1024 cores).

    How to use them?

    ● Execute Large Applications that Utilize all/many Cores
    ● Consolidate Many Applications on the Cores/Cluster

  • Outline

    ● System Model
    ● Motivation
    ● Partitioning the NoC
    ● WCTT Analysis for the partitioned NoC
    ● Setting the traffic shaping parameters
    ● Evaluation
    ● Conclusions

  • Kalray MPPA Many-Core Platform: Overview

    ● 256 Cores on one Processor
    ● 16 Compute Clusters, each with:
      ● 16 Compute Cores
      ● 1 Resource Management Core
      ● Local Memory
    ● 4 I/O Subsystems, each containing 4 Compute Cores

    [Figure: 4x4 grid of compute clusters with the surrounding I/O subsystems: North - DDR, South - DDR, West - Ethernet, East - Ethernet]

  • Kalray MPPA Many-Core Platform: The Network-on-Chip (1)

    ● 2D-Torus Topology
    ● 2 topologically identical NoCs
      ● D-NoC for data communication
      ● C-NoC for control messages

    [Figure: NoC topology linking the compute clusters with the I/O subsystems DDR0, DDR1, Ethernet0 and Ethernet1]

  • Kalray MPPA Many-Core Platform: The Network-on-Chip (2)

    ● Wormhole Switching
    ● Output Buffer
    ● Round Robin Arbitration

    [Figure: a NoC router with its North, South, East, West and Cluster ports; each output has a FIFO buffer fed through round-robin (RR) arbitration over the incoming directions]
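    As a side note on the router behaviour above, a minimal Python sketch of round-robin arbitration at one output buffer; the direction names and queue contents are illustrative, not the Kalray hardware.

```python
from collections import deque

# Minimal model of one router output port: per-direction packet queues served
# by round-robin arbitration, one whole packet at a time (with wormhole
# switching a packet keeps the output once its header has been granted).
DIRECTIONS = ["North", "South", "East", "West", "Cluster"]

def round_robin(queues: dict[str, deque]) -> list[str]:
    """Return the order in which queued packets leave the output buffer."""
    order, idx = [], 0
    while any(queues.values()):
        direction = DIRECTIONS[idx % len(DIRECTIONS)]
        if queues[direction]:
            order.append(queues[direction].popleft())
        idx += 1
    return order

# Example: packets from two clusters meet at the same output port.
queues = {d: deque() for d in DIRECTIONS}
queues["West"].extend(["A1", "A2"])      # two packets arriving from the West
queues["Cluster"].extend(["B1"])         # one packet injected by the local cluster
print(round_robin(queues))               # ['A1', 'B1', 'A2']
```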

  • Kalray MPPA Many-Core Platform: The Network-on-Chip (3)

    ● No Flow Control on the Link Level
    ● Flow Regulation on the Source Nodes
      ● Packet Shaper: splits the application payload into NoC packets, each with a header (H)
      ● Traffic Limiter: enforces
        ● Window Size Tw
        ● Bandwidth Quota (budget in flits per window)
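    A rough sketch of the two source-side mechanisms, assuming the limiter grants a budget of flits per window of Tw cycles; the function names, the fixed-window accounting and the example values are illustrative assumptions, not the Kalray register interface.

```python
# Packet shaper: cut a payload into NoC packets with a 4-flit header.
# Traffic limiter: delay packets so at most `quota` flits enter the NoC
# per window of `t_w` cycles (simplified fixed-window accounting).

def packet_shaper(payload_flits: int, max_payload: int = 62, header: int = 4) -> list[int]:
    """Return the size in flits (header included) of every NoC packet."""
    sizes, remaining = [], payload_flits
    while remaining > 0:
        chunk = min(remaining, max_payload)
        sizes.append(chunk + header)
        remaining -= chunk
    return sizes

def traffic_limiter(packet_sizes: list[int], quota: int, t_w: int) -> list[int]:
    """Earliest injection cycle of each packet under the (t_w, quota) budget."""
    start_cycles, flits_sent = [], 0
    for size in packet_sizes:
        window = flits_sent // quota        # budgets already consumed
        start_cycles.append(window * t_w)   # wait for a window with remaining budget
        flits_sent += size
    return start_cycles

pkts = packet_shaper(payload_flits=200)
print(pkts)                                        # [66, 66, 66, 18]
print(traffic_limiter(pkts, quota=100, t_w=1000))  # [0, 0, 1000, 1000]
```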

  • Application Model (1)

    ● Applications on each cluster need to access the NoC
      ● Exchanging messages
      ● Accessing off-chip memory
    ● Applications operate on read-execute-write semantics

  • Application Model (2)

    ● Each application has a number of:
      ● Read requests
      ● Write requests

    [Figure: timeline of one application: a read request travels over the NoC to the I/O (Delta_Req), the data comes back (Delta_Resp), execution follows, and the results are written back (Delta_write)]
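    The read-execute-write model can be written down compactly; the field names and the sequential combination of the phase delays below are illustrative assumptions, not the paper's formal model.

```python
from dataclasses import dataclass

# Illustrative encoding of the application model: each application issues a
# number of read and write requests and follows the read -> execute -> write
# pattern. Field names and the example values are placeholders.

@dataclass
class Application:
    n_reads: int    # read requests per activation
    n_writes: int   # write requests per activation
    d_req: int      # cycles for a read request to reach the I/O (Delta_Req)
    d_resp: int     # cycles for the data to come back (Delta_Resp)
    d_write: int    # cycles for one write to reach off-chip memory (Delta_write)
    d_exec: int     # pure execution time on the compute cluster

    def iteration_latency(self) -> int:
        """One read-execute-write iteration, assuming requests are served sequentially."""
        return (self.n_reads * (self.d_req + self.d_resp)
                + self.d_exec
                + self.n_writes * self.d_write)

app = Application(n_reads=2, n_writes=1, d_req=300, d_resp=4500, d_exec=20000, d_write=4800)
print(app.iteration_latency())  # 34400
```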

  • Motivation

    ● Analysis of the NoC is non-trivial
      ● Many architectural features pose challenges (buffers, traffic limiters, routing, …)

    Pessimistic estimates → larger task WCETs → less efficient platform usage

    [Figure: two applications on different clusters interfering on the shared NoC]

  • Contributions

    ● A NoC organization that reduces contention by partitioning
    ● A timing analysis for the partitioned NoC
    ● A method to configure the flow regulation on source nodes

  • Partitioning the NoC

    ● Avoid horizontal communication
      ● Clusters communicate with the closest I/O subsystem
      ● Each cluster sends messages via the I/O subsystem
        ● Most NoC packets target loading of code
        ● Cluster-to-cluster messages go through the I/O subsystem
    ● 8 identical NoC partitions

    [Figure: the NoC split into 8 identical partitions, each connecting two compute clusters to their closest I/O subsystem]
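    The partitioning can be captured as a small configuration table. Only the pairing of clusters 0 and 4 is taken from the slides; every other assignment below is hypothetical and serves only to illustrate the idea of 8 identical two-cluster partitions.

```python
# Hypothetical partition table: 8 identical partitions, each pairing two
# compute clusters with the closest I/O interface. Only the (0, 4) pairing is
# taken from the slides; the remaining assignments are made up for illustration.
PARTITIONS = {
    0: {"clusters": (0, 4),   "io": "DDR0"},
    1: {"clusters": (1, 5),   "io": "DDR0"},
    2: {"clusters": (2, 6),   "io": "DDR1"},
    3: {"clusters": (3, 7),   "io": "DDR1"},
    4: {"clusters": (8, 12),  "io": "Ethernet0"},
    5: {"clusters": (9, 13),  "io": "Ethernet0"},
    6: {"clusters": (10, 14), "io": "Ethernet1"},
    7: {"clusters": (11, 15), "io": "Ethernet1"},
}

def partition_of(cluster: int) -> int:
    """Look up the partition a compute cluster belongs to."""
    for pid, part in PARTITIONS.items():
        if cluster in part["clusters"]:
            return pid
    raise ValueError(f"unknown cluster {cluster}")

# Traffic never crosses into another partition's links, so only the partner
# cluster can interfere on the path to the shared I/O node.
assert partition_of(0) == partition_of(4)
```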

  • WCTT Analysis in the Partitioned NoC: Overview

    ● 3 cases to analyze:
      ● Sending a request message on the C-NoC (WCTT_C-NoC)
      ● Sending data on the D-NoC to the I/O subsystem (WCTT_CC→IO)
      ● Receiving data on the D-NoC (WCTT_IO→CC)

    [Figures: the two D-NoC scenarios inside one partition, "Compute Cluster to I/O" and "I/O to Compute Cluster", with clusters A and B sharing the path to/from the I/O system]

  • WCTT Analysis in the Partitioned NoC: WCTT_CC→IO (1)

    [Figure: Compute Cluster to I/O scenario; a transfer from cluster A experiences a flow regulation delay at the source and a round robin delay where its flow merges with the flow of cluster B on the way to the I/O system]

  • WCTT Analysis in the Partitioned NoC: WCTT_CC→IO (2) - The Traffic Limiter

    ● Based on what criteria to select the bandwidth quota?
      ● WCTT_CC→IO
      ● Buffer occupation in the shared router buffer

    [Figure: WCTT [cycles] and buffer occupation [flit] plotted against the flow regulation budget [flit] (67 to 567 flits). Below N_min the WCTT is not minimal; above N_max the occupation exceeds the available buffer and the buffer in the router overflows.]

  • WCTT Analysis in the Partitioned NoC: WCTT_CC→IO (3)

    ● Observations from the traffic limiter settings:
      ● The shared router buffer transmits a flit in each cycle
      ● Faster injection at the source node has no impact on the WCTT

    WCTT_CC→IO = (d_pkt + d_RR) · (N_pkt − 1) + d_RR + C_route

    ● (d_pkt + d_RR) · (N_pkt − 1): RR-blocking and transmission from the buffer of all but the last packet
    ● d_RR: RR-blocking of the last packet
    ● C_route: transmission of the last packet over the NoC without interference
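    Read as pseudo-code, the bound is simply the sum of the three terms; the sketch below uses the reconstructed, descriptive symbol names (the transcript does not preserve the paper's original notation) and illustrative numbers.

```python
# Sketch of the WCTT_CC->IO bound with descriptive, reconstructed term names.

def wctt_cc_to_io(n_pkt: int, d_pkt: int, d_rr: int, c_route: int) -> int:
    """Worst-case traversal time in cycles for a message of n_pkt packets.

    d_pkt   -- cycles to drain one packet from the shared router buffer
    d_rr    -- worst-case round-robin blocking before a packet is served
    c_route -- cycles for the last packet to cross the NoC without interference
    """
    first_packets = (d_pkt + d_rr) * (n_pkt - 1)   # all but the last packet
    return first_packets + d_rr + c_route          # last packet: blocking + free traversal

# Example with the packet size used in the evaluation (62 payload + 4 header
# flits, one flit per cycle) and one interfering cluster in the partition;
# the c_route value is purely illustrative.
d_pkt = 62 + 4
print(wctt_cc_to_io(n_pkt=8, d_pkt=d_pkt, d_rr=d_pkt, c_route=d_pkt + 10))  # 1066
```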

  • Determining the Parameters for the Traffic Limiter (1)

    ● Two cases:
      ● Find N_min such that WCTT_CC→IO is minimal
      ● Find N_max such that the shared router buffer does not overflow

  • Determining the Parameters for the Traffic Limiter (2): N_min

    ● Shared router buffer:

    [Figure: cumulative data [flit] over time [cycles]; the flits that arrive in the buffer are shaped by the traffic limiter, the flits that depart from the buffer are shaped by the RR-interference; one departure segment is marked]

    ● Set N_min such that the flits that arrive during one departure segment equal the flits that leave the buffer in the same time
      ● The resulting constraint on the window size Tw and the budget is solved numerically (binary search, ILP, …)
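    Since the slide only names the solution methods (binary search, ILP, …), here is a minimal binary-search sketch for N_min, assuming a monotone WCTT bound is already available as a function of the budget; the toy cost model is made up and not the paper's analysis.

```python
# Binary search for N_min: the smallest flow-regulation budget whose WCTT
# already equals the plateau reached by the largest budget.

def n_min(wctt, lo: int, hi: int) -> int:
    """Smallest quota in [lo, hi] whose WCTT equals the minimal (plateau) value."""
    target = wctt(hi)                # largest budget gives the minimal WCTT
    while lo < hi:
        mid = (lo + hi) // 2
        if wctt(mid) <= target:
            hi = mid                 # mid is already minimal, try smaller budgets
        else:
            lo = mid + 1
    return lo

def toy_wctt(quota: int) -> int:
    # Made-up, monotone cost model: source-limited below 260 flits per window,
    # minimal and flat above it.
    return 260 + max(0, 260 - quota) * 40

print(n_min(toy_wctt, lo=67, hi=567))   # 260
```

    N_max would be found analogously from the dual condition: the largest budget whose worst-case backlog in the shared router buffer still fits into the available 401 flits.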

  • Evaluation

    ● Experiments evaluate different aspects of the work
      ● Measurements on the Kalray MPPA platform
      ● Case study of an engine management system
    ● All experiments are based on parameters of the Kalray MPPA:
      ● D-NoC packet payload = 62 flit
      ● C-NoC packet payload = 2 flit
      ● Header size = 4 flit
      ● Router: switching delay = 1 cycle, channel delay = 1 cycle
      ● Buffer size = 401 flit
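    With these parameters, the number of D-NoC packets behind a transfer follows from simple arithmetic; the sketch below assumes a flit width of 4 bytes, which is not stated on the slide.

```python
import math

# How a payload is split into D-NoC packets using the evaluation parameters
# (62 payload flits and 4 header flits per packet). FLIT_BYTES is an assumption.
PAYLOAD_FLITS_PER_PACKET = 62
HEADER_FLITS = 4
FLIT_BYTES = 4  # assumed flit width

def dnoc_packets(payload_bytes: int) -> tuple[int, int]:
    """Return (number of packets, total flits on the wire) for a payload."""
    payload_flits = math.ceil(payload_bytes / FLIT_BYTES)
    packets = math.ceil(payload_flits / PAYLOAD_FLITS_PER_PACKET)
    total_flits = payload_flits + packets * HEADER_FLITS
    return packets, total_flits

# e.g. a 16 KB read as used in the measurements:
print(dnoc_packets(16 * 1024))   # -> (67, 4364)
```

    The resulting packet count is what plays the role of N_pkt in the WCTT bound above.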

  • Evaluation: Total Read Latency on the MPPA (1)

    ● Measuring the time to read [1, 16] KB of data from off-chip memory to a compute cluster
    ● Varying the number of clusters accessing memory through the same I/O node: [16, 8, 4, 2] clusters
    ● The latencies on clusters 0 and 4 are observed
      ● They represent one NoC partition
    ● Each data point represents the maximum observed value out of 10000 samples
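    Each plotted value is a maximum over repeated runs; a trivial sketch of that reduction, where read_from_ddr and cycle_counter are hypothetical stand-ins for the on-target benchmark code.

```python
# Each plotted point is the maximum of 10000 observed latencies for one
# (read size, number of competing clusters) configuration.
SAMPLES = 10_000

def measure_max_latency(read_kib: int, read_from_ddr, cycle_counter) -> int:
    """Return the worst observed latency in cycles for reads of read_kib KiB."""
    worst = 0
    for _ in range(SAMPLES):
        start = cycle_counter()
        read_from_ddr(read_kib * 1024)          # blocking read over the D-NoC
        worst = max(worst, cycle_counter() - start)
    return worst
```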

  • Evaluation: Total Read Latency on the MPPA (2)

    [Figure: maximum observed read latency in ms over read sizes of 1 KB to 16 KB, for 16, 8, 4 and 2 clusters sharing the I/O node, measured on clusters C0 and C4]

  • Evaluation: Simulation-based Case Study (1)

    ● Engine Management System (EMS)
      ● 15 runnables with periods [5, 10, 20, 100] ms
      ● Footprint [7076, 17424] bytes (code + data)
    ● Each runnable loads its footprint from the off-chip memory at the beginning of its execution and writes it back at the end

    [Figure: one NoC partition with two compute clusters running the EMS and the I/O subsystem, where a Reorder Core (RC) manages the memory requests]

  • Evaluation: Simulation-based Case Study (2)

    [Figures: per-runnable latency in cycles (M1 to M15), comparing the analysis bound with the maximum observed value and the value in isolation; one plot for writing to memory (WCTT_CC→IO) and one for reading from memory (WCTT_C-NoC + WCTT_IO→CC)]

  • Conclusions and Future Work

    ● The shared NoC is one of the main sources of interference
    ● It is difficult to analyze due to different architectural features
    ● Novel NoC partitioning scheme to reduce interference and ease the analysis
    ● Tailored analysis for the partition
      ● Configuration of the traffic limiter to avoid buffer overflow and to guarantee minimum transmission times
    ● Focus on the memory access within the I/O subsystem
      ● The handling of requests affects the overall latency of memory accesses

  • Thank you for your attention! Questions?

