Partitioning and Analysis of the Network-
on-Chip on a COTS Many-Core Platform
Matthias Becker, Borislav Nikolić, Dakshina Dasari, Benny Åkesson, Vincent Nélis, Moris Behnam, Thomas Nolte
RTAS, Pittsburgh, April 18, 2017
Many-core processors are developed with large core counts (64, 256, 1024 cores).
How to use them?
● Execute large applications that utilize all/many cores
● Consolidate many applications on the cores/clusters
Outline
● System Model
● Motivation
● Partitioning the NoC
● WCTT Analysis for the partitioned NoC
● Setting the traffic-shaping parameters
● Evaluation
● Conclusions
Kalray MPPA Many-Core Platform: Overview
● 256 cores on one processor
● 16 compute clusters, each with:
  ● 16 compute cores
  ● 1 resource-management core
  ● Local memory
● 4 I/O subsystems, each containing 4 compute cores
[Figure: 4x4 grid of compute clusters with the four I/O subsystems on the edges: North - DDR, South - DDR, West - Ethernet, East - Ethernet]
Kalray MPPA Many-Core Platform: The Network-on-Chip (1)
● 2D-torus topology
● 2 topologically identical NoCs:
  ● D-NoC for data communication
  ● C-NoC for control messages
[Figure: NoC connecting the compute clusters and the DDR and Ethernet I/O subsystems]
Kalray MPPA Many-Core Platform: The Network-on-Chip (2)
● Wormhole switching
● Output buffers
● Round-robin arbitration
[Figure: router model with input links from North, South, East, West, and the local cluster; each output port has a FIFO with round-robin (RR) arbitration]
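As a concrete illustration of the round-robin arbitration listed above, here is a minimal sketch in Python. The five input directions and the per-output FIFO come from the slide; the flit-level granularity and the example queue contents are simplifying assumptions of this sketch (with wormhole switching, the real router arbitrates at packet granularity).

```python
from collections import deque

def rr_arbitrate(queues, start):
    """One arbitration round: starting from index `start`, pick the next
    non-empty input queue and pop one flit from it.
    Returns (flit, winner_index), or (None, start) if all queues are empty."""
    n = len(queues)
    for i in range(n):
        idx = (start + i) % n
        if queues[idx]:
            return queues[idx].popleft(), idx
    return None, start

# Five input directions of a router feeding one shared output port.
inputs = {d: deque() for d in ["North", "South", "East", "West", "Cluster"]}
inputs["North"].extend(["n0", "n1"])
inputs["Cluster"].append("c0")

order = list(inputs)                  # fixed round-robin order
queues = [inputs[d] for d in order]
ptr, served = 0, []
while any(queues):
    flit, winner = rr_arbitrate(queues, ptr)
    served.append((order[winner], flit))
    ptr = (winner + 1) % len(queues)  # pointer advances past the winner

print(served)  # fair alternation: n0 (North), c0 (Cluster), n1 (North)
```

With the pointer advancing past each winner, no backlogged input is served twice before every other backlogged input has been served once.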
Kalray MPPA Many-Core Platform: The Network-on-Chip (3)
● No flow control at the link level
● Flow regulation at the source nodes:
  ● Packet shaper: splits the application payload into NoC packets, each prefixed with a header (H)
  ● Traffic limiter: enforces a window size T_w and a bandwidth quota β
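The two traffic-limiter parameters can be read as a sliding-window budget. The sketch below assumes (this is our reading, not a statement of the exact MPPA semantics) that at most β flits may be injected in any window of T_w cycles:

```python
from collections import deque

class TrafficLimiter:
    """Sliding-window flow regulator, a sketch of the assumed semantics:
    at most `beta` flits may be injected in any window of `t_w` cycles."""
    def __init__(self, t_w, beta):
        self.t_w, self.beta = t_w, beta
        self.sent = deque()            # cycles at which recent flits left

    def try_send(self, cycle):
        # forget flits that have left the current window
        while self.sent and self.sent[0] <= cycle - self.t_w:
            self.sent.popleft()
        if len(self.sent) < self.beta:
            self.sent.append(cycle)
            return True                # flit injected this cycle
        return False                   # budget exhausted: flit is stalled

# Hypothetical numbers: budget of 3 flits per 8-cycle window.
lim = TrafficLimiter(t_w=8, beta=3)
injected = [c for c in range(16) if lim.try_send(c)]
print(injected)  # bursts of beta flits, then silence until the window slides
```

The resulting injection pattern (a burst of β flits, then a gap) is exactly the arrival curve that the later WCTT analysis feeds into the router buffer.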
Application Model
● Applications on each cluster need to access the NoC:
  ● Exchanging messages
  ● Accessing off-chip memory
● Applications operate on a read-execute-write semantic
● Each application has a number of read requests and write requests
[Figure: timeline of one request: a read phase over the NoC and I/O subsystem (Δ_req, Δ_read), execution, and a write phase (Δ_write)]
Motivation
[Figure: applications on both sides contending for the shared NoC]
● Analysis of the NoC is non-trivial
● Many architectural features pose challenges (buffers, traffic limiters, routing, …)
Pessimistic estimates → larger task WCETs → less efficient platform usage
Contributions
● A NoC organization that reduces contention by partitioning
● A timing analysis for the partitioned NoC
● A method to configure the flow regulation on source nodes
Partitioning the NoC
● Avoid horizontal communication:
  ● Clusters communicate with the closest I/O subsystem
  ● Each cluster sends messages via the I/O subsystem:
    ● Most NoC packets target the loading of code
    ● Cluster-to-cluster messages go through the I/O subsystem
● 8 identical NoC partitions
[Figure: the NoC divided into 8 partitions, each connecting clusters to their closest I/O subsystem]
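One possible encoding of the partition assignment, under the assumption (not stated on the slide) that the 16 clusters are numbered row-major in a 4x4 grid and that pairs of vertically adjacent clusters share the DDR I/O subsystem closest to them:

```python
def partition(cluster_id):
    """Hypothetical assignment of a cluster to one of the 8 identical NoC
    partitions, assuming row-major numbering (id = 4*row + col) and that
    rows 0-1 use the North-DDR I/O and rows 2-3 the South-DDR I/O."""
    row, col = divmod(cluster_id, 4)
    side = "North-DDR" if row < 2 else "South-DDR"
    return (side, col)    # 2 sides x 4 columns = 8 partitions

print(partition(0), partition(4))  # ('North-DDR', 0) ('North-DDR', 0)
```

Under these assumptions, clusters 0 and 4 fall into the same partition, which is consistent with the evaluation later in the deck observing clusters 0 and 4 as one NoC partition.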
WCTT Analysis in the Partitioned NoC: Overview
● 3 cases to analyze:
  ● Sending a request message on the C-NoC: WCTT_CNoC
  ● Sending data on the D-NoC to the I/O subsystem: WCTT_CC→IO
  ● Receiving data on the D-NoC: WCTT_IO→CC
[Figure: Cluster A, Cluster B, and the I/O subsystem, with the flows for the compute-cluster-to-I/O and the I/O-to-compute-cluster cases]
WCTT Analysis in the Partitioned NoC: WCTT_CC→IO (1)
[Figure: Cluster A and Cluster B send data to the I/O subsystem; a packet experiences a flow-regulation delay at its source and a round-robin delay where the two flows merge]
WCTT Analysis in the Partitioned NoC: WCTT_CC→IO (2), The Traffic Limiter
● Based on what criteria should the bandwidth quota β be selected?
  ● The WCTT
  ● The buffer occupation in the router
● Below β_min, the WCTT is not minimal; above β_max, the buffer in the router overflows
[Figure: WCTT [cycles] and buffer occupation [flit] as functions of the flow-regulation budget β [flit]; curves for the WCTT, the maximum buffer backlog N_max, and the available buffer; β_min and β_max are marked]
WCTT Analysis in the Partitioned NoC: WCTT_CC→IO (3)
● Observations from the traffic-limiter settings:
  ● The buffer in the router transmits a flit in each cycle
  ● Faster injection at the source node has no impact on the WCTT

WCTT_CC→IO = (d_tx + d_RR) · (N − 1) + d_RR + C_NoC

● (d_tx + d_RR) · (N − 1): RR-blocking and transmission from the buffer of all but the last packet
● d_RR: RR-blocking of the last packet
● C_NoC: transmission of the last packet over the NoC without interference
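The bound can be sketched directly from its three annotated terms. The symbol names (d_tx for the per-packet transmission time from the buffer, d_RR for the round-robin blocking, C_NoC for the uninterfered NoC traversal) and the example numbers below are assumptions of this sketch, not values from the deck:

```python
def wctt_cc_to_io(n_packets, d_tx, d_rr, c_noc):
    """Worst-case traversal time for N packets from a compute cluster to
    the I/O subsystem, assembled from the slide's three terms:
      (d_tx + d_rr) * (N - 1): RR-blocking and buffer transmission of all
                               but the last packet
      + d_rr:                  RR-blocking of the last packet
      + c_noc:                 uninterfered NoC transmission of the last packet
    """
    return (d_tx + d_rr) * (n_packets - 1) + d_rr + c_noc

# Hypothetical example: 66-flit packets (62 payload + 4 header) at
# 1 flit/cycle, one equal-sized RR competitor, and a 10-hop path at
# 2 cycles/hop (switching + channel delay).
print(wctt_cc_to_io(n_packets=4, d_tx=66, d_rr=66, c_noc=66 + 10 * 2))
```

Note that only the last packet pays the path latency C_NoC; the earlier packets are pipelined behind it and only contribute their buffer and arbitration delays.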
Determining the Parameters for the Traffic Limiter (1)
● Two cases:
  ● Find β_min such that WCTT_CC→IO is minimal
  ● Find β_max such that the buffer in the router does not overflow
Determining the Parameters for the Traffic Limiter (2): β_min
● The buffer in the router:
  ● Flits arrive in the buffer, shaped by the traffic limiter
  ● Flits depart from the buffer, shaped by the RR-interference
● Set β_min such that the flits that arrive during one departure segment equal the flits that leave the buffer in the same time:

T_w + d_tx ≤ (β / d_tx) · d_RR + β

● Solvable by binary search, ILP, …
[Figure: cumulative arrival and departure curves (data [flit] over time [cycles]) with one departure segment marked]
Evaluation
● Experiments evaluate different aspects of the work:
  ● Measurements on the Kalray MPPA platform
  ● Case study of an engine management system
● All experiments are based on parameters of the Kalray MPPA:
  ● D-NoC packet payload = 62 flit
  ● C-NoC packet payload = 2 flit
  ● Header size = 4 flit
  ● Router: switching delay = 1 cycle, channel delay = 1 cycle, buffer size = 401 flit
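From these parameters one can derive how many D-NoC packets a given transfer needs. The 32-bit (4-byte) flit width assumed below is not stated on the slide and is an assumption of this sketch:

```python
import math

FLIT_BYTES = 4       # assumed flit width: 32 bits (not given on the slide)
DNOC_PAYLOAD = 62    # D-NoC packet payload [flit], from the deck
HEADER = 4           # header size [flit], from the deck

def dnoc_packets(payload_bytes):
    """Number of D-NoC packets and total flits on the wire (payload plus
    per-packet headers) needed to transfer `payload_bytes`."""
    flits = math.ceil(payload_bytes / FLIT_BYTES)
    packets = math.ceil(flits / DNOC_PAYLOAD)
    return packets, flits + packets * HEADER

# The 1 KB and 16 KB reads used in the evaluation:
print(dnoc_packets(1024), dnoc_packets(16 * 1024))
```

This packet count is the N that enters the WCTT_CC→IO bound, so larger reads pay the per-packet arbitration terms proportionally more often.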
Evaluation: Total Read Latency on the MPPA (1)
● Measuring the time to read [1, 16] KB of data from off-chip memory to a compute cluster
● Varying the number of clusters that access memory through the same I/O node: [16, 8, 4, 2] clusters
● The latencies on clusters 0 and 4 are observed; together they represent one NoC partition
● Each data point represents the maximum observed value out of 10000 samples
Evaluation: Total Read Latency on the MPPA (2)
[Figure: maximum observed read latency in ms (0 to 60) over payloads of 1 to 16 KB, for 16, 8, 4, and 2 contending clusters, measured on clusters C0 and C4]
Evaluation: Simulation-Based Case Study (1)
● Engine management system (EMS):
  ● 15 runnables with periods [5, 10, 20, 100] ms
  ● Footprints of [7076, 17424] bytes (code + data)
● Each runnable loads its footprint at the beginning of its execution and writes its footprint back to off-chip memory at the end
[Figure: two clusters running the EMS connected to the I/O subsystem; a reorder core (RC) on the I/O subsystem manages the memory requests]
Evaluation: Simulation-Based Case Study (2)
[Figure: per-message latencies [cycle] (0 to 10000) for messages M1 to M15, comparing the analysis bound against the maximum observed in isolation; top: writing to memory (WCTT_CC→IO); bottom: reading from memory (WCTT_CNoC + WCTT_IO→CC)]
Conclusions and Future Work
● The shared NoC is one of the main sources of interference
● It is difficult to analyze due to its many architectural features
● Novel NoC partitioning scheme to reduce interference and ease the analysis
● Tailored analysis for the partitioned NoC
● Configuration of the traffic limiter to avoid buffer overflows and to guarantee minimal transmission times
● Future work: focus on the memory accesses within the I/O subsystem; the handling of requests affects the overall latency of memory access

Thank you for your attention! Questions?