TKT-2431 SoC Design
Lec 10 - On-chip communication
Erno Salminen, Tero Arpinen
Department of Computer Systems, Tampere University of Technology
Fall 2010
Copyright notice
Part of the slides adapted from the slide set for course EE249 at University of California, Berkeley, by Alberto Sangiovanni-Vincentelli:
http://www-cad.eecs.berkeley.edu/~polis/class/lectures.shtml
and from Timo D. Hämäläinen, Managing On-Chip Communications, SoC Symposium, Tampere, 19.11.2003.
Copyright (2)
Part of the figures from:
L. Benini, G. De Micheli, "Networks on chips: a new SoC paradigm", Computer, vol. 35, no. 1, Jan. 2002, pp. 70-78.
V. Lahtinen, Design and Analysis of Interconnection Architectures for On-Chip Digital Systems, PhD Thesis, Tampere University of Technology, Department of Information Technology, June 2004. http://www.tkt.cs.tut.fi/research/daci/pub_open/lahtinen_thesis.pdf
W. Wolf, A.A. Jerraya, G. Martin, "Multiprocessor System-on-Chip (MPSoC) Technology", IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 27, no. 10, pp. 1701-1713, Oct. 2008.
Erno Salminen - Nov. 2010
Contents
- Problem statement
- Physical limitations
- Network-on-chip (NoC)
- Extra

See also:
E. Salminen, A. Kulmala, T.D. Hämäläinen, "Survey of Network-on-chip Proposals", white paper, OCP-IP, April 9, 2008, 13 pages. [online]: http://www.ocpip.org/socket/whitepapers/OCP-IP_Survey_of_NoC_Proposals_White_Paper_April_2008.pdf
E. Salminen, A. Kulmala, T.D. Hämäläinen, "On Network-on-chip comparison", Euromicro Conf. on Digital System Design, Lübeck, Germany, August 27-31, 2007, pp. 503-510. http://daci.digitalsystems.cs.tut.fi:8180/pubfs/fileservlet?download=true&filedir=dacifs&freal=Salminen_-_On_Network-on-chip_compar.pdf&id=82519
At first
Make sure that simple things work before even trying more complex ones.
Problem Statement - SoC Complexity
- SoC consists of heterogeneous components
- Varying communication requirements/profiles
- Not all components communicate with each other

[Figure: an SoC where Proc_1..Proc_N, Acc_1..Acc_N, Mem_1..Mem_N, and Periph_1..Periph_N are attached to a communication network]
Different requirements
1. Varying bandwidth (or throughput)
   - Amount of data transferred in unit time [MB/s]
   - High requirement between CPU and memory (high BW)
   - Low requirement between CPU and peripheral (low BW)
2. Different latency expectations

[Figure: CPU_1..CPU_N and Acc_1..Acc_N connected to Mem_1..Mem_N (high BW) and Periph_1..Periph_N (low BW)]
Characteristics of offered traffic load
1. Spatial: where the data go - are all sources similar?
   a) one destination: neighbor
   b) one destination: some node
   c) few destinations
   d) send to all
2. Temporal: average data rate
3. Temporal: when to transfer
   a) Short bursts of high transfer activity and long periods of inactivity
   b) Transfers with constant sizes and intervals

[Figure: spatial patterns a-d from a source node, and temporal patterns of data amount vs. time: very bursty, moderately bursty, and constant bitrate]
Basic metric: Latency
- Delay between start of transfer and completion:
  time(last data ejected) - time(first data enters)
  [n cycles for transferring d words]
- Interrupts usually require low latency
- Cache fills require low latency
- Real-time systems require guaranteed latency (always below some limit)
- Stream data (voice, video) may require constant latency (low jitter)
Measuring load-latency behavior
- Traffic generator mimics IPs: sends data, receives data
- One should
  1. include the latency of the network interface (NI)
  2. exclude the headers when calculating traffic load
  3. measure the latency of whole transfers (which may be several packets, i.e. at least one full packet, not just header latency)
  4. include an "infinite" buffer at the source to avoid throttling
[Salminen, On the credibility of load-latency measurements, SoC, 2008]
Measured load-latency curve
- Network saturates when the traffic load gets too high: latency approaches infinity
- Certain bounds can be derived analytically
- Of course, the goal is minimum latency and a maximum saturation point
[Salminen, On the credibility of load-latency measurements, SoC, 2008]
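The saturating shape of a load-latency curve can be illustrated with a simple M/M/1 queueing approximation. This is only an illustration of the qualitative behavior, not the measurement method of the cited paper:

```python
def mm1_latency(load, service_time=1.0):
    """Average time in an M/M/1 queue (waiting + service).

    Latency grows without bound as the offered load approaches
    the saturation point (load -> 1), like a measured NoC curve.
    """
    if not 0 <= load < 1:
        raise ValueError("queue is saturated for load >= 1")
    return service_time / (1.0 - load)

for load in (0.1, 0.5, 0.9, 0.99):
    print(f"load {load}: latency {mm1_latency(load):6.1f} cycles")
```

Real networks saturate below load 1.0 because of contention and protocol overhead, which is why the slide recommends excluding headers when computing the offered load.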
Physical limitations
ITRS 2003: Interconnect
Chip cross-section
- Several metal layers - less congestion
- Hierarchical scaling: wires on top levels are wider and taller than on lower levels
- Top layers for power supply, clock, global signals

[Figure: chip cross-section from transistors up through local, intermediate, and global wiring layers]
ITRS 2003: Interconnect (2)
Note! Very important!
- Delay of global wires does not scale with technology

[Figure: relative delay vs. technology generation for gates, local signals, global signals, and global signals with repeaters (bigger area and energy)]
Several clock domains
- Not possible/practical to use the same clock in every component
- GALS - Globally Asynchronous, Locally Synchronous
  - Components have local clocks
  - Communication needs handshaking/synchronization

[Figure: SoC where components run at different frequencies, e.g. processors and memories at high frequency, peripherals at low frequency]
Energy breakdown forecast
[Figure: forecast of energy breakdown; compare computation vs. communication energy]
[Mattan Erez, Stream Architectures - Programmability and Efficiency, Tampere SoC, Nov. 17, 2004]
Localization
- Communication must be localized to avoid long wires
  - Long wires consume much energy, are slow, prone to error, and cause routing congestion
- Several small components instead of a few large ones
- Communication between non-neighboring components requires many hops
[Mattan Erez, Stream Architectures - Programmability and Efficiency, Tampere SoC, Nov. 17, 2004]
Reliability problems
- "Synchronization failures between clock domains will be rare but unavoidable" - Benini
- Electrical noise due to crosstalk, electromagnetic interference, radiation...
  - Data errors or upsets, soft errors
- Data transfers become unreliable and nondeterministic
- Design needs both deterministic and stochastic models
Achieving reliability
- Today, designers use physical techniques to overcome reliability problems:
  - Wire sizing
  - Length optimization
  - Repeater insertion
  - Shielding
  - Data coding
  - Bunch of others...
  - Huge design effort required
- In the (near) future, 100% reliability on the physical level cannot be afforded anymore
- Reliability must be increased with additional HW or SW layers:
  - Error detecting/correcting codes
  - Retransmissions
  - Request/acknowledge and time-out counters
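As a sketch of the layered approach (detection code plus retransmission), the following combines a single parity bit with a bounded retry loop. The framing and function names are made up for illustration and do not come from any particular NoC; a single parity bit detects only odd numbers of flipped bits:

```python
import random

def parity(word: int) -> int:
    """Even parity bit over a 32-bit word - a minimal error-detecting code."""
    return bin(word & 0xFFFFFFFF).count("1") & 1

def send_with_retry(word, channel, max_retries=5):
    """Send word + parity over an unreliable channel, retransmit on error.

    A failed parity check at the receiver means no acknowledge arrives,
    so the sender times out and retransmits (modeled by the loop).
    """
    for attempt in range(max_retries):
        data, par = channel(word, parity(word))
        if parity(data) == par:          # receiver's parity check passes
            return data, attempt
    raise TimeoutError("link too unreliable")

def noisy_channel(word, par, flip_prob=0.3):
    """Flips one random bit with probability flip_prob (a soft error)."""
    if random.random() < flip_prob:
        word ^= 1 << random.randrange(32)
    return word, par

random.seed(1)
data, retries = send_with_retry(0xCAFE, noisy_channel)
print(hex(data), "after", retries, "retransmissions")
```

Real NoC link layers typically use stronger codes (CRC, Hamming) so that multi-bit errors are caught or even corrected without a retransmission round-trip.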
Network-on-chip (NoC)
Network-on-Chip (NoC)
- Communication network on chip
- NoC motivation:
  1. High fab cost and effort in traditional VLSI - design a general-purpose platform
  2. Flexibility - for changing application needs
  3. Concurrency in transfers
  4. Only short signal wires due to power and delay problems
  5. On-chip wires are no longer reliable
- Usually a packet-switched, multi-hop network
Differences between Multiprocessors and SoC

Multiprocessor systems (past) | System-on-Chip (portable device)
Scaleability important after fab (increase nodes) | Scaleability an issue only at design time (reuse, easy addition of nodes)
Load balancing and even distribution of computation important for maximum performance | Energy consumption important, idle nodes must be shut down
Communication network used as means of balancing computation and communication (both adjusted for optimal performance) | Computation might already be fixed per node (functional partition); network serves nodes (only network adjusted)
Dataflow computing | Computation is very heterogeneous, both dataflow and control style
In principle any node can compute a given task | Execution of various applications clustered within SoC (specialized nodes)
Much experience and well-established research of routing, switching, scaleability - yet some research seems to be "re-inventing the wheel" of past multiprocessor research | New challenge: energy saving combined with tailoring according to applications
Micronetwork protocol stack
- Layers are specialized and optimized according to application (domain)

[Figure: protocol stack with rising abstraction: transport layer splits long transfers into packets and reorders them; network layer handles routing; data link handles arbitration and packetization to increase reliability; HW-dependent SW sits above the hardware layers]
NoC terminology
- Processing elements exchange messages
- Network interface converts messages to/from network-specific packets/streams
- Packet consists of several flits (≈ words)
- Routers communicate via ports, and ports on the boundary of the whole network are called terminals

[Figure: agent(0) = processing element + network interface, connected through router(0), router(1), router(2) (degree = 4) and links to agent(1); a message or stream is split into packets (pkt), packets into flits (fl), and flits into phits (ph)]
Abbreviations: fl = flit, flow control unit; ph = phit, physical unit; pkt = packet
Design choices of NoC
Basic considerations deal with:
1. Structure
   - topology - logical structure of routers and links (floorplan defines the physical layout)
   - router design
2. Control
   - routing - which way to take
   - flow control and switching - when to transmit
Homogeneous network
+ replication effect: solve realization issues once and for all
- less flexible
- Problematic if processing units are heterogeneous
  - assumes uniform size for components and hence either
    a) wastes area, or
    b) components have to be split
[H. Corporaal, Advanced Computer Architecture 5Z008 - Multiprocessors & Interconnect, course material, 2003]
Heterogeneous network
- common in contemporary SoCs
- better fit to application domain - better performance
- components are not uniformly sized
- hierarchical structure
- Are ASICs possible in the future anymore?
[H. Corporaal, Advanced Computer Architecture 5Z008 - Multiprocessors & Interconnect, course material, 2003]
Network topology
Defines
- the components (e.g. routers)
- the connections (e.g. each router connected to 4 neighbours)
Vast number of topologies proposed in literature - but there's no free lunch!

[Figure: example topologies; b = bus, hb = hierarchical bus, r = ring, p = point-to-point, ft = fat tree, x = crossbar, c = custom, t = 2-D torus]
Network topology (2)
Can be modeled with graphs:
- node = router (+ processing unit)
- edge = data stream
- Number of nodes denoted with N
- Average path length L
  - Avg. number of edges between all node pairs in the graph
  - Small L desired for small latency
- Average degree <k>
  - Avg. number of edges in each switch
  - Large <k> may decrease L, but implementation also gets more complex
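The two graph metrics above are easy to compute for small example topologies. A minimal sketch (breadth-first search from every node; the helper name is made up for this example):

```python
from collections import deque

def avg_path_length_and_degree(adj):
    """Average path length L and average degree <k> of an undirected
    graph given as {node: set_of_neighbors}."""
    n = len(adj)
    total_dist, pairs = 0, 0
    for src in adj:                       # BFS from every node
        dist = {src: 0}
        q = deque([src])
        while q:
            u = q.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    q.append(v)
        total_dist += sum(dist.values())
        pairs += n - 1
    L = total_dist / pairs                # mean over all ordered pairs
    k = sum(len(nb) for nb in adj.values()) / n
    return L, k

# 4-node bidirectional ring: each node linked to its two neighbours
ring = {i: {(i - 1) % 4, (i + 1) % 4} for i in range(4)}
L, k = avg_path_length_and_degree(ring)
print(f"L = {L:.2f}, <k> = {k:.1f}")   # L = 1.33, <k> = 2.0
```

Note that the L values quoted on the following slides may use a different convention (e.g. counting routers or network interfaces on the path), so they need not match this hop-count definition exactly.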
Metric: Bisection bandwidth
- When the design is partitioned into two (nearly) equal halves, it is the minimum number of wires which must cross between the halves, considering all possible partitions
  - Number of nodes in the halves differs at most by 1
  - Also other definitions exist...
- A high number means a higher number of possible routes and hence increased bandwidth, flexibility, and possibly fault-tolerance
- Should increase with the number of nodes in scalable networks
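The definition above translates directly into a brute-force computation for the small example graphs on these slides (exponential in general, so only a sketch):

```python
from itertools import combinations

def bisection_width(nodes, edges):
    """Minimum number of edges cut over all (nearly) equal bipartitions.

    'edges' is a list of (u, v) pairs. Tries every choice of one half
    and counts the edges crossing to the other half.
    """
    nodes = list(nodes)
    n = len(nodes)
    best = None
    for half in combinations(nodes, n // 2):   # one half; the rest is the other
        half = set(half)
        cut = sum((u in half) != (v in half) for u, v in edges)
        best = cut if best is None else min(best, cut)
    return best

# 4-node ring 0-1-2-3-0: any equal split cuts at least 2 links
ring_edges = [(0, 1), (1, 2), (2, 3), (3, 0)]
print(bisection_width(range(4), ring_edges))   # 2
```

A ring always has bisection width 2, independent of N, which is exactly why the slide calls for bisection bandwidth that grows with the number of nodes in scalable networks.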
Generic router
- Forwards data from input ports to output ports
- FIFOs can be on either side of the crossbar
  - 1 FIFO per port is the most common
  - virtual channels allow multiple FIFOs per port
- Area and delay increase rapidly with the number of ports

[Figure: generic router with input ports feeding FIFOs, a crossbar to the output ports, controlled by routing logic and an arbitrator]
Routing algorithm
Selects the route from source to destination
1. Deterministic
   - Same route always used between source and destination
   - e.g. 2-D mesh: first find the correct row, then the correct column
   - All packets arrive in order
   - One blocked (or faulty) link/router blocks all packets on that route
2. Adaptive
   - Route varies according to blockage
   - Better performance (at least when reordering is neglected)
   - Better fault-tolerance
   - Deadlock avoidance needs extra care
   - Data may arrive out of order
     - Reordering buffers required at the receiver
     - Buffers may consume large area/energy
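The deterministic 2-D mesh example is usually called dimension-ordered (XY) routing. A minimal sketch of the route computation, with nodes as (x, y) coordinates:

```python
def xy_route(src, dst):
    """Deterministic XY routing on a 2-D mesh: correct the x coordinate
    first, then the y coordinate. Returns the list of visited nodes."""
    (x, y), (dx, dy) = src, dst
    path = [(x, y)]
    while x != dx:                 # move along the first dimension
        x += 1 if dx > x else -1
        path.append((x, y))
    while y != dy:                 # then along the second dimension
        y += 1 if dy > y else -1
        path.append((x, y))
    return path

print(xy_route((0, 0), (2, 1)))
# [(0, 0), (1, 0), (2, 0), (2, 1)]
```

Because every source-destination pair always yields the same path, packets cannot overtake each other, which gives the in-order delivery property noted above.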
Switching
1. Store-and-forward switching
   - Data forwarded when the whole packet has been received
   - Whole packet buffered - increases area and latency
2. Virtual cut-through:
   - Data forwarded ASAP
   - Whole packet buffered if the output is blocked
3. Wormhole:
   - Data forwarded ASAP
   - Buffer sizes can be independent of the packet size
   - Reserves the whole transfer path and hence increases contention
- Some schemes drop packets when contention is high
  - Highly nondeterministic
  - Acknowledges required (roundtrip latency, buffers for retransfers)
  - Not recommended in general
- Buffering has a big impact on NoC performance and router area
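The latency cost of store-and-forward can be made concrete with the usual zero-load (no contention) approximations: store-and-forward pays the full packet serialization at every hop, while wormhole (and virtual cut-through) pays it only once, behind the header. A small sketch:

```python
def store_and_forward_latency(hops, packet_flits, cycles_per_flit=1):
    """Each router buffers the whole packet before forwarding it."""
    return hops * packet_flits * cycles_per_flit

def wormhole_latency(hops, packet_flits, cycles_per_flit=1):
    """Header flit pays one hop delay per router; the body pipelines
    behind it. Zero-load approximation; virtual cut-through is the same."""
    return hops * cycles_per_flit + (packet_flits - 1) * cycles_per_flit

hops, flits = 5, 16
print(store_and_forward_latency(hops, flits))  # 80 cycles
print(wormhole_latency(hops, flits))           # 20 cycles
```

For a single hop the two are equal; the gap grows with both hop count and packet length, which is why multi-hop NoCs rarely use store-and-forward.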
Quick terminology quiz
What is in common with the following terms?
- Koala bear
- Whale fish (valaskala in Finnish)
- Wormhole routing
Such things do not exist, although many people talk about them:
- Koala is a marsupial
- Whale is a mammal
- Wormhole is a switching policy
Example topologies
(Shared multimaster) bus
- Bus = set of signals connected to all devices
- Shared resource
  - One connection between devices reserves the whole interconnection
  - Bandwidth shared among devices
- Bandwidth may be scaled by adding links (multiple bus)
- Most common SoC network
  - Low implementation cost, simple
- Long signal lines problematic

Single bus: N = 16, L = 1, <k> = -
Multiple bus: N = 16, L = 1, <k> = -
Bus arbitration / address decoding
- Arbitration decides which master can use the shared resource (e.g. bus or memory)
  - Single-master system does not need arbitration
  - E.g. priority, round-robin, TDMA
  - Two-level: e.g. TDMA + priority
  - May be pipelined with the previous transfer
- Decoding is needed to determine the target
  - Central / distributed schemes
  - Address and data are broadcast to every node
  - Decoder selects which node reads the data or responds
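Of the arbitration policies listed above, round-robin is the simplest fair one. A behavioral sketch (not any specific bus standard's arbiter): the search for the next grant starts just after the previously granted master, so every requester gets a turn.

```python
def round_robin_arbiter(requests, last_grant):
    """Grant the first requesting master after the previously granted one.

    'requests' holds one truthy/falsy value per master. Returns the index
    of the granted master, or None if nobody is requesting.
    """
    n = len(requests)
    for offset in range(1, n + 1):       # rotate the search start
        candidate = (last_grant + offset) % n
        if requests[candidate]:
            return candidate
    return None

grant = 0
for cycle, req in enumerate([[0, 1, 1], [1, 0, 1], [1, 1, 0]]):
    grant = round_robin_arbiter(req, grant)
    print(f"cycle {cycle}: grant -> master {grant}")
```

A fixed-priority arbiter would instead always scan from index 0, which is simpler but can starve low-priority masters under heavy load.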
Centralized / Distributed

[Figure 2. Centralized vs. distributed control.
a) Centralized: masters M1-M3 send request signals to a single arbiter and receive grant; a single decoder drives select to slaves S1-S3.
b) Distributed: each master agent has its own arbiter/decoder and each slave agent its own decoder.
M = master, S = slave]
Complex bus topologies
- Hierarchical bus - several bus segments connected with bridges
  - Fast access as long as the target is in the same segment
  - Requires locality of accesses
  - Theoretical max. speed-up = number of segments
  - Segments either circuit- or packet-switched together
    - Packet-switching provides more parallelism with added buffering
- Split-bus
  - No data storage - only three-state buffers
  - If the switches are non-conducting, smaller effective capacitance and, hence, smaller energy

Hierarchical bus (chain): N = 16, L = 2.3, <k> = 2
Hierarchical bus (chain + tree): N = 16, L = 2.1, <k> = 2.5
[Figure: split-bus with three-state buffers between segments]
Other topologies

Ring: N = 16, L = 6.3, <k> = 3
- Simple layout
- Unidirectional ring may result in long latency
- Good for pipelines

3-D hypercube: N = 8, L = 3.7, <k> = 8
- 3-D topologies are hard to map on a 2-D silicon die

Fully connected, point-to-point network: N = 16, L = 1, <k> = -
- Highest performance
- Clearly not a scalable approach
Topologies: mesh and torus
- 2-D mesh and torus are very popular
- Simple layout for uniformly sized nodes
- Wrap-around wires in the torus need special attention

2-D mesh: N = 16, L = 4.7, <k> = 4
2-D torus: N = 16, L = 4.1, <k> = 5
Topologies: Tree
- Traditional tree has bisection bandwidth = 1
  - Bottleneck for uniform traffic
  - Does not matter when the traffic is localized
- Fat tree has more (or wider) links near the root
  - Becoming more popular as a NoC topology
- Trees are also constructed so that each node is a processing node

Rooted, complete, binary tree: N = 16, L = 6.5, <k> = 2.9
Fat tree with butterfly elements and fanout of 2 (binary fat tree): N = 16, L = 6.5, <k> = 3.5
Topologies: static analysis
- Some basic properties may be analyzed statically
- Simulation with real applications preferred (i.e. dynamic analysis)

Lahtinen 2004, Table 3.2: Performance
Network                         | Parallel transactions | Longest path | Bisection bandwidth | Links
Single bus                      | 1                     | 1            | 1                   | Bi
Multiple bus                    | e (e ≤ N)             | 1            | e                   | Bi
Hierarchical bus (chain)        | e (e ≤ N)             | e (e ≤ N)    | 1                   | Bi
Crossbar                        | N                     | N            | N-1                 | Bi
One-sided crossbar              | N                     | 2N-1         | N/2                 | Bi
Binary tree                     | N                     | 2log2(N)     | 1                   | Bi
Fat tree (fanout 2)             | N                     | 2log2(N)     | N                   | Bi
Ring                            | N                     | N/2+2        | 2                   | Bi
3-D hypercube                   | N                     | log2(N)+2    | N/2                 | Bi
2-D mesh                        | N                     | 2N^(1/2)     | N^(1/2)             | Bi
2-D torus                       | N                     | N^(1/2)+2    | 2N^(1/2)            | Bi
Point-to-point, fully connected | N                     | 1            | (N/2)*(N/2)         | Bi
Omega network (MIN)             | N/2                   | log2(N)      | N                   | Uni

Lahtinen 2004, Table 3.3: Implementation costs
Network                         | Number of switches | Number of wires | Links
Single bus                      | 0                  | 1               | Bi
Multiple bus                    | 0                  | e               | Bi
Hierarchical bus (chain)        | e-1                | e               | Bi
Crossbar                        | N^2/4              | N^2/2           | Bi
One-sided crossbar              | N^2/2              | (N^2-N)/2       | Bi
Binary tree                     | N-1                | 2(N-1)          | Bi
Fat tree (fanout 2)             | N*log2(N)          | 2N*log2(N)      | Bi
Ring                            | N                  | 2N              | Bi
3-D hypercube                   | N                  | N+(N/2)*log2(N) | Bi
2-D mesh                        | N                  | 3N-2N^(1/2)     | Bi
2-D torus                       | N                  | 3N              | Bi
Point-to-point, fully connected | 0                  | (N^2-N)/2       | Bi
Omega network (MIN)             | (N/4)(log2(N)-1)   | (N/2)*log2(N)   | Uni
Daytona (2001), OMAP (2004), MPCore (2005)
[Figure: block diagrams of the three chips; interconnects shown: single bus, two buses, single bus]
[W. Wolf et al., "Multiprocessor System-on-Chip (MPSoC) Technology," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 27, no. 10, pp. 1701-1713, Oct. 2008]
Industrial example: Viper by Philips (2001)
[Figure: Viper block diagram with four buses]
[S. Dutta et al., "Viper: A multiprocessor SOC for advanced set-top box and digital TV systems," IEEE Design & Test of Computers, vol. 18, no. 5, pp. 21-31, Sep.-Oct. 2001]
ST Nomadik (2003)
[Figure: Nomadik block diagram with multiple buses]
Cell BE by IBM/Sony/Toshiba (2005)
[Figure: Cell BE with four rings]
[F. Khunjush, N.J. Dimopoulos, "Extended characterization of DMA transfers on the Cell BE processor," IEEE International Symposium on Parallel and Distributed Processing (IPDPS), pp. 1-8, 14-18 April 2008]
See also: D. Shippy, M. Phipps, The Race for a New Game Machine: Creating the Chips Inside the XBox 360 and the Playstation 3, Citadel, 2009
Tile64 by Tilera (2008)
- 2-D mesh with 4 DDR controllers for external memories
- Tile = 3-wide 32b VLIW, 750 MHz
- 90 nm, 615M transistors, 11 W
[S. Bell et al., "TILE64 Processor: A 64-Core SoC with Mesh Interconnect," ISSCC 2008]
Faust (2009)
- Modified 2-D mesh, asynchronous NoC
[E. Beigne et al., "An Asynchronous Power Aware and Adaptive NoC Based Circuit," JSSC, 2009]
Conclusion
- SoC has many components with different requirements
- Wire delays and power consumption are becoming very problematic
  - Big difference between local and global (or off-chip) communication
- Fully synchronous approach becoming unfeasible
- Network-on-chip = multi-hop on-chip network
  - Often packet-switched
  - Buffering, routing, and topology are important design decisions
NoC Survey
Note: All slides in this set are lecture material!
Survey of Network-on-chip proposals [2008]
- This paper gives an overview of the state of the art regarding network-on-chip (NoC) proposals.
- The NoC paradigm replaces dedicated, design-specific wires with a scalable, general-purpose, multi-hop network. Numerous examples from the literature are selected to highlight the contemporary approaches and reported implementation results. The major trends of NoC research and aspects that require more investigation are pointed out.
- A packet-switched 2-D mesh is the most used and studied topology so far. It is also a sort of an average NoC currently. Good results and interesting proposals are plenty.
- However, large differences in implementation results, vague documentation, and lack of comparison were also observed.
http://www.ocpip.org/uploads/documents/OCP-IP_Survey_of_NoC_Proposals_White_Paper_April_2008.pdf
Basic NoC properties
--- clip (39 lines omitted in the slide show) ---
NoC implementations
--- clip (14 lines omitted in the slide show) ---
Average NoC 2008
[Salminen et al., Survey of NoC proposals, OCP-IP, 2008]
Average NoC 2008 (2)
[Salminen et al., Survey of NoC proposals, OCP-IP, 2008]
Case Study
Managing Interconnection Complexity in Heterogeneous IP Block Interconnection (HIBI)
Overview of Managing On-Chip Communications
[Figure: interconnect evolution from dedicated point-to-point links, through a single bus, hierarchical bus structures, and regular multi-hop topologies, to customized multi-hop topologies. Axes: number of IP blocks / scaleability, latency & BW, flexibility, and network reuse. The extremes range from simple, always-guaranteed, limited, IP-block-specific links to very complex, design-once, general-purpose networks with best-effort/predictable service and arbitrary structure.]
Lessons Learned
- Many communication networks have been studied at TUT
  - On-chip communication research started 1997
- A regular topology can well be fitted to an algorithm-specific comp/comm-balanced implementation
  - In the general case there is no optimal topology
- Communication-centric design was successfully conducted for performance
  - Important to exploit features of the application(s) to optimize the interconnection
- Established parallel processing doctrines can be applied to SoC
  - The SoC challenge is heterogeneity in computation
Interconnection Implementation View
- Make lowest-level data transfer mechanisms simple and efficient
  - Minimum number of signals
  - "Every clock edge carries useful data in transaction"
- Perform all high-level operations on basic mechanisms
  - Layered protocol model, OCP compatible
  - Message passing
- Use identical HW modules to compose the overall interconnection
  - Translate IP-specific communication operations to the network
  - Support all (practical) topologies
  - No limits to the number of IP blocks (whole design)
  - Support (re-)configurability
  - Fit to all communication needs - from memories to peripherals
"Gives body to build interconnect"
System Design View
Make the interconnection aware of application functionality
A) System design time
  - Communication profiled from application processes
  - Clustering: localization of communication
  - Allocation of communication resources (segments, buffers)
  - Optimization of non-reconfigurable parameters
  - Initial QoS and other transfer parameters
B) Run time
  - Utilize knowledge of predictable communication events if available
    - Guaranteed QoS in transfers
  - Track communication - change QoS & other parameters if required
  - Totally change the mode of operation if required
- HIBI Design Flow is 80% of the HIBI interconnect scheme
"Gives brains to the communication"
HIBI Identical Interconnection Modules
- HIBI wrapper is the only building block used everywhere in the interconnection
  - Between network and IP blocks
  - Between network segments
- Wrapper is parametrizable, modular, and configurable
- Asynchronous FIFO buffering

[Figure: HIBI network where IPs (P1..PN, Acc1..AccN, Mem1..MemN) connect through HIBI wrappers; the IP side uses a FIFO / OCP interface]
HIBI Network
- HIBI network consists of bus segments and bridges
  - Transfers in a segment: synchronous, circuit-switched
  - Transfers across bridges: asynchronous, packet-switched
  - Scales from a serial point-to-point link to an arbitrary topology
- Identical signals between wrappers on the network side
  - No dedicated point-to-point signals
  - All signals shared within a network segment
  - Wrapper layout is independent of the number of agents
- Totally distributed arbitration
  - No central arbiter
  - Each wrapper is aware of communication details
HIBI Network Example
[Figure: several bus segments, each connecting IP blocks through HIBI wrappers; segments in different clock domains are joined by bridges built from pairs of wrappers]
Bus latency
Total latency consists of several phases. From: K. Kuusilinna, PhD Thesis, TUT, 2001.

1. Arbitration latency
   - Action: request bus ownership; wait for higher-priority transactions to complete / arbitration
   - Methods: central arbiter, daisy chain, wired-OR, connectionless arbitration; round-robin, hierarchical round-robin, time slot, fixed priority, adaptive
   - Waiting time may be long during high contention
2. Initial latency
   - Action: bus ownership granted; begin transaction; transfer first data
   - Methods: address/data multiplexing, handshaking; wait for master ready / wait for target ready
3. Subsequent data latency
   - Action: transfer data until all data has been transferred or a limit for data transfers per burst is reached
   - Methods: wait for master ready / wait for target ready
   - Optimizing this phase has the biggest impact in long transfers
4. Turn-around latency
   - Action: drive or wait for the bus to settle to the idle state

Figure: Bus latency
HIBI Quality of Service
- TDMA (time division multiple access) with freely run-time adjustable frame length, slot durations, and allocations
- Re-synchronization to application phase
- Also traditional priority/round-robin

[Figure: repeating time frames divided into time slots allocated to agents (e.g. A1, A2, A3) plus competition slots arbitrated with priority or round-robin]
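The TDMA scheme above can be sketched as a lookup from the cycle counter into a frame table. This is a behavioral illustration only - the frame contents and function name are made up, and real HIBI wrappers evaluate this in hardware from their configuration memory:

```python
def tdma_owner(cycle, frame):
    """Which agent owns the bus on a given clock cycle.

    'frame' is a list of (agent, slot_length) pairs; the frame repeats
    forever. Because the frame is plain data, slot durations and
    allocations can be adjusted at run time by rewriting the list.
    """
    frame_len = sum(length for _, length in frame)
    t = cycle % frame_len
    for agent, length in frame:
        if t < length:
            return agent
        t -= length
    raise AssertionError("unreachable: t < frame_len")

# frame: A1 gets 3 cycles, A2 gets 1, then 2 competition cycles
frame = [("A1", 3), ("A2", 1), ("competition", 2)]
print([tdma_owner(c, frame) for c in range(8)])
# ['A1', 'A1', 'A1', 'A2', 'competition', 'competition', 'A1', 'A1']
```

During the competition slots, any agent may contend for the bus using the priority or round-robin policy, so allocated slots give guaranteed bandwidth while competition slots absorb bursty best-effort traffic.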
HIBI Basic Transfer
- Pipelined with arbitration
- Split transactions
- Burst transfers
- No wait cycles allowed
- Non-pre-emptive transfers
- QoS is guaranteed with TDMA or with a combination of Send Max + Priority/Round-Robin

[Figure: pipelined transfers on the bus over time: write address + write data, read-request address + read-request data, and return address + return data interleaved; a request and its later response form a split transaction]
HIBI Wrapper Structure (v.2)
[Figure: wrapper with HI/LO priority tx FIFOs and HI/LO priority rx FIFOs between the IP signals (in/out) and the HIBI signals (in/out); a mux/demux pair, Tx FSM, Rx FSM, address decoder, and configuration memory]
Wrapper Configuration Memory
- Stores all information for distributed arbitration
  - Permanent: ROM, 1 page
  - Semi run-time configurable: ROM with several pages
  - Fully run-time configurable: RAM, with pages

[Figure: configuration pages multiplexed to the current configuration values; time-slot logic with a cycle counter drives the time-slot signals; new configuration values are demultiplexed into a page]
HIBI Wrapper Area in ASIC
[Figure: wrapper gate count (up to ~35 000 gates) for 8/16/32/64-bit data widths with ROM- and RAM-based configuration memories, for three configurations: 3/3 lo-prior and 0/0 hi-prior FIFOs with a 1-page memory; 5/5 lo-prior and 5/5 hi-prior FIFOs with a 1-page memory; 10/5 lo-prior and 10/5 hi-prior FIFOs with a 2-page memory]
Runtime comparison
[Salminen et al., SAMOS 2005]
Other notes on NoC
Network topology categoriesNetwork topology categories1. Static networks utilize only point-to-point or
shared connection lines2. Dynamic networks use switches (or routers)
for communicationa) Direct = each processing node connected to
switchb) Indirect = some switches are not connected
directly to any processing node
Problems with Current NoC Discussion
 What is "NoC"? No common definition
  Something new, good by definition (needs no proof), ...
 General purpose, but to what extent?
  Arbitrary connectivity between any nodes? Uniform overall transfer distribution?
 Discussion about "optimal topology"
  Multiprocessor architectures for scientific computations? Can massive fine-grain parallelism be utilized in realistic SoC applications?
 Copying computer network ideas without criticism
  In-network data buffering, routing tables and algorithms: compare to current TCP/IP or past ATM routers!
 Toy test case applications
  Billion transistors, yet executes a single FFT? Common benchmarks should be designed!
Wiring hierarchy
 How far can a signal reach in one local clock cycle?
 Depends on
  frequency (i.e. duration of the clock cycle)
  wiring parameters (layer, width, height, density, shielding)
 Not far anyway...
 Global wires will function as lossy transmission lines
 RC models of today become inaccurate; 3-D modeling is s-l-o-w and difficult
[Figure: cross-section of the wiring hierarchy with global, intermediate, and local layers]
[H. Corporaal, Advanced Computer Architecture 5Z008 - Multiprocessors & Interconnect, course material, 2003]
Crosstalk impact
 Long, fast-switching wires close to each other
 Switching on neighbor wires affects delay
 Delay on wire 4 shown in table 2
P. Liljeberg et al., Self-timed Approach for Noise Reduction in NoC, in "Interconnect-centric design for advanced SoC and NoC", Kluwer, 2004
Transaction latency components
Scalable Multiprocessors, lecture slides, http://www.cs.princeton.edu/courses/archive/spr07/cos598A/
Impact of DMA
[Figure: i) agent block diagram — CPU core with instruction and data memories, a DMA unit, a network interface, and other peripherals]
[Figure: ii) execution timelines with and without DMA for three cases:
 a) short comm time — w/o DMA: comp, comm, comp, ...; w/ DMA: comm hidden under comp
 b) equal comp and comm time — w/ DMA: comm of each block overlaps comp of the next
 c) long comm time — comm dominates; w/ DMA the overlap helps but comm sets the total time]
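A simple back-of-the-envelope model of the timelines in the figure: without DMA the CPU alternates computation and communication, while with DMA the transfer of one block overlaps the computation of the next. The model below is a sketch under that assumption, not a cycle-accurate account:

```python
def total_time(n_blocks, comp, comm, with_dma):
    """Total execution time for n_blocks units of work (illustrative model).

    comp = computation time per block, comm = communication time per block.
    Without DMA the CPU serializes them; with DMA, the comm of block i
    overlaps the comp of block i+1, so the steady-state cost per block
    is max(comp, comm).
    """
    if not with_dma:
        return n_blocks * (comp + comm)
    # The first comp cannot be overlapped and the last comm cannot be hidden
    return comp + (n_blocks - 1) * max(comp, comm) + comm
```

With equal comp and comm (case b), the saving approaches half the runtime for long runs; with short comm (case a) almost all communication is hidden; with long comm (case c) the benefit shrinks because communication dominates either way.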
Retransfer buffers
 If packets are dropped or corrupted in delivery, they (usually) have to be retransferred
 Variable latencies are problematic: is the packet dropped or just having a longer latency?
  If the time-out latency is exceeded, the packet is assumed to be missing
 Source must store packets until it receives an acknowledgement of successful transfer
  Sending an acknowledgement after each packet results in a small buffer but (at least) double latency
  Sending an ack after every N packets requires bigger buffers but gives better performance
[Figure: a) ack for each packet: source keeps a 1-packet buffer; latency per pkt = send_latency + ack_latency
 b) ack for every N packets: source keeps an N-packet buffer; destination returns ack (ok, ok, fail, ok); latency per pkt = (N*send_latency + ack_latency) / N]
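The two per-packet latency figures in the slide follow from one formula, with n = 1 reproducing the ack-per-packet case. A quick check with hypothetical cycle counts:

```python
def per_packet_latency(send_latency, ack_latency, n):
    """Average latency per packet when one ack covers n packets.

    The source must also buffer n packets while awaiting the ack, so this
    single formula captures the buffer-size vs latency trade-off above.
    """
    return (n * send_latency + ack_latency) / n
```

For example, with send_latency = ack_latency = 10 cycles, ack-per-packet costs 20 cycles per packet, while ack-per-4-packets costs 12.5 cycles per packet at the price of a 4-packet source buffer.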
Reordering buffers
 Packets arriving out-of-order may require huge reordering buffers
 Sometimes processing units may accept out-of-order delivery, or buffers can be integrated with the internal memory of the processing unit
 If an ack is sent after 4 packets, a buffer for 4 packets is needed
 Furthermore, separate buffers are needed for each source as data may be received in an interleaved manner
  E.g. (pkt_<n>_<src>) received: pkt_1_1, pkt_4_1, pkt_4_2, pkt_3_3...
 E.g. if an ack is sent after every N packets and there are S sources,
  reorder buffer size = N*S packets
[Figure: a) ack for each packet forces in-order delivery — destination needs only one buffer per source
 b) ack for every N packets — destination needs N buffers per source]
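The interleaved, out-of-order arrival pattern above can be handled with one reorder buffer per source that releases packets only in sequence order; worst-case occupancy is then N packets per source, i.e. N*S in total. A minimal sketch (class and method names are illustrative, not any particular NoC's implementation):

```python
class ReorderBuffer:
    """Per-source reorder buffer: releases packets strictly in sequence order.

    Packets from different sources may interleave arbitrarily; each source
    gets its own pending map, matching the N-buffers-per-source figure.
    """

    def __init__(self):
        self.pending = {}   # src -> {seq: payload} of out-of-order packets
        self.next_seq = {}  # src -> next expected sequence number

    def receive(self, src, seq, payload):
        """Store one packet; return the list of packets now deliverable in order."""
        self.pending.setdefault(src, {})[seq] = payload
        self.next_seq.setdefault(src, 1)
        released = []
        # Drain every consecutive packet starting from the expected number
        while self.next_seq[src] in self.pending[src]:
            released.append(self.pending[src].pop(self.next_seq[src]))
            self.next_seq[src] += 1
        return released
```

Replaying the slide's example for source 1 (pkt_1_1, then pkt_4_1): packet 1 is delivered immediately, packet 4 is buffered until packets 2 and 3 arrive.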
Buffer reservation
[Figure: two message sequence charts between a sender agent and a receiver agent.
 a) Sender sends a notification of the next tx; receiver reserves a buffer, configures rx DMA, and returns an ACK; sender then sends the actual data, which is copied and consumed. The observed tx duration includes the handshake.
 b) Receiver reserves a buffer in advance and sends a notification of the reserved buffer; sender configures rx DMA and sends the actual data directly (optional ACK); receiver consumes the data and reserves the next buffer, etc. The observed tx duration is shorter.]
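The notify-reserve-ack handshake can be sketched as follows: the sender announces the transfer size, the receiver reserves buffer space and acks, and only then does the data move, so the reserved space is guaranteed to exist on arrival. The class, the helper, and the capacity bookkeeping are illustrative assumptions, not the actual protocol:

```python
class Receiver:
    """Receiver agent with a bounded rx buffer (illustrative sketch)."""

    def __init__(self, buf_capacity):
        self.free = buf_capacity  # words of buffer space still unreserved
        self.buf = []

    def notify(self, n_words):
        """Sender announces the next tx; reserve space and ack if it fits."""
        if self.free >= n_words:
            self.free -= n_words  # reserve buffer (rx DMA would be set up here)
            return True           # ACK
        return False              # NACK: sender must hold the data and retry

    def deliver(self, words):
        self.buf.extend(words)    # actual data lands in the reserved space

    def consume(self, n_words):
        del self.buf[:n_words]
        self.free += n_words      # consumed space can be reserved again


def send(receiver, words):
    """Sender side: notify first, transfer data only after a positive ACK."""
    if receiver.notify(len(words)):
        receiver.deliver(words)
        return True
    return False
```

The benefit over sending blindly is that data is never dropped for lack of buffer space, at the cost of one handshake round-trip per transfer (variant b of the figure hides that round-trip by reserving in advance).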
Intertwined/Reordering
 Transfers from different sources may be arbitrarily intertwined
 In addition, packets may arrive out-of-order
 These are either single words, bursts, or packets, depending on the network
[Figure: i) fixed-length packets arriving from the network (dd aa bb ee cc) sorted per source into "FIFO"-like buffers at destination0
 ii) variable-length packets sorted into linked-list buffers at destination0]
Irregular IP size
 IPs tend to have irregular size and shape
 Largest IP per row/column decides its height/width
  Some space is wasted; links will have varying length
 Reordering the IPs reduces area
  Ensure that frequently communicating IPs are still close to each other
[Figure: floorplans before and after reordering; <19.5% reduction in area>]
Customized mesh
 Connect more than one IP to one router
  Somewhat smaller bandwidth available per IP
  Usually enough, though
 Adopt a totally customized topology (the rightmost fig.)