TKT-2431 SoC Design
Lec 10 - On-chip communication
Erno Salminen, Tero Arpinen
Department of Computer Systems, Tampere University of Technology
Fall 2010
Copyright notice
Part of the slides adapted from the slide set for course EE249 at University of California, Berkeley, by Alberto Sangiovanni-Vincentelli:
http://www-cad.eecs.berkeley.edu/~polis/class/lectures.shtml
and from Timo D. Hämäläinen, Managing On-Chip Communications, SoC Symposium, Tampere, 19.11.2003.
Copyright (2)
Part of the figures from:
L. Benini, G. De Micheli, "Networks on chips: a new SoC paradigm", Computer, vol. 35, no. 1, Jan. 2002, pp. 70-78.
V. Lahtinen, Design and Analysis of Interconnection Architectures for On-Chip Digital Systems, PhD Thesis, Tampere University of Technology, Department of Information Technology, June 2004. http://www.tkt.cs.tut.fi/research/daci/pub_open/lahtinen_thesis.pdf
W. Wolf, A.A. Jerraya, G. Martin, "Multiprocessor System-on-Chip (MPSoC) Technology", IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 27, no. 10, pp. 1701-1713, Oct. 2008.
Erno Salminen - Nov. 2010
Contents
- Problem statement
- Physical limitations
- Network-on-chip (NoC)
- Extra

See also:
E. Salminen, A. Kulmala, T.D. Hämäläinen, "Survey of Network-on-chip Proposals", white paper, OCP-IP, April 9, 2008, 13 pages. [online]: http://www.ocpip.org/socket/whitepapers/OCP-IP_Survey_of_NoC_Proposals_White_Paper_April_2008.pdf
E. Salminen, A. Kulmala, T.D. Hämäläinen, "On Network-on-chip comparison", Euromicro Conf. on Digital System Design, Lübeck, Germany, August 27-31, 2007, pp. 503-510. http://daci.digitalsystems.cs.tut.fi:8180/pubfs/fileservlet?download=true&filedir=dacifs&freal=Salminen_-_On_Network-on-chip_compar.pdf&id=82519
At first
Make sure that simple things work before even trying more complex ones.
Problem Statement - SoC Complexity
- SoC consists of heterogeneous components
- Varying communication requirements/profiles
- Not all components communicate with each other

[Figure: an SoC where Proc_1..Proc_N, Acc_1..Acc_N, Mem_1..Mem_N, and Periph_1..Periph_N are attached to a communication network]
Different requirements
1. Varying bandwidth (or throughput)
   - Amount of data transferred in unit time [MB/s]
   - High requirement between CPU and memory (high BW)
   - Low requirement between CPU and peripheral (low BW)
2. Different latency expectations

[Figure: CPU_1..CPU_N and Acc_1..Acc_N connected to Mem_1..Mem_N (high BW) and Periph_1..Periph_N (low BW)]
Characteristics of offered traffic load
1. Spatial: where the data go - are all sources similar?
   a) one destination: neighbor
   b) one destination: some node
   c) few destinations
   d) send to all
2. Temporal: average data rate
3. Temporal: when to transfer
   a) Short bursts of high transfer activity and long periods of inactivity
   b) Transfers with constant sizes and intervals

[Figure: spatial patterns a-d from a source node, and temporal patterns of data amount vs. time: very bursty, moderately bursty, and constant bitrate]
Basic metric: Latency
- Delay between start of transfer and completion:
  time(last data ejected) - time(first data enters)
  [n cycles for transferring d words]
- Interrupts usually require low latency
- Cache fills require low latency
- Real-time systems require guaranteed latency (always below some limit)
- Stream data (voice, video) may require constant latency (low jitter)
Measuring load-latency behavior
- Traffic generator mimics IPs: sends data, receives data
- One should
  1. include the latency of the network interface (NI)
  2. exclude the headers when calculating traffic load
  3. measure the latency of whole transfers (which may be several packets, i.e. at least one full packet, not just header latency)
  4. include an "infinite" buffer at the source to avoid throttling
[Salminen, On the credibility of load-latency measurements, SoC, 2008]
Measured load-latency curve
- Network saturates when the traffic load gets too high: latency approaches infinity
- Certain bounds can be derived analytically
- Of course, the goal is minimum latency and a maximum saturation point
[Salminen, On the credibility of load-latency measurements, SoC, 2008]
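The saturating shape of a load-latency curve can be illustrated with a simple M/M/1 queueing approximation. This is only an illustration of the qualitative behavior, not the measurement method of the cited paper:

```python
def mm1_latency(load, service_time=1.0):
    """Average time in an M/M/1 queue (waiting + service).

    Latency grows without bound as the offered load approaches
    the saturation point (load -> 1), like a measured NoC curve.
    """
    if not 0 <= load < 1:
        raise ValueError("queue is saturated for load >= 1")
    return service_time / (1.0 - load)

for load in (0.1, 0.5, 0.9, 0.99):
    print(f"load {load}: latency {mm1_latency(load):6.1f} cycles")
```

Real networks saturate below load 1.0 because of contention and protocol overhead, which is why the slide recommends excluding headers when computing the offered load.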
Physical limitations
ITRS 2003: Interconnect
Chip cross-section
- Several metal layers - less congestion
- Hierarchical scaling: wires on top levels are wider and taller than on lower levels
- Top layers for power supply, clock, global signals

[Figure: chip cross-section from transistors up through local, intermediate, and global wiring layers]
ITRS 2003: Interconnect (2)
Note! Very important!
- Delay of global wires does not scale with technology

[Figure: relative delay vs. technology generation for gates, local signals, global signals, and global signals with repeaters (bigger area and energy)]
Several clock domains
- Not possible/practical to use the same clock in every component
- GALS - Globally Asynchronous, Locally Synchronous
  - Components have local clocks
  - Communication needs handshaking/synchronization

[Figure: SoC where components run at different frequencies, e.g. processors and memories at high frequency, peripherals at low frequency]
Energy breakdown forecast
[Figure: forecast of energy breakdown; compare computation vs. communication energy]
[Mattan Erez, Stream Architectures - Programmability and Efficiency, Tampere SoC, Nov. 17, 2004]
Localization
- Communication must be localized to avoid long wires
  - Long wires consume much energy, are slow, prone to error, and cause routing congestion
- Several small components instead of a few large ones
- Communication between non-neighboring components requires many hops
[Mattan Erez, Stream Architectures - Programmability and Efficiency, Tampere SoC, Nov. 17, 2004]
Reliability problems
- "Synchronization failures between clock domains will be rare but unavoidable" - Benini
- Electrical noise due to crosstalk, electromagnetic interference, radiation...
  - Data errors or upsets, soft errors
- Data transfers become unreliable and nondeterministic
- Design needs both deterministic and stochastic models
Achieving reliability
- Today, designers use physical techniques to overcome reliability problems:
  - Wire sizing
  - Length optimization
  - Repeater insertion
  - Shielding
  - Data coding
  - Bunch of others...
  - Huge design effort required
- In the (near) future, 100% reliability on the physical level cannot be afforded anymore
- Reliability must be increased with additional HW or SW layers:
  - Error detecting/correcting codes
  - Retransmissions
  - Request/acknowledge and time-out counters
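As a sketch of the layered approach (detection code plus retransmission), the following combines a single parity bit with a bounded retry loop. The framing and function names are made up for illustration and do not come from any particular NoC; a single parity bit detects only odd numbers of flipped bits:

```python
import random

def parity(word: int) -> int:
    """Even parity bit over a 32-bit word - a minimal error-detecting code."""
    return bin(word & 0xFFFFFFFF).count("1") & 1

def send_with_retry(word, channel, max_retries=5):
    """Send word + parity over an unreliable channel, retransmit on error.

    A failed parity check at the receiver means no acknowledge arrives,
    so the sender times out and retransmits (modeled by the loop).
    """
    for attempt in range(max_retries):
        data, par = channel(word, parity(word))
        if parity(data) == par:          # receiver's parity check passes
            return data, attempt
    raise TimeoutError("link too unreliable")

def noisy_channel(word, par, flip_prob=0.3):
    """Flips one random bit with probability flip_prob (a soft error)."""
    if random.random() < flip_prob:
        word ^= 1 << random.randrange(32)
    return word, par

random.seed(1)
data, retries = send_with_retry(0xCAFE, noisy_channel)
print(hex(data), "after", retries, "retransmissions")
```

Real NoC link layers typically use stronger codes (CRC, Hamming) so that multi-bit errors are caught or even corrected without a retransmission round-trip.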
Network-on-chip (NoC)
Network-on-Chip (NoC)
- Communication network on chip
- NoC motivation:
  1. High fab cost and effort in traditional VLSI - design a general-purpose platform
  2. Flexibility - for changing application needs
  3. Concurrency in transfers
  4. Only short signal wires due to power and delay problems
  5. On-chip wires are no longer reliable
- Usually a packet-switched, multi-hop network
Differences between Multiprocessors and SoC

Multiprocessor systems (past) | System-on-Chip (portable device)
Scaleability important after fab (increase nodes) | Scaleability an issue only at design time (reuse, easy addition of nodes)
Load balancing and even distribution of computation important for maximum performance | Energy consumption important, idle nodes must be shut down
Communication network used as means of balancing computation and communication (both adjusted for optimal performance) | Computation might already be fixed per node (functional partition); network serves nodes (only network adjusted)
Dataflow computing | Computation is very heterogeneous, both dataflow and control style
In principle any node can compute a given task | Execution of various applications clustered within SoC (specialized nodes)
Much experience and well-established research of routing, switching, scaleability - yet some research seems to be "re-inventing the wheel" of past multiprocessor research | New challenge: energy saving combined with tailoring according to applications
Micronetwork protocol stack
- Layers are specialized and optimized according to application (domain)

[Figure: protocol stack with rising abstraction: transport layer splits long transfers into packets and reorders them; network layer handles routing; data link handles arbitration and packetization to increase reliability; HW-dependent SW sits above the hardware layers]
NoC terminology
- Processing elements exchange messages
- Network interface converts messages to/from network-specific packets/streams
- Packet consists of several flits (≈ words)
- Routers communicate via ports, and ports on the boundary of the whole network are called terminals

[Figure: agent(0) = processing element + network interface, connected through router(0), router(1), router(2) (degree = 4) and links to agent(1); a message or stream is split into packets (pkt), packets into flits (fl), and flits into phits (ph)]
Abbreviations: fl = flit, flow control unit; ph = phit, physical unit; pkt = packet
Design choices of NoC
Basic considerations deal with:
1. Structure
   - topology - logical structure of routers and links (floorplan defines the physical layout)
   - router design
2. Control
   - routing - which way to take
   - flow control and switching - when to transmit
Homogeneous network
+ replication effect: solve realization issues once and for all
- less flexible
- Problematic if processing units are heterogeneous
  - assumes uniform size for components and hence either
    a) wastes area, or
    b) components have to be split
[H. Corporaal, Advanced Computer Architecture 5Z008 - Multiprocessors & Interconnect, course material, 2003]
Heterogeneous network
- common in contemporary SoCs
- better fit to application domain - better performance
- components are not uniformly sized
- hierarchical structure
- Are ASICs possible in the future anymore?
[H. Corporaal, Advanced Computer Architecture 5Z008 - Multiprocessors & Interconnect, course material, 2003]
Network topology
Defines
- the components (e.g. routers)
- the connections (e.g. each router connected to 4 neighbours)
Vast number of topologies proposed in literature - but there's no free lunch!

[Figure: example topologies; b = bus, hb = hierarchical bus, r = ring, p = point-to-point, ft = fat tree, x = crossbar, c = custom, t = 2-D torus]
Network topology (2)
Can be modeled with graphs:
- node = router (+ processing unit)
- edge = data stream
- Number of nodes denoted with N
- Average path length L
  - Avg. number of edges between all node pairs in the graph
  - Small L desired for small latency
- Average degree <k>
  - Avg. number of edges in each switch
  - Large <k> may decrease L, but implementation also gets more complex
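The two graph metrics above are easy to compute for small example topologies. A minimal sketch (breadth-first search from every node; the helper name is made up for this example):

```python
from collections import deque

def avg_path_length_and_degree(adj):
    """Average path length L and average degree <k> of an undirected
    graph given as {node: set_of_neighbors}."""
    n = len(adj)
    total_dist, pairs = 0, 0
    for src in adj:                       # BFS from every node
        dist = {src: 0}
        q = deque([src])
        while q:
            u = q.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    q.append(v)
        total_dist += sum(dist.values())
        pairs += n - 1
    L = total_dist / pairs                # mean over all ordered pairs
    k = sum(len(nb) for nb in adj.values()) / n
    return L, k

# 4-node bidirectional ring: each node linked to its two neighbours
ring = {i: {(i - 1) % 4, (i + 1) % 4} for i in range(4)}
L, k = avg_path_length_and_degree(ring)
print(f"L = {L:.2f}, <k> = {k:.1f}")   # L = 1.33, <k> = 2.0
```

Note that the L values quoted on the following slides may use a different convention (e.g. counting routers or network interfaces on the path), so they need not match this hop-count definition exactly.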
Metric: Bisection bandwidth
- When the design is partitioned into two (nearly) equal halves, it is the minimum number of wires which must cross between the halves, considering all possible partitions
  - Number of nodes in the halves differs at most by 1
  - Also other definitions exist...
- A high number means a higher number of possible routes and hence increased bandwidth, flexibility, and possibly fault-tolerance
- Should increase with the number of nodes in scalable networks
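The definition above translates directly into a brute-force computation for the small example graphs on these slides (exponential in general, so only a sketch):

```python
from itertools import combinations

def bisection_width(nodes, edges):
    """Minimum number of edges cut over all (nearly) equal bipartitions.

    'edges' is a list of (u, v) pairs. Tries every choice of one half
    and counts the edges crossing to the other half.
    """
    nodes = list(nodes)
    n = len(nodes)
    best = None
    for half in combinations(nodes, n // 2):   # one half; the rest is the other
        half = set(half)
        cut = sum((u in half) != (v in half) for u, v in edges)
        best = cut if best is None else min(best, cut)
    return best

# 4-node ring 0-1-2-3-0: any equal split cuts at least 2 links
ring_edges = [(0, 1), (1, 2), (2, 3), (3, 0)]
print(bisection_width(range(4), ring_edges))   # 2
```

A ring always has bisection width 2, independent of N, which is exactly why the slide calls for bisection bandwidth that grows with the number of nodes in scalable networks.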
Generic router
- Forwards data from input ports to output ports
- FIFOs can be on either side of the crossbar
  - 1 FIFO per port is the most common
  - virtual channels allow multiple FIFOs per port
- Area and delay increase rapidly with the number of ports

[Figure: generic router with input ports feeding FIFOs, a crossbar to the output ports, controlled by routing logic and an arbitrator]
Routing algorithm
Selects the route from source to destination
1. Deterministic
   - Same route always used between source and destination
   - e.g. 2-D mesh: first find the correct row, then the correct column
   - All packets arrive in order
   - One blocked (or faulty) link/router blocks all packets on that route
2. Adaptive
   - Route varies according to blockage
   - Better performance (at least when reordering is neglected)
   - Better fault-tolerance
   - Deadlock avoidance needs extra care
   - Data may arrive out of order
     - Reordering buffers required at the receiver
     - Buffers may consume large area/energy
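The deterministic 2-D mesh example is usually called dimension-ordered (XY) routing. A minimal sketch of the route computation, with nodes as (x, y) coordinates:

```python
def xy_route(src, dst):
    """Deterministic XY routing on a 2-D mesh: correct the x coordinate
    first, then the y coordinate. Returns the list of visited nodes."""
    (x, y), (dx, dy) = src, dst
    path = [(x, y)]
    while x != dx:                 # move along the first dimension
        x += 1 if dx > x else -1
        path.append((x, y))
    while y != dy:                 # then along the second dimension
        y += 1 if dy > y else -1
        path.append((x, y))
    return path

print(xy_route((0, 0), (2, 1)))
# [(0, 0), (1, 0), (2, 0), (2, 1)]
```

Because every source-destination pair always yields the same path, packets cannot overtake each other, which gives the in-order delivery property noted above.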
Switching
1. Store-and-forward switching
   - Data forwarded when the whole packet has been received
   - Whole packet buffered - increases area and latency
2. Virtual cut-through:
   - Data forwarded ASAP
   - Whole packet buffered if the output is blocked
3. Wormhole:
   - Data forwarded ASAP
   - Buffer sizes can be independent of the packet size
   - Reserves the whole transfer path and hence increases contention
- Some schemes drop packets when contention is high
  - Highly nondeterministic
  - Acknowledges required (roundtrip latency, buffers for retransfers)
  - Not recommended in general
- Buffering has a big impact on NoC performance and router area
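The latency cost of store-and-forward can be made concrete with the usual zero-load (no contention) approximations: store-and-forward pays the full packet serialization at every hop, while wormhole (and virtual cut-through) pays it only once, behind the header. A small sketch:

```python
def store_and_forward_latency(hops, packet_flits, cycles_per_flit=1):
    """Each router buffers the whole packet before forwarding it."""
    return hops * packet_flits * cycles_per_flit

def wormhole_latency(hops, packet_flits, cycles_per_flit=1):
    """Header flit pays one hop delay per router; the body pipelines
    behind it. Zero-load approximation; virtual cut-through is the same."""
    return hops * cycles_per_flit + (packet_flits - 1) * cycles_per_flit

hops, flits = 5, 16
print(store_and_forward_latency(hops, flits))  # 80 cycles
print(wormhole_latency(hops, flits))           # 20 cycles
```

For a single hop the two are equal; the gap grows with both hop count and packet length, which is why multi-hop NoCs rarely use store-and-forward.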
Quick terminology quiz
What is in common with the following terms?
- Koala bear
- Whale fish (valaskala in Finnish)
- Wormhole routing
Such things do not exist, although many people talk about them:
- Koala is a marsupial
- Whale is a mammal
- Wormhole is a switching policy
Example topologies
(Shared multimaster) bus
- Bus = set of signals connected to all devices
- Shared resource
  - One connection between devices reserves the whole interconnection
  - Bandwidth shared among devices
- Bandwidth may be scaled by adding links (multiple bus)
- Most common SoC network
  - Low implementation cost, simple
- Long signal lines problematic

Single bus: N = 16, L = 1, <k> = -
Multiple bus: N = 16, L = 1, <k> = -
Bus arbitration / address decoding
- Arbitration decides which master can use the shared resource (e.g. bus or memory)
  - Single-master system does not need arbitration
  - E.g. priority, round-robin, TDMA
  - Two-level: e.g. TDMA + priority
  - May be pipelined with the previous transfer
- Decoding is needed to determine the target
  - Central / distributed schemes
  - Address and data are broadcast to every node
  - Decoder selects which node reads the data or responds
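Of the arbitration policies listed above, round-robin is the simplest fair one. A behavioral sketch (not any specific bus standard's arbiter): the search for the next grant starts just after the previously granted master, so every requester gets a turn.

```python
def round_robin_arbiter(requests, last_grant):
    """Grant the first requesting master after the previously granted one.

    'requests' holds one truthy/falsy value per master. Returns the index
    of the granted master, or None if nobody is requesting.
    """
    n = len(requests)
    for offset in range(1, n + 1):       # rotate the search start
        candidate = (last_grant + offset) % n
        if requests[candidate]:
            return candidate
    return None

grant = 0
for cycle, req in enumerate([[0, 1, 1], [1, 0, 1], [1, 1, 0]]):
    grant = round_robin_arbiter(req, grant)
    print(f"cycle {cycle}: grant -> master {grant}")
```

A fixed-priority arbiter would instead always scan from index 0, which is simpler but can starve low-priority masters under heavy load.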
Centralized / Distributed

[Figure 2. Centralized vs. distributed control.
a) Centralized: masters M1-M3 send request signals to a single arbiter and receive grant; a single decoder drives select to slaves S1-S3.
b) Distributed: each master agent has its own arbiter/decoder and each slave agent its own decoder.
M = master, S = slave]
Complex bus topologies
- Hierarchical bus - several bus segments connected with bridges
  - Fast access as long as the target is in the same segment
  - Requires locality of accesses
  - Theoretical max. speed-up = number of segments
  - Segments either circuit- or packet-switched together
    - Packet-switching provides more parallelism with added buffering
- Split-bus
  - No data storage - only three-state buffers
  - If the switches are non-conducting, smaller effective capacitance and, hence, smaller energy

Hierarchical bus (chain): N = 16, L = 2.3, <k> = 2
Hierarchical bus (chain + tree): N = 16, L = 2.1, <k> = 2.5
[Figure: split-bus with three-state buffers between segments]
Other topologies

Ring: N = 16, L = 6.3, <k> = 3
- Simple layout
- Unidirectional ring may result in long latency
- Good for pipelines

3-D hypercube: N = 8, L = 3.7, <k> = 8
- 3-D topologies are hard to map on a 2-D silicon die

Fully connected, point-to-point network: N = 16, L = 1, <k> = -
- Highest performance
- Clearly not a scalable approach
Topologies: mesh and torus
- 2-D mesh and torus are very popular
- Simple layout for uniformly sized nodes
- Wrap-around wires in the torus need special attention

2-D mesh: N = 16, L = 4.7, <k> = 4
2-D torus: N = 16, L = 4.1, <k> = 5
Topologies: Tree
- Traditional tree has bisection bandwidth = 1
  - Bottleneck for uniform traffic
  - Does not matter when the traffic is localized
- Fat tree has more (or wider) links near the root
  - Becoming more popular as a NoC topology
- Trees are also constructed so that each node is a processing node

Rooted, complete, binary tree: N = 16, L = 6.5, <k> = 2.9
Fat tree with butterfly elements and fanout of 2 (binary fat tree): N = 16, L = 6.5, <k> = 3.5
Topologies: static analysis
- Some basic properties may be analyzed statically
- Simulation with real applications preferred (i.e. dynamic analysis)

Lahtinen 2004, Table 3.2: Performance
Network                         | Parallel transactions | Longest path | Bisection bandwidth | Links
Single bus                      | 1                     | 1            | 1                   | Bi
Multiple bus                    | e (e ≤ N)             | 1            | e                   | Bi
Hierarchical bus (chain)        | e (e ≤ N)             | e (e ≤ N)    | 1                   | Bi
Crossbar                        | N                     | N            | N-1                 | Bi
One-sided crossbar              | N                     | 2N-1         | N/2                 | Bi
Binary tree                     | N                     | 2log2(N)     | 1                   | Bi
Fat tree (fanout 2)             | N                     | 2log2(N)     | N                   | Bi
Ring                            | N                     | N/2+2        | 2                   | Bi
3-D hypercube                   | N                     | log2(N)+2    | N/2                 | Bi
2-D mesh                        | N                     | 2N^(1/2)     | N^(1/2)             | Bi
2-D torus                       | N                     | N^(1/2)+2    | 2N^(1/2)            | Bi
Point-to-point, fully connected | N                     | 1            | (N/2)*(N/2)         | Bi
Omega network (MIN)             | N/2                   | log2(N)      | N                   | Uni

Lahtinen 2004, Table 3.3: Implementation costs
Network                         | Number of switches | Number of wires | Links
Single bus                      | 0                  | 1               | Bi
Multiple bus                    | 0                  | e               | Bi
Hierarchical bus (chain)        | e-1                | e               | Bi
Crossbar                        | N^2/4              | N^2/2           | Bi
One-sided crossbar              | N^2/2              | (N^2-N)/2       | Bi
Binary tree                     | N-1                | 2(N-1)          | Bi
Fat tree (fanout 2)             | N*log2(N)          | 2N*log2(N)      | Bi
Ring                            | N                  | 2N              | Bi
3-D hypercube                   | N                  | N+(N/2)*log2(N) | Bi
2-D mesh                        | N                  | 3N-2N^(1/2)     | Bi
2-D torus                       | N                  | 3N              | Bi
Point-to-point, fully connected | 0                  | (N^2-N)/2       | Bi
Omega network (MIN)             | (N/4)(log2(N)-1)   | (N/2)*log2(N)   | Uni
Daytona (2001), OMAP (2004), MPCore (2005)
[Figure: block diagrams of the three chips; interconnects shown: single bus, two buses, single bus]
[W. Wolf et al., "Multiprocessor System-on-Chip (MPSoC) Technology," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 27, no. 10, pp. 1701-1713, Oct. 2008]
Industrial example: Viper by Philips (2001)
[Figure: Viper block diagram with four buses]
[S. Dutta et al., "Viper: A multiprocessor SOC for advanced set-top box and digital TV systems," IEEE Design & Test of Computers, vol. 18, no. 5, pp. 21-31, Sep.-Oct. 2001]
ST Nomadik (2003)
[Figure: Nomadik block diagram with multiple buses]
Cell BE by IBM/Sony/Toshiba (2005)
[Figure: Cell BE with four rings]
[F. Khunjush, N.J. Dimopoulos, "Extended characterization of DMA transfers on the Cell BE processor," IEEE International Symposium on Parallel and Distributed Processing (IPDPS), pp. 1-8, 14-18 April 2008]
See also: D. Shippy, M. Phipps, The Race for a New Game Machine: Creating the Chips Inside the XBox 360 and the Playstation 3, Citadel, 2009
Tile64 by Tilera (2008)
- 2-D mesh with 4 DDR controllers for external memories
- Tile = 3-wide 32b VLIW, 750 MHz
- 90 nm, 615M transistors, 11 W
[S. Bell et al., "TILE64 Processor: A 64-Core SoC with Mesh Interconnect," ISSCC 2008]
Faust (2009)
- Modified 2-D mesh, asynchronous NoC
[E. Beigne et al., "An Asynchronous Power Aware and Adaptive NoC Based Circuit," JSSC, 2009]
Conclusion
- SoC has many components with different requirements
- Wire delays and power consumption are becoming very problematic
  - Big difference between local and global (or off-chip) communication
- Fully synchronous approach becoming unfeasible
- Network-on-chip = multi-hop on-chip network
  - Often packet-switched
  - Buffering, routing, and topology are important design decisions
NoC Survey
Note: All slides in this set are lecture material!
Survey of Network-on-chip proposals [2008]
- This paper gives an overview of the state of the art regarding network-on-chip (NoC) proposals.
- The NoC paradigm replaces dedicated, design-specific wires with a scalable, general-purpose, multi-hop network. Numerous examples from the literature are selected to highlight the contemporary approaches and reported implementation results. The major trends of NoC research and aspects that require more investigation are pointed out.
- A packet-switched 2-D mesh is the most used and studied topology so far. It is also a sort of an average NoC currently. Good results and interesting proposals are plenty.
- However, large differences in implementation results, vague documentation, and lack of comparison were also observed.
http://www.ocpip.org/uploads/documents/OCP-IP_Survey_of_NoC_Proposals_White_Paper_April_2008.pdf
Basic NoC properties
--- clip (39 lines omitted in the slide show) ---
NoC implementations
--- clip (14 lines omitted in the slide show) ---
Average NoC 2008
[Salminen et al., Survey of NoC proposals, OCP-IP, 2008]
Average NoC 2008 (2)
[Salminen et al., Survey of NoC proposals, OCP-IP, 2008]
Case Study
Managing Interconnection Complexity in Heterogeneous IP Block Interconnection (HIBI)
Overview of Managing On-Chip Communications
[Figure: interconnect evolution from dedicated point-to-point links, through a single bus, hierarchical bus structures, and regular multi-hop topologies, to customized multi-hop topologies. Axes: number of IP blocks / scaleability, latency & BW, flexibility, and network reuse. The extremes range from simple, always-guaranteed, limited, IP-block-specific links to very complex, design-once, general-purpose networks with best-effort/predictable service and arbitrary structure.]
Lessons Learned
- Many communication networks have been studied at TUT
  - On-chip communication research started 1997
- A regular topology can well be fitted to an algorithm-specific comp/comm-balanced implementation
  - In the general case there is no optimal topology
- Communication-centric design was successfully conducted for performance
  - Important to exploit features of the application(s) to optimize the interconnection
- Established parallel processing doctrines can be applied to SoC
  - The SoC challenge is heterogeneity in computation
Interconnection Implementation View
- Make lowest-level data transfer mechanisms simple and efficient
  - Minimum number of signals
  - "Every clock edge carries useful data in transaction"
- Perform all high-level operations on basic mechanisms
  - Layered protocol model, OCP compatible
  - Message passing
- Use identical HW modules to compose the overall interconnection
  - Translate IP-specific communication operations to the network
  - Support all (practical) topologies
  - No limits to the number of IP blocks (whole design)
  - Support (re-)configurability
  - Fit to all communication needs - from memories to peripherals
"Gives body to build interconnect"
System Design View
Make the interconnection aware of application functionality
A) System design time
  - Communication profiled from application processes
  - Clustering: localization of communication
  - Allocation of communication resources (segments, buffers)
  - Optimization of non-reconfigurable parameters
  - Initial QoS and other transfer parameters
B) Run time
  - Utilize knowledge of predictable communication events if available
    - Guaranteed QoS in transfers
  - Track communication - change QoS & other parameters if required
  - Totally change the mode of operation if required
- HIBI Design Flow is 80% of the HIBI interconnect scheme
"Gives brains to the communication"
HIBI Identical Interconnection Modules
- HIBI wrapper is the only building block used everywhere in the interconnection
  - Between network and IP blocks
  - Between network segments
- Wrapper is parametrizable, modular, and configurable
- Asynchronous FIFO buffering

[Figure: HIBI network where IPs (P1..PN, Acc1..AccN, Mem1..MemN) connect through HIBI wrappers; the IP side uses a FIFO / OCP interface]
HIBI Network
- HIBI network consists of bus segments and bridges
  - Transfers in a segment: synchronous, circuit-switched
  - Transfers across bridges: asynchronous, packet-switched
  - Scales from a serial point-to-point link to an arbitrary topology
- Identical signals between wrappers on the network side
  - No dedicated point-to-point signals
  - All signals shared within a network segment
  - Wrapper layout is independent of the number of agents
- Totally distributed arbitration
  - No central arbiter
  - Each wrapper is aware of communication details
HIBI Network Example
[Figure: several bus segments, each connecting IP blocks through HIBI wrappers; segments in different clock domains are joined by bridges built from pairs of wrappers]
Bus latency
Total latency consists of several phases. From: K. Kuusilinna, PhD Thesis, TUT, 2001.

1. Arbitration latency
   - Action: request bus ownership; wait for higher-priority transactions to complete / arbitration
   - Methods: central arbiter, daisy chain, wired-OR, connectionless arbitration; round-robin, hierarchical round-robin, time slot, fixed priority, adaptive
   - Waiting time may be long during high contention
2. Initial latency
   - Action: bus ownership granted; begin transaction; transfer first data
   - Methods: address/data multiplexing, handshaking; wait for master ready / wait for target ready
3. Subsequent data latency
   - Action: transfer data until all data has been transferred or a limit for data transfers per burst is reached
   - Methods: wait for master ready / wait for target ready
   - Optimizing this phase has the biggest impact in long transfers
4. Turn-around latency
   - Action: drive or wait for the bus to settle to the idle state

Figure: Bus latency
HIBI Quality of Service
- TDMA (time division multiple access) with freely run-time adjustable frame length, slot durations, and allocations
- Re-synchronization to application phase
- Also traditional priority/round-robin

[Figure: repeating time frames divided into time slots allocated to agents (e.g. A1, A2, A3) plus competition slots arbitrated with priority or round-robin]
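The TDMA scheme above can be sketched as a lookup from the cycle counter into a frame table. This is a behavioral illustration only - the frame contents and function name are made up, and real HIBI wrappers evaluate this in hardware from their configuration memory:

```python
def tdma_owner(cycle, frame):
    """Which agent owns the bus on a given clock cycle.

    'frame' is a list of (agent, slot_length) pairs; the frame repeats
    forever. Because the frame is plain data, slot durations and
    allocations can be adjusted at run time by rewriting the list.
    """
    frame_len = sum(length for _, length in frame)
    t = cycle % frame_len
    for agent, length in frame:
        if t < length:
            return agent
        t -= length
    raise AssertionError("unreachable: t < frame_len")

# frame: A1 gets 3 cycles, A2 gets 1, then 2 competition cycles
frame = [("A1", 3), ("A2", 1), ("competition", 2)]
print([tdma_owner(c, frame) for c in range(8)])
# ['A1', 'A1', 'A1', 'A2', 'competition', 'competition', 'A1', 'A1']
```

During the competition slots, any agent may contend for the bus using the priority or round-robin policy, so allocated slots give guaranteed bandwidth while competition slots absorb bursty best-effort traffic.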
HIBI Basic Transfer
- Pipelined with arbitration
- Split transactions
- Burst transfers
- No wait cycles allowed
- Non-pre-emptive transfers
- QoS is guaranteed with TDMA or with a combination of Send Max + Priority/Round-Robin

[Figure: pipelined transfers on the bus over time: write address + write data, read-request address + read-request data, and return address + return data interleaved; a request and its later response form a split transaction]
HIBI Wrapper Structure (v.2)
[Figure: wrapper with HI/LO priority tx FIFOs and HI/LO priority rx FIFOs between the IP signals (in/out) and the HIBI signals (in/out); a mux/demux pair, Tx FSM, Rx FSM, address decoder, and configuration memory]
Wrapper Configuration Memory
- Stores all information for distributed arbitration
  - Permanent: ROM, 1 page
  - Semi run-time configurable: ROM with several pages
  - Fully run-time configurable: RAM, with pages

[Figure: configuration pages multiplexed to the current configuration values; time-slot logic with a cycle counter drives the time-slot signals; new configuration values are demultiplexed into a page]
HIBI Wrapper Area in ASIC
[Figure: wrapper gate count (up to ~35 000 gates) for 8/16/32/64-bit data widths with ROM- and RAM-based configuration memories, for three configurations: 3/3 lo-prior and 0/0 hi-prior FIFOs with a 1-page memory; 5/5 lo-prior and 5/5 hi-prior FIFOs with a 1-page memory; 10/5 lo-prior and 10/5 hi-prior FIFOs with a 2-page memory]
Runtime comparison
[Salminen et al., SAMOS 2005]
Other notes on NoC
Network topology categoriesNetwork topology categories1. Static networks utilize only point-to-point or
shared connection lines2. Dynamic networks use switches (or routers)
for communicationa) Direct = each processing node connected to
switchb) Indirect = some switches are not connected
directly to any processing node
Problems with Current NoC Discussion
 What is "NoC"? No common definition
  Something new, good by definition (needs no proof), ...
 General purpose, but to what extent?
  Arbitrary connectivity between any nodes? Uniform overall transfer distribution?
 Discussion about "optimal topology"
  Multiprocessor architectures for scientific computations? Can massive fine-grain parallelism be utilized in realistic SoC applications?
 Copying computer network ideas without criticism
  In-network data buffering, routing tables and algorithms: compare to current TCP/IP or past ATM routers!
 Toy test case applications
  Billion transistors, yet executes a single FFT? Common benchmarks should be designed!
Wiring hierarchy
 How far can a signal reach in one local clock cycle?
 Depends on
  frequency (i.e. duration of the clock cycle)
  wiring parameters (layer, width, height, density, shielding)
 Not far anyway...
 Global wires will function as lossy transmission lines
 RC models of today become inaccurate; 3-D modeling is s-l-o-w and difficult
[Figure: cross-section of the wiring hierarchy with global, intermediate, and local layers]
[H. Corporaal, Advanced Computer Architecture 5Z008 - Multiprocessors & Interconnect, course material, 2003]
Crosstalk impact
 Long, fast-switching wires close to each other
 Switching on neighbor wires affects delay
 Delay on wire 4 shown in table 2
P. Liljeberg et al., Self-timed Approach for Noise Reduction in NoC, in "Interconnect-centric design for advanced SoC and NoC", Kluwer, 2004
Transaction latency components
Scalable Multiprocessors, lecture slides, http://www.cs.princeton.edu/courses/archive/spr07/cos598A/
Impact of DMA
[Figure: i) agent block diagram — CPU core with instruction and data memories, a DMA unit, a network interface, and other peripherals]
[Figure: ii) execution timelines with and without DMA for three cases:
 a) short comm time — w/o DMA: comp, comm, comp, ...; w/ DMA: comm hidden under comp
 b) equal comp and comm time — w/ DMA: comm of each block overlaps comp of the next
 c) long comm time — comm dominates; w/ DMA the overlap helps but comm sets the total time]
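A simple back-of-the-envelope model of the timelines in the figure: without DMA the CPU alternates computation and communication, while with DMA the transfer of one block overlaps the computation of the next. The model below is a sketch under that assumption, not a cycle-accurate account:

```python
def total_time(n_blocks, comp, comm, with_dma):
    """Total execution time for n_blocks units of work (illustrative model).

    comp = computation time per block, comm = communication time per block.
    Without DMA the CPU serializes them; with DMA, the comm of block i
    overlaps the comp of block i+1, so the steady-state cost per block
    is max(comp, comm).
    """
    if not with_dma:
        return n_blocks * (comp + comm)
    # The first comp cannot be overlapped and the last comm cannot be hidden
    return comp + (n_blocks - 1) * max(comp, comm) + comm
```

With equal comp and comm (case b), the saving approaches half the runtime for long runs; with short comm (case a) almost all communication is hidden; with long comm (case c) the benefit shrinks because communication dominates either way.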
Retransfer buffers
 If packets are dropped or corrupted in delivery, they (usually) have to be retransferred
 Variable latencies are problematic: is the packet dropped or just having a longer latency?
  If the time-out latency is exceeded, the packet is assumed to be missing
 Source must store packets until it receives an acknowledgement of successful transfer
  Sending an acknowledgement after each packet results in a small buffer but (at least) double latency
  Sending an ack after every N packets requires bigger buffers but gives better performance
[Figure: a) ack for each packet: source keeps a 1-packet buffer; latency per pkt = send_latency + ack_latency
 b) ack for every N packets: source keeps an N-packet buffer; destination returns ack (ok, ok, fail, ok); latency per pkt = (N*send_latency + ack_latency) / N]
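The two per-packet latency figures in the slide follow from one formula, with n = 1 reproducing the ack-per-packet case. A quick check with hypothetical cycle counts:

```python
def per_packet_latency(send_latency, ack_latency, n):
    """Average latency per packet when one ack covers n packets.

    The source must also buffer n packets while awaiting the ack, so this
    single formula captures the buffer-size vs latency trade-off above.
    """
    return (n * send_latency + ack_latency) / n
```

For example, with send_latency = ack_latency = 10 cycles, ack-per-packet costs 20 cycles per packet, while ack-per-4-packets costs 12.5 cycles per packet at the price of a 4-packet source buffer.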
Reordering buffers
 Packets arriving out-of-order may require huge reordering buffers
 Sometimes processing units may accept out-of-order delivery, or buffers can be integrated with the internal memory of the processing unit
 If an ack is sent after 4 packets, a buffer for 4 packets is needed
 Furthermore, separate buffers are needed for each source as data may be received in an interleaved manner
  E.g. (pkt_<n>_<src>) received: pkt_1_1, pkt_4_1, pkt_4_2, pkt_3_3...
 E.g. if an ack is sent after every N packets and there are S sources,
  reorder buffer size = N*S packets
[Figure: a) ack for each packet forces in-order delivery — destination needs only one buffer per source
 b) ack for every N packets — destination needs N buffers per source]
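The interleaved, out-of-order arrival pattern above can be handled with one reorder buffer per source that releases packets only in sequence order; worst-case occupancy is then N packets per source, i.e. N*S in total. A minimal sketch (class and method names are illustrative, not any particular NoC's implementation):

```python
class ReorderBuffer:
    """Per-source reorder buffer: releases packets strictly in sequence order.

    Packets from different sources may interleave arbitrarily; each source
    gets its own pending map, matching the N-buffers-per-source figure.
    """

    def __init__(self):
        self.pending = {}   # src -> {seq: payload} of out-of-order packets
        self.next_seq = {}  # src -> next expected sequence number

    def receive(self, src, seq, payload):
        """Store one packet; return the list of packets now deliverable in order."""
        self.pending.setdefault(src, {})[seq] = payload
        self.next_seq.setdefault(src, 1)
        released = []
        # Drain every consecutive packet starting from the expected number
        while self.next_seq[src] in self.pending[src]:
            released.append(self.pending[src].pop(self.next_seq[src]))
            self.next_seq[src] += 1
        return released
```

Replaying the slide's example for source 1 (pkt_1_1, then pkt_4_1): packet 1 is delivered immediately, packet 4 is buffered until packets 2 and 3 arrive.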
Buffer reservation
[Figure: two message sequence charts between a sender agent and a receiver agent.
 a) Sender sends a notification of the next tx; receiver reserves a buffer, configures rx DMA, and returns an ACK; sender then sends the actual data, which is copied and consumed. The observed tx duration includes the handshake.
 b) Receiver reserves a buffer in advance and sends a notification of the reserved buffer; sender configures rx DMA and sends the actual data directly (optional ACK); receiver consumes the data and reserves the next buffer, etc. The observed tx duration is shorter.]
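The notify-reserve-ack handshake can be sketched as follows: the sender announces the transfer size, the receiver reserves buffer space and acks, and only then does the data move, so the reserved space is guaranteed to exist on arrival. The class, the helper, and the capacity bookkeeping are illustrative assumptions, not the actual protocol:

```python
class Receiver:
    """Receiver agent with a bounded rx buffer (illustrative sketch)."""

    def __init__(self, buf_capacity):
        self.free = buf_capacity  # words of buffer space still unreserved
        self.buf = []

    def notify(self, n_words):
        """Sender announces the next tx; reserve space and ack if it fits."""
        if self.free >= n_words:
            self.free -= n_words  # reserve buffer (rx DMA would be set up here)
            return True           # ACK
        return False              # NACK: sender must hold the data and retry

    def deliver(self, words):
        self.buf.extend(words)    # actual data lands in the reserved space

    def consume(self, n_words):
        del self.buf[:n_words]
        self.free += n_words      # consumed space can be reserved again


def send(receiver, words):
    """Sender side: notify first, transfer data only after a positive ACK."""
    if receiver.notify(len(words)):
        receiver.deliver(words)
        return True
    return False
```

The benefit over sending blindly is that data is never dropped for lack of buffer space, at the cost of one handshake round-trip per transfer (variant b of the figure hides that round-trip by reserving in advance).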
Intertwined/Reordering
 Transfers from different sources may be arbitrarily intertwined
 In addition, packets may arrive out-of-order
 These are either single words, bursts, or packets, depending on the network
[Figure: i) fixed-length packets arriving from the network (dd aa bb ee cc) sorted per source into "FIFO"-like buffers at destination0
 ii) variable-length packets sorted into linked-list buffers at destination0]
Irregular IP size
 IPs tend to have irregular size and shape
 Largest IP per row/column decides its height/width
  Some space is wasted; links will have varying length
 Reordering the IPs reduces area
  Ensure that frequently communicating IPs are still close to each other
[Figure: floorplans before and after reordering; <19.5% reduction in area>]
Customized mesh
 Connect more than one IP to one router
  Somewhat smaller bandwidth available per IP
  Usually enough, though
 Adopt a totally customized topology (the rightmost fig.)