
High-speed bus arbiter for multiprocessors

A.B. Kovaleski, B.Sc.

Indexing term: Microprocessors

Abstract: Shared-bus interconnection schemes normally suffer from insufficient capacity. Increasing their bandwidth reduces the problem but makes bus arbitration somewhat difficult. This paper presents a fair bus-arbiter design, its implementation and simulation results. Although the technique originated from the particular constraints of the architecture considered, it is generally applicable to high-speed arbitration problems and has a low hardware cost.

1 Introduction

The classical shared bus is one of the most common multiprocessor interconnection schemes and remains attractive as a way of connecting small numbers of microprocessors. Its simplest form is embodied in standards such as the Multibus (IEEE P796). These usually employ fixed, position-dependent priority schemes and have limited bandwidth because the bus is always allocated for at least one complete memory cycle. If processors have no local memory, then the bus will saturate with only two or three active [1, 2] (roughly the average fetch period divided by the processor's memory cycle time). Performance is greatly improved if code is kept in local memory, but system bandwidth is still limited by the processor's memory cycle rather than bus or memory bandwidths [1, 2].

Systems based on shared buses can be further improved by several methods. In a general-purpose system, where processors are deemed equal, fair bus arbitration gives higher throughput and simplifies task-processor allocation. Splitting the memory cycle into two halves (send address packet, return data packet) allows full utilisation of the bus at the price of extra logic at the processor and memory interfaces. Together, these techniques maximise the usefulness of a shared bus, but, nevertheless, limit it to supporting tens of processors [3, 5]. The only way to make such a system extensible almost without limit is to allow buses to be interconnected. Since circuit switching has significant deadlock potential, the bus link must be able to queue and route packets [6]; hence we arrive at the concept of a packet-switched bus network. Depending on its topology, such an architecture allows access times to be kept within reasonable bounds (a 'tree' makes time grow at the logarithm of the number of buses). To make full use of this, the phenomenon of 'program locality' should be exploited so that processors make most nonlocal accesses to their own bus [6].

1.1 Bus network architecture
The above observations lead to an architecture which comprises buses and devices (processors, memories, links and input/output). Perhaps the term 'shared bus' becomes misleading in this context because it has been transformed into a general information exchange [4], rather than a simple backplane. The adoption of a fast, synchronous, split-cycle bus protocol reinforces this view, since it requires that the bus be physically small (less than one metre [5]) and have its interfaces closely associated. The bus can be summarised as follows:

(i) it is internally synchronous
(ii) the bus is allocated for one bus-clock cycle
(iii) data transfer and arbitration are overlapped and synchronous with the bus clock
(iv) a bus cycle is either allocated or idle
(v) a new allocation is possible during each cycle
(vi) all devices are treated equally
(vii) up to 32 devices (processors, memories or links to other buses) are allowed on each bus.

Paper 2263E, first received 7th June and in revised form 12th October 1982. The author is with the Department of Electronic & Electrical Engineering, University College London, Torrington Place, London WC1E 7JE, England. IEE PROC., Vol. 130, Pt. E, No. 2, MARCH 1983

1.1.1 Devices: These are of two basic kinds, masters and slaves. A master (such as a processor) sends requests at arbitrary intervals and expects acknowledgment/data to be returned, whereas a slave receives requests, queues them until they can be serviced and expects no acknowledgement for return transmissions. Note that a processor receiving nonlocal interrupts must behave like a slave.

Queueing is necessary because the bus is effectively packet-switched, and to prevent costly multiple requests a slave must be able to accept requests at full bus speed and service them later. The lengths of the queues are bounded by the fact that a master may not continue until it receives acknowledgment or times out; thus the maximum queue size needed by a slave is given by the number of masters that can access it.
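This queue bound can be illustrated with a small sketch (Python, purely illustrative; the request probability and device count are invented for the example): because a master blocks until it is acknowledged or times out, a slave can never hold more than one queued request per master.

```python
from collections import deque
import random

# Illustrative model: each master has at most one outstanding request,
# so a slave queue sized to the number of masters can never overflow.
NUM_MASTERS = 4

def simulate(cycles: int, seed: int = 0) -> int:
    """Return the peak slave queue length observed over `cycles` bus cycles."""
    rng = random.Random(seed)
    queue = deque()                       # the slave's request queue
    outstanding = [False] * NUM_MASTERS   # at most one pending request per master
    peak = 0
    for _ in range(cycles):
        for m in range(NUM_MASTERS):
            if not outstanding[m] and rng.random() < 0.5:
                queue.append(m)           # master m issues a request and blocks
                outstanding[m] = True
        peak = max(peak, len(queue))
        if queue:                         # slave services one request per cycle
            outstanding[queue.popleft()] = False
    return peak

assert simulate(10_000) <= NUM_MASTERS
```

However long the simulation runs, the peak never exceeds the number of masters, which is the sizing rule stated above.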

A link clearly must consist of two queues and two address maps indicating which packets it should accept from each bus. This allows both optimal routing in arbitrary graphs and the checking of address validity.

1.1.2 Buses: The design clock speed was 20 MHz, allowing the transfer of a 64-bit packet (16 bits data, 32 bits destination address, 16 bits source address) every 50 ns bus cycle. This poses a very difficult arbitration problem and clearly precludes any serial arbitration scheme such as a daisy chain, since these would require a stage delay of less than 1.6 ns! The problem is further complicated by the need to operate a fair strategy such that all devices have equal rights on average.

1.1.3 Bus cycle protocol: Fig. 1 shows the basic timing involved in requesting to transmit a packet via a bus, while Fig. 2 defines the hardware environment in which the arbiter network operates. Two devices (n and m) request the bus in the same cycle. Their requests are synchronised by the L1 latches and then again on the next cycle by the L2 latches. With a 50 ns bus clock period, this ensures that a metastable state of less than 40 ns (allowing for normal delays) at L1 will be ignored. At the request rates likely (a few megahertz) and using 74S74 latches, this gives a synchroniser failure rate of less than one per year [7]. Overall, the probability of system failure is much smaller, since L1's metastable state must trigger a metastable state in L2 before incorrect arbitration (multiple acknowledgment) is possible.
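The synchroniser argument can be made quantitative with the standard exponential MTBF model, MTBF = e^(t/tau) / (T0 * fc * fd). The sketch below is illustrative only: the constants tau and T0 are placeholder values, not measured 74S74 parameters, and only the shape of the relationship (exponential gain from extra resolution time) is the point.

```python
import math

def synchroniser_mtbf(t_resolve_ns: float,
                      tau_ns: float = 1.5,        # placeholder resolution constant
                      t0_ns: float = 1.0,         # placeholder metastability window
                      f_clock_hz: float = 20e6,   # 20 MHz bus clock
                      f_data_hz: float = 2e6) -> float:  # "a few megahertz" requests
    """Mean time between synchroniser failures, in seconds (illustrative model)."""
    t0_s = t0_ns * 1e-9
    return math.exp(t_resolve_ns / tau_ns) / (t0_s * f_clock_hz * f_data_hz)

# Allowing 40 ns for metastability to resolve gives exponentially more margin
# than allowing 10 ns would.
assert synchroniser_mtbf(40.0) > synchroniser_mtbf(10.0) * 1e6
```

This is why cascading L1 and L2 is so effective: each extra latch stage multiplies the resolution time available, and the failure probability falls exponentially with it.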

It is thus reasonable to consider that the arbiter operates in a completely synchronous environment and has almost a complete bus cycle to settle. In Fig. 1, n happens to be chosen first and its transfer occurs on the next cycle, during which m is being chosen to transfer on the following cycle.



Arbitration is thus pipelined with data transfer, such that both can be performed on each bus cycle. Note that the requesting device must negate its request while 'acknowledge' is asserted to clear the pipeline. The requestor's interface must also latch the 'accept' signal at the end of its transfer and log a nonexistent-device error if 'accept' is not asserted. The receiving device (processor, memory, link) must thus decode its address and assert 'accept' before the end of the current cycle.

All devices obey the same protocol, so a processor's read to an on-bus memory will involve two cycles, one initiated by the processor's interface and the other initiated by the memory's interface.

Fig. 1A Bus timing and protocol [timing diagram: bus clock CLK; request N, SREQ N, acknowledge N, ACK*; request M, SREQ M, acknowledge M; bus user; accepted A*]

Fig. 1B Environment of arbiter [signal path: req, clk, synchroniser, arbiter, pipeline latch]

2 Arbitration schemes

Equal priorities are desirable, since they ensure that processors with similar workloads obtain similar service, even when the overall bus loading is high. Fixed priority schemes give low-priority processors poor access when the bus is nearly saturated, and complete lockout is possible if the access patterns of higher-priority processors are very regular. In systems allowing dynamic allocation of tasks to processors, the scheduling problem is greatly simplified if all processors are equal. This kind of 'floating master' operating system is most likely to be applicable to general-purpose computing, but also arises in embedded applications where 'graceful degradation' is required.

A system that assumes fairness must achieve it. If a parallel group of tasks is 'spawned', then their results must be collated when they complete. The group execution time is thus determined by that of the slowest task in the group. Should the longest task be allocated to the slowest processor, the overall throughput will be seriously degraded.

2.1 Fairness
If all devices are to be deemed basically equal, then a concept of 'fair allocation' must be defined. Necessary conditions are:

(i) equal priority on average
(ii) guaranteed access in a finite time.

Table 1: 4-device linear rotating priority

Time (bus cycle):      1  2  3  4  5  6  7
Priority of device 0:  4  3  2  1  4  3  2
Priority of device 1:  1  4  3  2  1  4  3
Priority of device 2:  2  1  4  3  2  1  4
Priority of device 3:  3  2  1  4  3  2  1
Priority state:        s0 s1 s2 s3 s0 s1 s2
State-control line K0: 0  1  0  1  0  1  0
State-control line K1: 0  0  1  1  0  0  1

Fig. 2 The 4:4:2 scheme [eight groups of four devices feed level-1 4-arbiters, each generating an ANY4 signal; level 2 arbitrates the ANY4 signals in two groups of four, generating two ANY16 signals; level 3 arbitrates between these]

A strictly fair arbiter will ensure that each of the n active devices will obtain at least one access in n active bus cycles.

2.2 Linear rotating priority
Conceptually, this scheme is very simple. Each processor is given a time-varying priority which is increased once per bus cycle until it reaches maximum, and is then reduced to minimum to repeat the cycle (Table 1). Simulations show that this only works well under the unrealistic condition of equal processor workloads and where the priority rotation skips inactive devices. If the cycle includes all 32 possible devices (irrespective of whether they exist), then relative priorities are not equal on average. For example, consider devices '0' and '1' only, in Table 1: '1' has a higher priority 75% of the time. This effect leads to significant speed variations that are difficult to model and hence predict. Although these inhomogeneities depend strongly on the regularity of processor accesses, they are significant for 'real' code fragments when the bus loading is high. From the discussion of Section 2, it is clear that this is undesirable; but there is a further and more serious problem with this scheme. If the logic is implemented in a sum-of-products form, then the number of product terms rises with the cube of the number of devices arbitrated. To see why this

Table 2: 2-device linear rotating priority

Time (bus cycle):      1  2  3  4
Priority of device 0:  2  1  2  1
Priority of device 1:  1  2  1  2
Priority state:        s0 s1 s0 s1
State-control line K:  0  1  0  1



Table 3: 2-device arbiter truth table

Inputs I0:        0 0 0 0 1 1 1 1
Inputs I1:        0 0 1 1 0 0 1 1
State control K:  0 1 0 1 0 1 0 1
Output O0:        0 0 0 0 1 1 1 0

where Ij is true when a request is pending from device j, 0 ≤ j ≤ 1; Oj is true when requesting device j is acknowledged, 0 ≤ j ≤ 1; K is true in state s1 and false in s0.

is so, the Boolean equations required to generate the priority pattern of Table 1 will be derived.
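The priority pattern of Table 1, and the pairwise bias noted in Section 2.2, can be reproduced with a short script before any equations are derived. The Python below is purely illustrative; the closed-form expression for the rotating priority is an assumption verified against the table's rows.

```python
def priority(device: int, state: int, n: int = 4) -> int:
    """Priority of `device` in priority state `state` (higher number wins),
    matching the linear rotation of Table 1 (device 0 is highest in s0)."""
    return (device - state - 1) % n + 1

# Reproduce Table 1's row for device 0 over cycles 1..7 (states s0 s1 s2 s3 s0 s1 s2).
row0 = [priority(0, s % 4) for s in range(7)]
assert row0 == [4, 3, 2, 1, 4, 3, 2]

# Pairwise bias: device 1 outranks device 0 in 3 of the 4 states, i.e. 75% of the time.
wins = sum(priority(1, s) > priority(0, s) for s in range(4))
assert wins == 3
```

Note that the 75% bias between adjacent devices holds even with all four devices present, which is the inherent unfairness of linear rotation.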

Initially, it is simpler to take a 2-device case. This requires two request lines by which devices request access to the bus, two acknowledge lines by which access is granted and one state-control line (only two priority states exist).

Table 2 shows the priorities of the system in both states. From this, a truth table (Table 3) may be readily derived. By inspection, the Boolean equation for output 0 (O0) is

O0 = I0·K' + I1'·I0    (1)

(a prime denotes the complement). An inverted sum-of-products form is more useful:

O0' = I0' + K·I1    (2a)

Similarly,

O1' = I1' + K'·I0    (2b)

Intuitively, each output (acknowledge) must follow its input (request) unless blocked by a higher-priority input. It is clear that, when K is true, I1 can block O0, which is consistent with device 1 having a higher priority in state s1.
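Eqns. 1 and 2 are small enough to check exhaustively. The Python sketch below (illustrative only) encodes O0 = I0·(K' + I1') and O1 = I1·(K + I0') and confirms that at most one acknowledge is ever asserted and that a lone requester is always granted.

```python
from itertools import product

def arbiter2(i0: int, i1: int, k: int) -> tuple[int, int]:
    """2-device rotating-priority arbiter: device 1 wins ties when K is true."""
    o0 = i0 & ((1 - k) | (1 - i1))  # O0 blocked only by I1 in state s1
    o1 = i1 & (k | (1 - i0))        # O1 blocked only by I0 in state s0
    return o0, o1

for i0, i1, k in product((0, 1), repeat=3):
    o0, o1 = arbiter2(i0, i1, k)
    assert o0 + o1 <= 1                                # never multiple acknowledgment
    if i0 and not i1:
        assert o0 == 1                                 # lone requester always granted
    if i1 and not i0:
        assert o1 == 1
    if i0 and i1:
        assert (o0, o1) == ((0, 1) if k else (1, 0))   # tie goes to the priority holder
```

The exhaustive loop covers all eight rows of Table 3, so the two safety properties (mutual exclusion and liveness for a lone requester) hold in every state.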

This result may be extended to n devices, and will take the form

Oj' = Ij' + sum over k = 1 .. n−1 of Cjk·(Ik1 + Ik2 + … + Ikk)    (3)

where Oj is the jth acknowledge line for the jth device, Ij is the jth request line for the jth device, and Cjk is true in only one priority state (all Cjk are mutually exclusive). The set excludes the state in which the jth device has highest priority. And Ikl, 1 ≤ l ≤ k, is the set of inputs of higher priority than j in the state where Cjk is true.

From this, we can obtain the 4-device equations that generate Table 1, choosing the sets Ikl for linear rotating priority:

O0' = I0' + C1·I1 + C2(I1 + I2) + C3(I1 + I2 + I3)    (4a)
O1' = I1' + C2·I2 + C3(I2 + I3) + C0(I2 + I3 + I0)    (4b)
O2' = I2' + C3·I3 + C0(I3 + I0) + C1(I3 + I0 + I1)    (4c)
O3' = I3' + C0·I0 + C1(I0 + I1) + C2(I0 + I1 + I2)    (4d)

where

C0 = K0'·K1'    C1 = K0·K1'    C2 = K0'·K1    C3 = K0·K1

and K0, K1 are the state-control lines, as in Table 1.

If the equations are expanded into inverted sum-of-products form, then the number of products in each equation (the OR width required) for n devices is

S = 1 + n(n − 1)/2    (5)

Since we require n equations for n devices, the total number of products is

P = nS    (6)

If the linear rotating scheme is applied to the 32-device bus under consideration, then we obtain:

S = 497
P = 15 904    (7)
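Eqns. 5-7 can be checked directly; the short Python sketch below (illustrative only) counts one inverted-request term plus one product per (state, higher-priority input) pair, which is exactly the structure of eqns. 4.

```python
def or_width(n: int) -> int:
    """OR width per equation: 1 + sum over k = 1..n-1 of k = 1 + n(n-1)/2 (eqn. 5)."""
    return 1 + sum(range(1, n))

def total_products(n: int) -> int:
    """Total product terms for n equations of n devices (eqn. 6)."""
    return n * or_width(n)

# The 4-device equations (eqns. 4) each expand to 7 products: 1 + 1 + 2 + 3.
assert or_width(4) == 7

# The 32-device figures of eqn. 7.
assert or_width(32) == 497
assert total_products(32) == 15904
```

The cubic growth quoted in Section 2.2 is visible here: P = n + n²(n − 1)/2, which is dominated by the n³/2 term.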


Such a large number of terms is not practical for hardware implementation without resorting to custom VLSI, and the scheme does not perform well enough to justify this. However, note that the simplicity of eqns. 4 allows them to be implemented in readily available field-programmable logic arrays. Two complete sets of 4-device logic will fit in one Monolithic Memories PAL16L8 with a typical propagation delay of 25 ns [8]. This realisation is very attractive if the cubic explosion of terms (eqn. 6) can be avoided by modularising the 32-device problem into a set of 4-device problems. A hierarchical decomposition of 32-device arbitration should allow the use of a set of 4-device arbiters and result in a reasonable package count.

2.3 Strictly fair allocation
In this scheme, each active allocation of the bus reduces the active device to minimum relative priority and raises the relative priority of all other devices. The conditions for fair allocation (Section 2.1) are thus ideally met, rather than approximately or statistically. A hardware implementation has been presented [9], and it is sufficiently simple to operate at the bus speeds under consideration. It relies on an extension of parallel polling; an active device broadcasts to all others to modify their priority masks while setting its own to a minimum. Hence the arbitration is decentralised, has no well defined internal state and can assume many invalid states if a broadcast is corrupted by noise. At high speeds, noise problems are to be expected, so error recovery must be built into the system.

Two major types of invalid state exist. First, multiple acknowledgment may occur, leading to more than one processor driving the bus at the same time and, secondly, the arbiter can become deadlocked, giving no device access to the bus. Although both types of error are easy to detect, this requires extra logic, and, even so, multiple access will cause invalid bus levels which could lead to other parts of the system being corrupted. The hardware would in practice be complicated by the need for detection and recovery circuits.

2.4 Multilevel rotation
It was noted in Section 2.2 that the rotating priority can be efficiently implemented for small numbers of devices, especially four. The use of this in a hierarchical arbitration structure makes it extensible to a 32-device bus. Consider the priority pattern of Table 4; it can be seen that K0 swaps the priorities of '0' and '1' and of '2' and '3', while K1 swaps the relative priorities of the pairs. This can be described as a 2:2 rotation, since there are two devices in each group and two groups.



Table 4: 2:2 rotating priority

Time (bus cycle):      1  2  3  4  5  6  7
Priority of device 0:  4  1  2  3  4  1  2
Priority of device 1:  1  4  3  2  1  4  3
Priority of device 2:  2  3  4  1  2  3  4
Priority of device 3:  3  2  1  4  3  2  1
Priority state:        s0 s1 s2 s3 s0 s1 s2
State-control line K0: 0  1  0  1  0  1  0
State-control line K1: 0  0  1  1  0  0  1
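The pattern of Table 4 can be generated directly from the two swap operations that K0 and K1 perform, and its pairwise fairness checked exhaustively. The Python below is an illustrative sketch; the swap order is an assumption verified against the table.

```python
def priorities_22(k0: int, k1: int) -> list[int]:
    """Device priorities for the 2:2 rotation, indexed by device number (Table 4)."""
    p = [4, 1, 2, 3]                 # state s0 (K0 = K1 = 0), from Table 4
    if k1:                           # K1 swaps the relative priorities of the pairs
        p = p[2:] + p[:2]
    if k0:                           # K0 swaps within pairs: 0 <-> 1 and 2 <-> 3
        p = [p[1], p[0], p[3], p[2]]
    return p

states = [(0, 0), (1, 0), (0, 1), (1, 1)]   # s0, s1, s2, s3
table = [priorities_22(k0, k1) for k0, k1 in states]
assert table == [[4, 1, 2, 3], [1, 4, 3, 2], [2, 3, 4, 1], [3, 2, 1, 4]]

# Every pair of devices is pairwise fair: each outranks the other in 2 of 4 states.
for a in range(4):
    for b in range(a + 1, 4):
        assert sum(row[a] > row[b] for row in table) == 2
```

The 50/50 pairwise result contrasts with the 75/25 bias of linear rotation in Table 1.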

Note that this scheme does not have the unfairness inherent in that of Section 2.2.

For 32 devices, the number of levels and the number in each level must be chosen. The convenience of the 4-device arbiter naturally leads to a 4:4:2 arrangement. There are eight groups of four devices (Fig. 2). Each group is handled by a 4-arbiter and an 'ANY4' signal generated for the group. The eight 'ANY4' signals can now be treated as two groups of four fed to two 4-arbiters. Finally, there are two 2nd-level 'ANY16' signals to arbitrate. Since the group signal can be generated by one level of OR-gating, the rotating arbitrations in the three levels can be overlapped. Section 3 will demonstrate that such an organisation can meet a 50 ns time constraint.

Having partitioned the problem efficiently and arrived at a tractable solution, there is still a further degree of freedom to explore. There are three levels of rotation, five associated state lines and hence 32 distinct states. Two obvious time orderings exist, level 3 fastest (2:4:4) and level 1 fastest (4:4:2). 2:4:4 rotation gives better results because it presents the requesting device with a more pseudorandom pattern of priorities. Nonuniform processor performance arises because of priorities independent of demand (the residual unfairness of the 4-arbiters) and because processors can fall into step with the pattern. The 2:4:4 pattern is most significant in breaking up such synchronisations.

The priority states of the system follow a complete binary sequence repeating every 32 cycles (1.6 μs) and hence there are no invalid priority codes. A simple binary counter can cycle the system, and under no circumstances can the internal state be lost; even if the counter fails, a fixed priority prevails, leading only to reduced performance. Similarly, the corruption of a request line (transient or permanent) can only lead to wasted cycles where no device is requesting. Obviously the system fails if the counter has failed (the priority state is frozen) and a request line is jammed active.

The arbitration scheme described is fast, efficient in hardware requirements and operation and quite robust. It appears to be a good compromise between strict fairness and reliability.

2.5 Comparison of simulation results
A Fortran program has been written to simulate the architecture cycle by cycle. It operates at the 'message' level; only the source, destination and timing of packets are considered. A processor is a device which requests access to the bus at intervals, and a memory accepts packets and possibly requests to return packets. No gate-level details are considered; however, since the system operates synchronously, the results should be very accurate. To obtain comparisons, a reasonable case must be considered. The one presented is of pure bus contention and no memory sharing. Consider N processors with no private memory accessing N memories for their code and data. Since the maximum number of bus devices is 32, this allows up to 16 processors, which is enough to drive the bus into saturation. The simulated workloads are identical for each processor and comprise the repeated execution of a code fragment to perform

a(i) := b(j)

where a and b are arrays of 32-bit objects. Although this fragment is simple and deterministic, it allows a quite reasonable access pattern to be generated.

The hardware considered consists of 8 MHz Motorola 68000 processors, 150 ns memories and a 20 MHz bus clock. Thus, for this fragment, 90% of the fetch periods are 500 ns, with an average of 537 ns. Since the effective bus bandwidth for reading is 10 MHz, the expected saturation performance is equivalent to 5.37 ideal M68000 processors.
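The saturation estimate follows from simple arithmetic, sketched below for clarity (illustrative Python; the 100 ns figure is the reciprocal of the 10 MHz effective read bandwidth quoted above):

```python
avg_fetch_period_ns = 537   # average 68000 fetch period for this fragment
bus_read_ns = 100           # 10 MHz effective bus bandwidth for reading

# Each ideal processor needs one bus read per fetch period, so the bus
# saturates at (average fetch period) / (bus read time) equivalent processors.
saturation_equivalents = avg_fetch_period_ns / bus_read_ns
assert round(saturation_equivalents, 2) == 5.37
```

In other words, once roughly five and a half processors' worth of demand is offered, additional processors can only redistribute, not increase, total throughput.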

This configuration was simulated for the three cases of linear, 4:4:2 and 2:4:4 rotating priority schemes for 1 through 16 processors. The purpose of these results was only a comparison of the three schemes, so the L1 latches (Section 1.1.3), which merely involve extra delay in the simulation, were not implemented. Initial results indicate that their inclusion does not affect the relative behaviour of the priority systems.

The performance measures plotted in Figs. 3 to 5 are defined as

Nq = N·Et/Es    (8)

Fig. 3 Simulation of a(i) := b(j) code fragment: performance

Fig. 4 Simulation of a(i) := b(j) code fragment: interprocessor interference



where

Nq is the number of equivalent processors (ideally N)
N is the number of real processors
Et is the fastest possible average execution time
Es is the average execution time observed in simulation.

In Fig. 3, Es is averaged over all processors, whereas, in Fig. 5, Es is the average for the slowest processor, thus giving a lower bound for the performance of the system (Section 2). The average number of machine wait states incurred on each access is also plotted (Fig. 4) as a measure of loss, and thus interprocessor interference. Note that there is a minimum of two wait states owing to the 350 ns effective access time of the system (memory plus four bus cycles).

Fig. 5 Simulation of a(i) := b(j) code fragment: worst-case system performance

As one might expect, the curves are similar in terms of total system performance and interference. The 4:4:2 scheme is marginally the best in this sense, whereas the unimplementable linear scheme is the worst. However, it is in the comparison of worst-case performance that the differences are most clear and, on this basis, the 2:4:4 scheme is the overall winner. Note that processors only begin to run at different speeds near saturation when seven are executing, and that the effect is worst at around 13 processors, the middle of the saturation region.

Several factors contribute to this observed inhomogeneous behaviour. If the bus is not physically full of devices, then their relative priorities are not exactly equal on average, and in this respect 4:4:2 and 2:4:4 are better than the linear scheme. This does not, however, explain the difference between 2:4:4 and 4:4:2, since their degree of fairness is the same. The better worst-case behaviour of 2:4:4 can be accounted for by the fact that it presents the requesting device with a more pseudorandom pattern of priorities, making it less likely for a processor to fall into step with the priority cycle. Processors that obtain an unfair advantage will make others run more slowly when the bus is saturated.

Initial investigation into a 2:2:2:2:2 arbiter structure, which can be realised merely by programming the PALs differently, indicates that processor-priority synchronisation effects are the most significant. Although this variant always gives equal relative priorities on average, it does not seem to greatly improve on the behaviour of the 2:4:4 scheme.

The performance shown compares favourably with that of similar shared-bus machines. In Reference 4, the reported performance is 5.5 equivalent for seven real processors, using a 67 ns bus cycle. This is with respect to the specially designed processors used, when operating in the multiprocessor under conditions of no interference. Using general-purpose M68000s, the loss (compared to ideal) in the absence of interference is 31%; hence, if we calculate equivalent processors as in Reference 4, we obtain 6.3 equivalent for seven real (worst-case, 2:4:4 priority). It should be noted that the examples shown are designed to present a very heavy load. In practice, at least some of the code would be stored local to the processor, reducing its bus demands and making them less regular (this improves worst-case behaviour).

3 Bus arbiter implementation

The arbiter is in the form of three parallel sections in a 1-step pipeline; this gives the necessary speed of one arbitration per clock cycle while overlapping arbitration and data transfer. The allocation of cycle n is decided during cycle n − 1. As stated in Section 1, requests are synchronised, so that the arbiter has almost one complete bus cycle to settle.

3.1 Description
Fig. 6 shows the full circuit, which is naturally partitioned into three levels and a priority-state generator. The level-1 'ANY4' signals are generated by NOR gates (Fig. 6A) and the level-2 signals are derived from these using NAND gates (Fig. 6B). This is a necessary compromise in view of the devices currently available and requires the PAL2 device to have inverted inputs; fortunately this variation is trivial to program. PALs were chosen to reduce the package count and meet speed constraints. A Fortran program called PALASM [8] transforms sum-of-products equations into object code for the PROM programmer and hence makes the use of PALs straightforward. Note that eqns. 4 only need to be expanded out to put them in the correct form for the PALASM program (see Appendix 7). Level 3 (Fig. 6C) is sufficiently simple to implement directly in Schottky SSI using fast AND-OR-INVERT logic, two copies of which are required to satisfy fan-out limitations.

The critical path will normally lie in level 2, since PALdelay is usually dominant. However, a typical Texas advancedlow-power Schottky PAL and worst case AND-OR-INVERTlogic can move it to level 3. From Section 1.1.3 and Fig. IB,it can be seen that the arbiter network (PALs and gates) hasthe following time to settle [10] :

Ts = bus clock period - latch set-up - gate delay - latch propagation

   = 39 ns (typical)

   = 35.5 ns (worst)

whereas the arbiter network delay [8, 10] is:

44 ns (worst, standard PAL)
31 ns (typical, standard PAL)
34 ns (worst, 'A' series PAL)
21 ns (typical, 'A' series PAL)
26 ns (worst, ALS PAL)
18 ns (typical, ALS PAL)
21 ns (typical ALS PAL, worst AND-OR-INVERT)

Hence Monolithic Memories 'A' series and Texas ALS PALs are always safe, while worst-case standard PALs will not work.
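These figures can be checked mechanically. The sketch below simply restates the delays quoted above and compares each device family against the available settling time Ts (the dictionary layout and `is_safe` name are introduced here for illustration):

```python
# Available settling time Ts from Section 3.1, in nanoseconds.
ts = {"typical": 39.0, "worst": 35.5}

# Arbiter network delays quoted in the text, in nanoseconds.
network_delay = {
    ("standard PAL", "worst"): 44.0,
    ("standard PAL", "typical"): 31.0,
    ("'A' series PAL", "worst"): 34.0,
    ("'A' series PAL", "typical"): 21.0,
    ("ALS PAL", "worst"): 26.0,
    ("ALS PAL", "typical"): 18.0,
}

def is_safe(family, case):
    """A family is safe in a given case if its network settles within Ts."""
    return network_delay[(family, case)] <= ts[case]
```

Running the check confirms the conclusion: `is_safe("standard PAL", "worst")` is false (44 ns > 35.5 ns), while the 'A' series and ALS families pass in both the typical and worst cases.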

The state generator (Fig. 6D) consists of a simple binary counter clocked out of phase and resynchronised by latching to minimise skew. The total package count for the arbiter is 44, the largest single contribution coming from the pipeline latches, because denser latches are significantly slower. This relatively large count is offset by the large number of devices controlled, giving an average of 1.34 arbiter packages per bus device. Overall, the complexity is sufficiently low to indicate the feasibility of a bipolar VLSI solution, especially in the light of the regularity of the gating [11]. Since a single line could be used for both request and acknowledge (via a simple protocol), the device would need only 35 pins and thus presents no pinout problems.

Fig. 6A Level-1 arbitration
Ik, request input k; Ok, acknowledge output k; 0 ≤ k ≤ 31; 0 ≤ n ≤ 3; m = int(n/2)

Fig. 6B Level-2 arbitration

IEE PROC., Vol. 130, Pt. E, No. 2, MARCH 1983

3.2 Testing

Fig. 7 shows a minimal 4-output test circuit, which contains all of the critical signal paths present in the full arbiter. The total number of priority and input states is 1024, all of which were simulated by a trivial Fortran program and which are supplied by the synchronous counter chain shown. Consistent with the expected usage of the design, the priority state is made to change much faster than the simulated inputs I0 through I5. Note that, to inject some extra delay, the counter chain uses TTL logic and that the resynchronising latches were not implemented.

The circuit was tested by logging 256 states at a time with a Hewlett-Packard HP1615A logic analyser, and the clock input was driven by an HP8012B pulse generator. Correct operation was achieved up to 21.5 MHz; at higher frequencies, incorrect output sequences began to appear. The use of normal TTL in the counter chain gives a typical settling time of 28.5 ns at 21.5 MHz; thus this result for standard PALs is consistent with the calculation for their typical delay in Section 3.1 (31 ns).
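The exhaustive-simulation strategy is easy to reproduce in software. The sketch below enumerates every priority state and request pattern for a 4-input rotating-priority model and checks the arbiter's invariants; it is a smaller analogue of the 1024-state Fortran test above, with all names introduced here.

```python
from itertools import product

def rotate_pick(active, k):
    """Rotating priority: grant the first active request scanning from
    position k, wrapping around; return None if nothing is active."""
    n = len(active)
    for offset in range(n):
        i = (k + offset) % n
        if active[i]:
            return i
    return None

def exhaustive_check(n=4):
    """Enumerate all priority states and request patterns and check the
    invariants: a grant only goes to a real requester, the top-priority
    requester always wins, and there are no spurious grants."""
    for k, reqs in product(range(n), product((0, 1), repeat=n)):
        g = rotate_pick(list(reqs), k)
        if any(reqs):
            assert g is not None and reqs[g] == 1  # grant a real requester
            assert reqs[k] == 0 or g == k          # top priority always wins
        else:
            assert g is None                       # no spurious grants
    return True
```

`exhaustive_check()` walks all 64 (priority, request) combinations for the 4-input model; the same enumeration idea scales to the 1024 states of the actual test circuit.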

4 Conclusions

The fast modular arbitration scheme described was designed to fit the tight constraints of a particular multiprocessor architecture. It can be applied to any form of shared bus, although pipelining arbitration is difficult in an asynchronous design. The scheme is efficient, both in terms of hardware and

Fig. 6C Level-3 arbitration

Fig. 6D Priority-state generator

of bus bandwidth, and makes the shared bus a much more attractive method of interconnection.

5 Acknowledgments

The author would like to thank O.J. Davies for his help and guidance in the preparation of this paper and in the continuation of the project from which it comes. Thanks also to colleagues M. Islam, F. Lombardi and V.O. Roda for stimulating discussions.

6 References

1 BOWER, B.A., BUHR, R.J.A., and CAVERS, J.K.: 'Why multiple microprocessors?'. Proceedings of international symposium on mini and micro computers, Montreal, 1977, pp. 283-287

2 HOENER, S., and ROEHDER, W.: 'Efficiency of a multi-microprocessor system with time-shared busses'. Proceedings of 3rd Euromicro symposium, Amsterdam, 1977, pp. 35-42

3 HENRY, J.S., LEWIS, G.R., and McCUNE, B.P.: 'BTI 8000 - homogenous general purpose multiprocessing', Proc. AFIPS Conf., 1979, 48, pp. 513-528

Fig. 7 Minimal test circuit
a Arbiter; b State generator
U1, U2: PAL16L8; U3, U4: 74S260; U6: 74S00; U5, U8, U9, U15: 74S20; U10, U11: 74S74; U12, U13, U14: 74161; U7: 74S51

4 EDWARDS, D.G.B., LAVINGTON, S.H., and THOMAS, G.: 'The MU5 multicomputer communication system', IEEE Trans., 1977, C-26, pp. 19-28

5 ANTONSSON, D., DANIELSON, P.E., MALMBERG, B., MARTENSSON, A., and OHLSSON, T.: 'A 2 megabit random access memory with 512 megabit/s data rate', Digital Processes, 1979, 5, pp. 141-149

6 SWAN, R.J., FULLER, S.H., and SIEWIOREK, D.P.: 'Cm* - a modular multi-processor', Proc. AFIPS Conf., 1977, 46, pp. 637-644

7 DOERTOK, O., and FLEISCHHAMMER, W.: 'The anomalous behaviour of flip flops in synchronizer circuits', IEEE Trans., 1979, C-28, pp. 273-276

8 BIRKNER, J.: 'PAL programmable array logic handbook' (Monolithic Memories, 1978, 1st edn)

9 FAERBER, G.: 'A decentralized fair bus-arbiter', Microprocess. & Microprogram., 1981, 7, pp. 32-36

10 'The TTL data book for design engineers' (Texas Instruments, 1979, 4th edn)

11 CONWAY, L., and MEAD, C.: 'Introduction to VLSI systems' (Addison-Wesley, 1980)

7 Appendix: PAL programming

The general form of the logic equations which can be implemented is either a sum of products or an inverted sum of products. For the PAL16L8 under consideration, these are:

/Oi = P1 + P2 + . . . + P8   (9)

Hence each output is the inverted sum of up to eight products, each of the form

Pj = T1 * T2 * . . . * Tn   (10)

where each term Tk is an input Ii or its complement /Ii. Eqns. 4 only need to have the brackets expanded to be exactly in this form.

Once the equations are in the correct form, a data file for PALASM must be prepared; an example of which, for the PAL1 device, is shown in Fig. 8. Line 1 must contain the PAL part number starting in column 1; the rest is ignored. Similarly, line 2 must contain a pattern number, while line 3 contains a

PAL16L8
PAT0003 AUTHOR A. KOVALESKI
2*4 LINE ROTATING PRIORITY ACTIVE HIGH INPUTS AND OUTPUTS
IA0 IA1 IA2 IA3 IB0 IB1 IB2 IB3 K0 GND
K1 OB3 OB2 OB1 OB0 OA3 OA2 OA1 OA0 VCC
IF (VCC) /OA0 = . . .
. . .
IF (VCC) /OB3 = . . .

Fig. 8 PALASM data file


title and line 4 is ignored. A list of symbols for the 20 pins must begin on line 5, in the order pin 1 through 20, in free format. This is then followed by the equations in free format, where

* denotes logical AND

+ denotes logical OR

/ (prefix) denotes logical NOT

and IF (product-term) controls the tristate of an output; IF (VCC) means always active, never floating.
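These conventions are enough to evaluate an equation's right-hand side directly. The following minimal evaluator is a sketch introduced here (`eval_sop` is not part of PALASM); it treats an expression as a sum of products over named signals, using '*' for AND, '+' for OR and a '/' prefix for NOT:

```python
def eval_sop(expr, values):
    """Evaluate a PALASM-style sum-of-products expression.
    expr: a string such as "/K0*K1*IA1 + K0*IA2".
    values: dict mapping signal names to 0 or 1.
    Returns 1 if any product term is true, else 0."""
    def literal(tok):
        tok = tok.strip()
        if tok.startswith("/"):           # '/' prefix: logical NOT
            return 1 - values[tok[1:]]
        return values[tok]
    return int(any(all(literal(t) for t in term.split("*"))  # '*': AND
                   for term in expr.split("+")))             # '+': OR
```

For example, `eval_sop("/K0*K1*IA1 + K0*IA2", {"K0": 0, "K1": 1, "IA1": 1, "IA2": 0})` evaluates the first product true and returns 1.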

The final full stop terminates the file, although descriptive information may follow it and will be ignored by the program.

When run, PALASM produces a hexadecimal-format file suitable for a transfer program called DTIO, which sends it to a DATA I/O System 19 PROM programmer.

Alex Kovaleski is a Ph.D. student in the Department of Electronic and Electrical Engineering of University College, London. He received his B.Sc. degree in physics from King's College, London in 1979. His research interests include microprocessor hardware and software and multiprocessor architectures. He is an Associate Member of the IEE.
