ULTRASPARC-III: Designing Third-Generation 64-Bit Performance

Tim Horel and Gary Lauterbach
Sun Microsystems

Every decision has at least one associated trade-off. System architects ultimately arrived at this 64-bit processor design after a challenging series of decisions and trade-offs.
The UltraSPARC-III is the third generation of Sun Microsystems' most powerful microprocessors, which are at the heart of Sun's computer systems. These systems, ranging from desktop workstations to large, mission-critical servers, require the highest performance that the UltraSPARC line has to offer. The newest design gives vendors the scalability to build systems of 1,000+ UltraSPARC processors. Furthermore, the design ensures compatibility with all existing SPARC applications and the Solaris operating system.

The UltraSPARC-III design extends Sun's SPARC Version 9 architecture, a 64-bit extension to the original 32-bit SPARC architecture that traces its roots to the Berkeley RISC-I processor [1]. Table 1 lists salient microprocessor pipeline and physical attributes. The UltraSPARC-III design target is a 600-MHz, 70-watt, 19-mm die to be built in 0.25-micron CMOS with six metal layers for signals, clocks, and power.

Architecture design goals

In defining the newest microprocessor's architecture, we began with a set of four high-level goals for the systems that would use the UltraSPARC-III processor. These goals were shaped by team members from marketing, engineering, management, and operations in Sun's processor and system groups.

Compatibility

With more than 10,000 third-party applications available for SPARC processors, compatibility—with both application and operating system—is an essential goal and primary feature of any new SPARC processor. Previous SPARC generations required a corresponding, new operating system release to accommodate changes in the privileged interface registers, which are visible to the operating system. This in turn required all third-party applications to be qualified on the new operating system before they could run on the new processor. Maintaining the same privileged register interface in all generations eliminated the delay inherent in releasing a new operating system.

Part of our compatibility goal included increasing application program performance—without having to recompile the application—by more than 90%. Furthermore, this benefit had to apply to all applications, not just those that might be a good match for the new architecture. This goal demanded a sizable microarchitecture performance increase while maintaining the programmer-visible characteristics (such as number of functional units and latencies) from previous generations of pipelines.

Performance

To design high performance into the UltraSPARC-III, we believed we needed—and have—a unique approach. Recent research shows that the trend for system architects is to design ways of extracting more instruction-level parallelism (ILP) from programs. In considering many aggressive ILP extraction techniques for the UltraSPARC-III, we discovered that they share a common undesirable characteristic—the speedup varies greatly across a set of programs.

Relying on ILP techniques for most of the processor's performance increase would not deliver the desired performance boost. ILP speedups vary greatly from program to program because many programs or program sections use algorithms that are serially data dependent. Figure 1 shows an example of a serially data-dependent algorithm.

In a high-performance processor such as the UltraSPARC-III, several iterations of the loop can concurrently execute. Figure 1 shows three iterations overlapped in time. The time it takes these three iterations to execute depends on the latency of the load instruction. If the load executes with a single-cycle latency, then the maximum overlap occurs, and the processor can execute three instructions each cycle. As the load latency increases, the amount of ILP overlap decreases, as Table 2 shows.

Many ILP studies have assumed latencies of 1 for all instructions, which can cause misleading results. In a nonacademic machine, the load instruction latency is not a constant but depends on the memory system's cache hit rates, resulting in a fractional average latency. The connection between ILP (or achieved ILP, commonly referred to as instructions per cycle—IPC, or 1/CPI) and operation latency makes these units cumbersome to analyze for determining processor performance.

One design consideration was an average latency measurement of a data dependency chain (ending at a branch instruction) for the SPEC95 integer suite. The measurement was revealing: The average dependent chain in SPEC95 consisted of a serial-data-dependent chain with one and a half arithmetic or logical operations (and, on average, half of a load instruction), ending with a branch instruction. A simplified view is that SPEC95 integer code is dominated by load-test-branch data dependency chains.

We realized that keeping the execution latency of these short dependency chains low would significantly affect the UltraSPARC-III's performance. Execution latency is another way to view the clock rate's profound influence on performance. As the clock rate scales, all the bandwidths (in operations per unit time) and latencies (in time per operation) of a processor scale proportionately.


Table 1. UltraSPARC-III pipeline and physical data.

Pipeline feature        Parameter
Instruction issue       4 integer, 2 floating-point, 2 graphics
Level-one (L1) caches   Data: 64-Kbyte, 4-way
                        Instruction: 32-Kbyte, 4-way
                        Prefetch: 2-Kbyte, 4-way
                        Write: 2-Kbyte, 4-way
Level-two (L2) cache    Unified (data and instruction): 4- and 8-Mbyte, 1-way;
                        on-chip tags, off-chip data

Physical feature        Parameter
Process                 0.25-micron CMOS, 6 metal layers
Clock                   600+ MHz
Die size                360 mm²
Power                   70 watts @ 1.8 volts
Transistor count        RAM: 12 million; logic: 4 million
Package                 1,200-pin LGA

Table 2. Load latency increases as ILP decreases.

Instruction-level parallelism    Load latency
(instructions per cycle)         (cycles)
0.75                             4
1.00                             3
1.50                             2
3.00                             1
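These figures follow directly from the loop in Figure 1: each iteration contains three instructions (ld, tst, bne), and successive loads are serially dependent, so the steady-state ILP is three instructions divided by the load latency in cycles: 3/1 = 3.00, 3/2 = 1.50, 3/3 = 1.00, and 3/4 = 0.75.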

Figure 1. A serially data-dependent algorithm example: a simple search for the end of a linked-list data structure. Three iterations of the loop overlap in time:

    Loop:  ld   [r0], r0
           tst  r0
           bne  Loop
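The same chain rendered in C (an illustrative sketch, not from the original article; the function name is invented):

    #include <stddef.h>

    struct node {
        struct node *next;
    };

    /* Walk a singly linked list to its end. Each iteration's load
     * (p = p->next) depends on the previous iteration's result, so
     * the loop can never run faster than one load latency per
     * iteration, no matter how wide the processor is. */
    void find_end(struct node *p)
    {
        while (p != NULL)
            p = p->next;   /* the ld/tst/bne chain of Figure 1 */
    }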


Bandwidth (ILP) alone cannot provide a speedup for all programs; it's only by scaling both bandwidth and latency that performance can be boosted for all programs.

Our focus thus became to scale up the bandwidths while simultaneously reducing latencies. This goal should not be viewed simply as raising the clock rate. It's possible to raise the clock rate simply by pipelining the stages more deeply, but at the expense of increased latencies. Each time we insert a pipeline stage, we incur an additional increment of clocking overhead (flop delay, clock skew, clock jitter). This forces less actual work to be done per cycle, thus leading to increased latency (in absolute nanoseconds). Our goal was to push up the clock rate while at the same time scaling down the execution latencies (in absolute nanoseconds).

Scalability

The UltraSPARC-III is the newest generation of processors that will be based on the design we describe in this article. We designed this processor so that as process technology evolves, it can realize the full potential of future semiconductor processes. Scalability was therefore of major importance. As an example, research at Sun Labs indicates that propagation delay in wiring will pose increasing problems as process geometries decrease [2]. We thus focused on eliminating as many long wires as possible in the architecture. Any remaining long wires are on paths that allowed cycles to be added with minimum performance impact.

Scalability also required designing the on-chip memory system and the bus interface to allow multiprocessor systems to be built with from two to 1,000 UltraSPARC-III processors.

Reliability

A large number of UltraSPARC-III processors will be used in systems such as transaction servers, file servers, and compute servers. These mission-critical systems require a high level of reliability, availability, and serviceability (RAS) to maximize system uptime and minimize the time to repair when a failure does occur.

Our baseline requirement was to detect as many error conditions as possible. In addition, we added three more guidelines to improve system RAS:

• Don't allow bad data to propagate silently. For example, when the processor sourcing data on a copy-back operation detects an uncorrectable cache ECC error, it poisons the outgoing data with a unique, uncorrectable ECC syndrome. Any other processor in a multiprocessor system will thus get an error if it touches the data. The sourcing processor also takes a trap when the copy-back error is detected, to fulfill the next guideline.

• Identify the source of the error. To minimize downtime of large multiprocessor systems, the failing replaceable unit must be correctly identified so that a field technician can quickly swap it out. This requires the error's source to be correctly identified.

• Detect errors as soon as possible. If errors are not quickly detected, identifying the error's true source can become difficult at best.

Major architectural units

The processor's microarchitecture design has six major functional units that perform relatively independently. The units communicate requests and results among themselves through well-defined interface protocols, as Figure 2 shows.

Instruction issue unit

This unit feeds the execution pipelines with instructions. It independently predicts the control flow through a program and fetches the predicted path from the memory system. Fetched instructions are staged in a queue before forwarding to the two execution units: integer and floating point. The IIU includes a 32-Kbyte, four-way associative instruction cache, the instruction address translation buffer, and a 16K-entry branch predictor.

Integer execute unit

This unit executes all integer data type instructions: loads, stores, arithmetics, logicals, shifts, and branches. Four independent data paths enable up to four integer instructions to be executed per cycle. The allowable per-cycle integer instruction mix is as follows:

• 2 from (arithmetic, logical, shift), A0/A1 pipelines
• 1 from (load, store), MS pipeline
• 1 from (branch), BR pipeline


The load/store pipeline also executes floating-point data type memory instructions. A second floating-point data type load instruction can be issued to either the A0 or A1 pipeline. We describe this instruction in more detail in the prefetch cache discussion, later in the on-chip memory section.

Data cache unit (on-chip memory system)

The data cache unit comprises the level-one (L1) on-chip cache memories and the data address translation buffer, as Figure 3 shows. There are three first-level, on-chip data caches: data—64-Kbyte, four-way associative, 32-byte line; prefetch—2-Kbyte, four-way associative, 64-byte line; and write—2-Kbyte, four-way associative, 64-byte line.


Figure 3. Data cache unit block diagram. The caches' latencies and bandwidths:

Cache         Latency      Bandwidth
L1 data       2 cycles      9.6 Gbytes/s
L1 prefetch   3 cycles     18.4 Gbytes/s
L1 write      1 cycle      13.6 Gbytes/s
L2 external   12 cycles     6.4 Gbytes/s

Figure 2. Communication paths between the UltraSPARC-III's six major functional units: the instruction issue unit (IIU), integer execution unit (IEU), floating-point unit (FPU), data cache unit (DCU), external memory unit (EMU), and system interface unit (SIU).


Floating-point unit

This unit contains the data paths and control logic to execute all floating-point and partitioned fixed-point data type instructions. Three data paths concurrently execute floating-point or graphics (partitioned fixed-point) instructions, one per cycle from each of the following classes:

• Divide/multiply (single or double precision, or partitioned),
• Add/subtract/compare (single or double precision, or partitioned), and
• An independent division data path, which lets a nonpipelined divide proceed concurrently with the fully pipelined multiply and add data paths.

External memory unit

This unit controls the two off-chip memory structures: the level-two (L2) data cache built with off-chip synchronous RAMs (SRAMs), and the main memory system built with off-chip synchronous DRAMs (SDRAMs).

The L2 cache controller includes a 90-Kbyte on-chip tag RAM to support L2 cache sizes up to 8 Mbytes. The main memory controller can support up to four banks of SDRAM memory totaling 4 Gbytes of storage.

System interface unit

This unit handles external communication to other processors, memory systems, and I/O devices. The unit can handle up to 15 outstanding transactions to external devices, with support for full out-of-order data delivery on each transaction.

Instruction pipeline

To meet our clock rate and performance goals, we concluded that we needed a deep pipeline. The UltraSPARC-III 14-stage pipeline, as Figure 4 shows, has more stages than any previous UltraSPARC pipeline. The extra pipeline stages must be added in pipeline paths that are infrequently used—for example, trap generation.

Figure 4. The UltraSPARC-III microprocessor instruction pipeline. The 14 stages are labeled A, P, F, B, I, J, R, E, C, M, W, X, T, and D. (WARF = working and architectural register file; FP = floating point; ASU = arithmetic special unit.)


Each pipeline stage performs part of the work necessary to execute an instruction, as the box, "How pipeline stages work in the UltraSPARC-III," explains. The instruction issue unit occupies the A through J stages of the pipeline, and the integer execution unit accounts for the R through D stages. The data cache unit occupies the E, C, M, and W stages of the pipe in parallel with integer execution unit stages. The floating-point unit is shown as a side pipeline that parallels the E through D stages of the integer pipeline. The other units of the machine (system interface unit and external memory unit) have internal pipelines but are not considered part of the core processor pipe.

We determined the processor's pipeline depth early in the design process by analyzing several basic paths. We selected the integer execution path to determine the machine cycle time so we would have minimum latency for the basic integer operation. Using an aggressive dynamic adder for this stage resulted in our setting the number of logic gate levels per stage to approximately eight—the exact number depends on circuit style.

Early analysis also showed that with eight gate delays (using a three-input NAND with a fan-out of three as a gate delay) per stage, the overhead due to synchronous clocking (from flip-flop delay, clock skew, jitter, and so on) would consume about 30% of the cycle. If we tried to pipeline the integer execution over two cycles (commonly called superpipelining), the second 30% clocking overhead would significantly increase latency. As a result, performance would decline in some applications. The on-chip cache memories are pipelined across two stages, but they don't suffer the additional clock overhead because we used a wave-pipeline circuit design.

Another known critical path from previous SPARC designs is the global pipe stall signal. This signal freezes the flip-flops when an unexpected event, such as a data cache miss, occurs. This freeze signal is dominated by wire delay that we knew would have technology scaling problems, so we decided to eliminate it completely by using a nonstalling pipeline. Since the pipeline state couldn't be frozen, we had to use a repair mechanism that could restore the state when an unexpected event occurs. It's handled like a trap: The pipeline is allowed to drain, and its state is restored by refetching instructions that were in the pipeline, starting at the A stage.

One concern with a deep pipeline is the cost of branch misprediction. When a branch is mispredicted, instructions must be refetched starting at the A stage, incurring a penalty of eight cycles (A through E stages). With recent improvements in branch prediction, a processor incurs this penalty much less frequently, allowing the pipeline to be longer with only a small performance cost. In addition, we designed a small amount of alternate-path buffering into the I stage (the miss queue). If a predicted-taken branch turns out to be mispredicted (actually not taken), a few instructions are thus immediately available to start in the I stage. This effectively halves the branch misprediction penalty.

Pipeline stages after the M stage impact performance whenever the pipeline must be drained. We overlapped the new fetch (for a trap target or a refetch) with the back of the pipeline draining. By doing so, we could add stages to the back of the pipe, as long as we could guarantee that the pipe results were drained before new instructions reached the R stage. To ease the implementation of precise exceptions, we added back-end pipe stages up to this limit.


How pipeline stages work in the UltraSPARC-III

Stage  Function
A      Generate instruction fetch addresses; generate predecoded instruction bits on cache fill
P      Fetch first cycle of instructions from cache; access first cycle of branch prediction
F      Fetch second cycle of instructions from cache; access second cycle of branch prediction; translate virtual to physical address
B      Calculate branch target addresses; decode first cycle of instructions
I      Decode second cycle of instructions; enqueue instructions into the queue
J      Steer instructions to execution units
R      Read integer register file operands; check operand dependencies
E      Execute integer arithmetic, logical, and shift instructions; first cycle of data cache access; read, and check dependency of, floating-point register file
C      Access second cycle of data cache, and forward load data for word and doubleword loads; execute first cycle of floating-point instructions
M      Load data alignment for half-word and byte loads; execute second cycle of floating-point instructions
W      Write speculative integer register file; execute third cycle of floating-point instructions
X      Extend integer pipeline for precise floating-point traps; execute fourth cycle of floating-point instructions
T      Report traps
D      Write architectural register file


We pushed back the floating-point execution pipeline by one cycle relative to the integer execution pipe. This allows the floating-point unit extra time for wire delays. We had to keep the machine's latency-sensitive integer part physically small to minimize wire delays. Moving the floating-point unit away from the integer core was a major step toward achieving this goal.

Instruction issue unit

Experience with previous UltraSPARC pipelines showed our design teams that many critical-timing paths occurred in the instruction issue unit. Consequently, we knew we had to pay particular attention to designing this part of the processor. Our decision to keep the UltraSPARC-III a static speculation machine compatible with the previous pipelines paid off in the IIU design. Dynamic speculation machines require very high fetch bandwidths to fill an instruction window and find instruction-level parallelism. In a static speculation machine, the compiler can make the speculated path sequential, resulting in fewer requirements on the instruction fetch unit. We used this static speculation advantage to simplify the fetch unit and minimize the number of critical timing paths. Figure 5 illustrates the IIU's different blocks.

The pipeline's A stage corresponds to the address lines entering the instruction cache. All fetch address generation and selection occurs in this pipe stage. Also at the A stage, a small, 32-byte buffer supports sequential prefetching into the instruction cache. When the instruction cache misses, the cache line requires 32 bytes. But instead of requesting only the 32 bytes needed for the cache, the processor issues a 64-byte request. The first 32 bytes fill the cache line; the second 32 bytes are stored in the buffer. The buffer can then be used to fill the cache if the next sequential cache line also misses.
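In pseudo-C, the buffer's fill policy might look like the following sketch (the real logic is hardware, not software; all structure and function names here are invented):

    #include <stdint.h>
    #include <string.h>
    #include <stdbool.h>

    #define LINE_BYTES 32

    /* Assumed hooks standing in for the real cache and memory ports. */
    extern void icache_fill(uint64_t line_addr, const uint8_t *bytes);
    extern void memory_fetch(uint64_t addr, uint8_t *dst, unsigned nbytes);

    /* The A-stage sequential prefetch buffer: one spare 32-byte line. */
    struct seq_buf {
        bool     valid;
        uint64_t line_addr;
        uint8_t  bytes[LINE_BYTES];
    };

    void handle_icache_miss(struct seq_buf *b, uint64_t miss_addr)
    {
        if (b->valid && b->line_addr == miss_addr) {
            /* The next sequential line missed too: fill from the buffer. */
            icache_fill(miss_addr, b->bytes);
            b->valid = false;
            return;
        }
        /* Fetch 64 bytes: the first 32 fill the missing line, and the
           second 32 are kept in case the next sequential line misses. */
        uint8_t block[2 * LINE_BYTES];
        memory_fetch(miss_addr, block, sizeof block);
        icache_fill(miss_addr, block);
        b->line_addr = miss_addr + LINE_BYTES;
        memcpy(b->bytes, block + LINE_BYTES, LINE_BYTES);
        b->valid = true;
    }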

We distributed the instruction cache access over two cycles (the P and F pipeline stages) by using a wave-pipelined SRAM design. In this design, the cache is pipelined without the use of latches or flip-flops. Careful circuit design ensures that the data waves in each cycle do not overtake each other [3]. In parallel with the cache access, this design also allows branch predictor and instruction address translation buffer access. By the time the instructions are available from the cache in the B stage, we also have the physical address from the translator and a prediction for any branch that was fetched. The processor uses all this information in the B stage to determine whether to follow a sequential- or taken-branch path. The processor also determines whether the instruction cache access was a hit or a miss. If the processor predicts a taken branch in the B stage, it sends the target address for that branch back to the A stage to redirect the fetch stream.

Waiting until the B stage to redirect the fetch stream lets us use a large, accurate branch predictor. We minimized the wait penalty for branch targets through compiler static speculation and the instruction buffering queues.

The branch predictor uses a Gshare algorithm [4] with 16K 2-bit saturating up/down counters. Since the predictor is large, it needed to be pipelined across two stages. In the original Gshare scheme, this would require the predictor to be indexed with an old or inaccurate copy of the program counter (PC).

We modified the scheme by offsetting the history bits such that the three low-order index bits into the predictor use PC information only. Each time the predictor is accessed, eight counters are read out. Later, one of them is selected (using the three low-order PC bits) in the pipeline's B stage, after the exact position of the first branch in the fetch group is known.

Figure 5. Instruction issue unit (IIU) block diagram: instruction translation look-aside buffer, 32-Kbyte instruction cache, branch predictor, return address stack, miss queue, and instruction queue.


Simulations showed that not XORing the global-history bits with the low-order PC bits does not affect the branch predictor's performance. The three-cycle loop through the A, P, and F stages for taken branches lets us keep the global-history register at the predictor's input in an exact state. The register is exact because there can be only one taken branch every three cycles.
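As an illustration of this modified indexing (a sketch under the assumptions above: 16K counters, eight read per access; all names are invented):

    #include <stdint.h>

    #define PRED_ENTRIES (16 * 1024)        /* 16K 2-bit counters */

    static uint8_t counters[PRED_ENTRIES];  /* each holds 0..3 */

    /* Classic Gshare XORs global history into every index bit. Here
     * the history is offset by three bits, so the three low-order
     * index bits come from the PC alone: a row of eight adjacent
     * counters can be read early (P/F stages), and the one-of-eight
     * select is done later in the B stage. */
    uint32_t predictor_row(uint32_t pc, uint32_t global_history)
    {
        uint32_t idx = ((pc >> 2) ^ (global_history << 3))
                       & (PRED_ENTRIES - 1);
        return idx & ~7u;                   /* base of the 8-counter row */
    }

    int predict_taken(uint32_t row_base, uint32_t branch_pc)
    {
        uint32_t which = (branch_pc >> 2) & 7;  /* low-order PC bits  */
        return counters[row_base + which] >= 2; /* MSB of the counter */
    }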

We designed two instruction buffering queues into the UltraSPARC-III: the instruction queue and the miss queue. The 20-entry instruction queue decouples the fetch unit from the execution units, allowing each to proceed at its own rate. The fetch unit is allowed to work ahead, predicting the execution path and stuffing instructions into the instruction queue until it's full. When the fetch unit encounters a taken-branch delay, we lose two fetch cycles to fill the instruction queue. Our simulations show, however, that there are usually enough instructions already buffered in the instruction queue to occupy the execution units. This two-cycle delay also gives us the opportunity to buffer the sequential instructions that have already been accessed into the four-entry miss queue. If we then find that we mispredicted the taken branch, the instructions from the miss queue are immediately available to send to the execution units.

The last two stages of the instruction issue unit decode the instruction type and steer each instruction to the appropriate execution unit. These two functions must be done in separate pipeline stages to achieve the cycle time goal.

Integer execute unit

We guarantee the minimum logical latency for the most frequent instructions by setting the processor cycle time with the integer execute stage. However, the amount of work we try to fit in one execute-stage cycle varies over a wide range. We used several techniques to minimize the cycle time of the E stage. We applied the most aggressive circuit techniques available to design the E stage—the entire integer data path uses dynamic precharge circuits. We had to carefully design the physical data path to minimize wiring lengths; wire delay causes more than 25% of the delay in this stage.

This level of design cannot be applied to the entire processor. It was vital that the microarchitecture clearly showed where this sort of design investment would pay off in performance.

We extended the future file method to help achieve a short cycle time [5]. The working and architectural register file (WARF) let us remove the result bypass buses from most of the integer execution pipeline stages. Without bypass buses, we could shorten the integer data path and narrow the bypass multiplexing. Both contribute to a short cycle time.

The WARF can be regarded as two separate register files. The processor accesses the working register file in the pipeline's R stage and supplies integer operands to the execution unit. The file is also written with integer results as soon as they become available from the execution pipeline. Most integer operations complete in one cycle, with results immediately written into the working register file in the pipeline's C stage. If an exceptional event occurs, the immediately written results must be undone. Undoing results is accomplished with a broadside copy of all integer registers from the architectural register file back into the working register file. By placing the architectural register file at the end of the pipe, we can ensure that we do not commit results into it until we have resolved all exceptional conditions. Copying the architectural register file back into the working register file gives us a simple, fast way to repair the pipeline state when exceptions do occur.

The state copying of the WARF also offers a simple mechanism for implementing the SPARC architecture's register windows. The architectural register file contains a full eight windows' worth of integer registers. A broadside copy of the new window into the working register file is made when a window must be changed.
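A minimal software model of the two-file arrangement (register windows omitted; all names are invented, and the real mechanism is of course hardware):

    #include <stdint.h>
    #include <string.h>

    #define NREGS 32

    /* Working file: read in the R stage, written speculatively as
     * results complete. Architectural file: written only in the D
     * stage, after all exceptional conditions are resolved. */
    struct warf {
        uint64_t working[NREGS];
        uint64_t architectural[NREGS];
    };

    /* D stage: commit a drained, exception-free result. */
    void warf_commit(struct warf *w, unsigned reg, uint64_t value)
    {
        w->architectural[reg] = value;
    }

    /* Exception repair: broadside copy of the architectural state
     * back into the working file, undoing speculative writes. */
    void warf_repair(struct warf *w)
    {
        memcpy(w->working, w->architectural, sizeof w->working);
    }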

We moved the data path for the least frequently executed integer instructions to a separate location to further unburden the core integer execution pipeline from extra wiring. Nonpipelined instructions such as integer divide are executed in this data path, which is called the arithmetic/special unit (ASU). We decoupled the ASU from the machine cycle time by dedicating a full cycle each way to move operands to and from this unit.


On-chip memory system

The performance influence of the memory system becomes increasingly dominant as processor performance and clock rates increase. For this reason, the on-chip memory system was crucial to our delivering the UltraSPARC-III's performance and scalability goals. In designing the on-chip memory system, we followed this principle: Achieve uniform performance scaling by scaling both bandwidth and latency. A popular architectural trend is to try to hide the memory latency scaling problem by using program ILP. Not surprisingly, the hiding is not free: Programs with low ILP suffer a performance hit, and ILP that could have been used to speed up the program execution was wasted on "hiding" the lagging memory system. Table 3 summarizes the UltraSPARC-III on-chip memory system.

The key to scaling memory latency in the UltraSPARC-III is the first-level, sum-addressed memory data cache [6]. Fusing the memory address adder with the word-line decoder for the data cache largely eliminates the address adder's latency. This enabled us to increase the data cache size to completely occupy the time available in two processor cycles. The combination of an increased cache size with a scaled clock rate, while maintaining a two-cycle access, gives us a linear memory latency improvement. We can demonstrate this linear improvement with the following calculation of overall memory latency:

    average latency = L1 hit time
                    + L1 miss rate × L1 miss cost
                    + L2 miss rate × L2 miss cost

Table 4 shows latency trade-offs, using some representative values from simulations of the SPEC integer benchmarks to compare the UltraSPARC-II and UltraSPARC-III.

Comparing the 300-MHz UltraSPARC-II with the 600-MHz UltraSPARC-III shows that we were able to scale the average memory latency by more than the clock ratio. We achieved this result through the use of the sum-addressed memory (SAM) cache and improvements in the L2 cache and memory latencies [6]. The alternative UltraSPARC-III approaches could not keep up with the clock rate scaling, even with L2 cache and memory latency reductions.

Table 3. UltraSPARC-III's on-chip memory system parameters.

Cache        Size (Kbytes)  Associativity                  Line length (bytes)  Protocol
Instruction  32             4-way, microtag pseudorandom   32                   Store coherent
Data         64             4-way, microtag pseudorandom   32                   Write-through
Write        2              4-way, LRU                     64                   Write-validate
Prefetch     2              4-way, LRU                     64                   Store coherent

Table 4. Memory latency trade-offs. US-II is the UltraSPARC-II; US-III is the UltraSPARC-III. SAM is sum-addressed memory. All L2 caches are 4-Mbyte, direct-mapped. (The middle two rows are estimates made for design purposes.)

                                       L1 load   Load-use  L1 miss rate   L1 miss  L2 miss rate   L2 miss  Average memory
Processor/clock  L1 data cache         latency   penalty   (fraction per  cost     (fraction per  cost     latency
                                       (cycles)  (cycles)  load instr.)   (ns)     load instr.)   (ns)     (ns)
US-II, 300 MHz   16-Kbyte, 1-way          2         1         0.10          30        0.01          150      11.16
600 MHz          16-Kbyte, 1-way          2         1         0.10          20        0.01          100       6.33
600 MHz          64-Kbyte, 4-way          3         2         0.04          20        0.01          100       6.80
US-III, 600 MHz  64-Kbyte, 4-way, SAM     2         1         0.04          20        0.01          100       5.13
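As a check, the average-latency formula reproduces the table's last column. For the US-III row, a two-cycle L1 hit at 600 MHz takes 3.33 ns, so:

    average latency = 3.33 ns + 0.04 × 20 ns + 0.01 × 100 ns
                    = 3.33 + 0.80 + 1.00
                    = 5.13 ns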


The SAM cache works well for programs having reasonable cache hit rates, but we wanted the performance of all programs to scale. For programs dominated by main memory latency, we use two techniques: a prefetch cache that is accessed in parallel with the L1 data cache, and a low-latency, on-chip memory controller (described later). Analysis showed that many programs dominated by main memory latency shared a common characteristic: the ability to prefetch the memory data well before it's needed by the execution units.

By issuing up to eight in-flight prefetches to main memory, the prefetch cache enables a program to utilize 100% of the available main memory bandwidth without incurring a slowdown due to the main memory latency. The prefetch cache is a 2-Kbyte SRAM organized as 32 entries of 64 bytes, using four-way associativity with an LRU replacement policy. A multiport SRAM design let us achieve a very high throughput. Data can be streamed through the prefetch cache in a manner similar to stream buffers [7,8]. On every cycle, each of two independent read ports supplies 8 bytes of data to the pipeline while a third write port fills the cache with 16 bytes.

Other microprocessors, such as the UltraSPARC-II, implement prefetch instructions. Our simulations, however, show that prefetching's full benefit is not realized without the high-bandwidth streaming afforded by the three ports of the prefetch cache. We also included an autonomous stride prefetch engine that tracks the program counters of load instructions and detects when a load instruction is striding through memory. When the prefetch engine detects a striding load, it issues a hardware prefetch independent of any software prefetch. This allows the prefetch cache to be effective even on code that does not include prefetch instructions.
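A sketch of the kind of PC-indexed stride detection described above (the table size, confidence policy, and all names are invented):

    #include <stdint.h>

    #define TABLE_SIZE 64

    /* One entry per tracked load PC: last address seen, last
     * stride, and a confidence counter. */
    struct stride_entry {
        uint64_t pc, last_addr;
        int64_t  stride;
        unsigned confidence;
    };

    static struct stride_entry table[TABLE_SIZE];

    extern void issue_hw_prefetch(uint64_t addr);   /* assumed hook */

    /* Called for each executed load: if the same static load keeps
     * advancing by a constant stride, prefetch the next address. */
    void observe_load(uint64_t pc, uint64_t addr)
    {
        struct stride_entry *e = &table[(pc >> 2) % TABLE_SIZE];
        if (e->pc != pc) {                  /* new load: reset entry */
            e->pc = pc;
            e->last_addr = addr;
            e->stride = 0;
            e->confidence = 0;
            return;
        }
        int64_t stride = (int64_t)(addr - e->last_addr);
        e->confidence = (stride != 0 && stride == e->stride)
                      ? e->confidence + 1 : 0;
        e->stride    = stride;
        e->last_addr = addr;
        if (e->confidence >= 2)             /* stride seen twice in a row */
            issue_hw_prefetch(addr + (uint64_t)stride);
    }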

Our next challenge was to scale the on-chip memory bandwidths. We solved this largely by using two techniques: wave-pipelined SRAM designs for the on-chip caches, and a write cache for store traffic. Wave-pipelining the caches let us decouple the on-chip memory bandwidth from the latency and independently optimize each characteristic.

Write caching is an excellent way to reduce the bandwidth due to store traffic [9]. In the UltraSPARC-III we use a write cache to reduce the store traffic bandwidth to the off-chip L2 data cache. The write cache provides other benefits: By being the sole source of on-chip dirty data, the write cache easily handles both multiprocessor and on-chip cache consistency. Error recovery also becomes easier with the write cache, since the write cache keeps all other on-chip caches clean and simply invalidates them when an error is detected.

Sharing a 2-Kbyte SRAM design with the prefetch cache conserved our design effort. It was also practical: Simulations showed that the write-back bandwidth of the write cache was relatively insensitive to its size once it was larger than 512 bytes. The bandwidth reduction at 2 Kbytes was equivalent to the store traffic from a write-back, 64-Kbyte, four-way associative data cache. Over 90% of the time, the write cache can merge a store into an existing dirty write-cache line.

We use a byte-validate policy on the write cache. Rather than reading the data from the L2 cache for the bytes within the line that are not being overwritten, we just keep an individual valid bit for each byte. Not performing the read-on-allocate saves considerable L2 cache bandwidth by postponing a read-modify-write until the write cache evicts a line. Frequently, by eviction time the entire line has been written, so the write cache can eliminate the read. We included the write cache in the L2 data cache, and write-cache data can supersede read data from the L2 data cache. We handle this with a byte-merging multiplexer on the incoming L2 cache data bus that can choose either write-cache data or L2 cache data for each byte.
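A minimal model of the byte-validate policy and the evict-time merge (reduced to a single line; structure and function names are invented):

    #include <stdint.h>
    #include <stdbool.h>

    #define LINE 64

    /* One write-cache line: data plus one valid bit per byte.
     * Allocation never reads the L2 line; stores just mark the
     * bytes they write as valid. */
    struct wcache_line {
        uint8_t data[LINE];
        bool    valid[LINE];
    };

    void store_bytes(struct wcache_line *l, unsigned off,
                     const uint8_t *src, unsigned n)
    {
        for (unsigned i = 0; i < n; i++) {
            l->data[off + i]  = src[i];
            l->valid[off + i] = true;     /* no read-on-allocate */
        }
    }

    /* Eviction: merge valid bytes over the L2 line (the postponed
     * read-modify-write). If every byte is valid, the caller can
     * skip reading the L2 line entirely. */
    void evict_merge(struct wcache_line *l, uint8_t l2_line[LINE])
    {
        for (unsigned i = 0; i < LINE; i++)
            if (l->valid[i])
                l2_line[i] = l->data[i];  /* byte-merging multiplexer */
    }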

The last benefit of the write cache is in implementing the V9 memory ordering rules. The V9 architecture specifies a total store ordering memory model that simplifies the writing of high-performance multiprocessor programs. This model requires that store operations be made visible to all other processors in a multiprocessor system in the original program order. The write cache provides the point of global store visibility in UltraSPARC-III systems. Generally, keeping the requirements of stores (bandwidth, error correction, ordering, consistency) in a separate cache lets us independently optimize both parts of the on-chip memory system.



Floating-point unit

To meet the cycle time goals for the UltraSPARC-III, we made a concession to latency scaling in the floating-point execution units. Early circuit analysis showed that by using advanced dynamic circuit design, we needed to add only one additional latency cycle to the floating-point add and multiply units. The impact of additional execution latency on numerical floating-point programs concerned us less, because previous UltraSPARC generations encouraged unrolled loops to be scheduled for the L2 cache latency, which was eight cycles. Since the previous pipelines had modulo-scheduled loops at a multiple of our new latencies, the code schedules would be compatible.

We scaled the floating-point divide latency (in absolute nanoseconds) by using a multiplicative iteration algorithm. Table 5 summarizes the characteristics of the UltraSPARC-III floating-point execution units and compares them to the UltraSPARC-II latencies.

External memory and system bus interface

The UltraSPARC-III external memory system includes a large L2 data cache and the main memory system. Integrating the control of both these external memory systems on chip was essential to achieving our performance, scalability, and reliability goals.

We built the L2 data cache with eight industry-standard, register-to-register, pipelined static memory chips cycling at one-third of the processor clock rate. The cache controller allows programmable support of 4 or 8 Mbytes of L2 cache. The L2 cache controller accesses off-chip L2 cache SRAMs with a 12-cycle latency to supply a 32-byte cache line to the L1 caches. A 256-bit-wide data bus between the off-chip SRAMs and the microprocessor delivers the full 32 bytes of data needed for an L1 miss in a single SRAM cycle. By placing the tags for the L2 cache on chip, we reduced the latency to main memory with early detection of L2 misses. On-chip tags also enable derivative future designs to build associative L2 caches without a latency penalty. The L2 cache controller accesses on-chip tags in parallel with the start of the off-chip SRAM access and can provide a way-select signal to a late-select address pin on the off-chip data SRAMs.

Dedicating every other cycle of the on-chip L2 cache tags to coherency snoops from other processors provides excellent coherency bandwidth, since the tag SRAM is wave-pipelined at the full 600-MHz target clock rate.

Moving the main memory DRAM controller on chip reduces memory latency, compared to the previous generation, and scales memory bandwidth with the number of processors. The memory controller supports up to 4 Gbytes of SDRAM memory organized as four independent banks. In a multiprocessor system, the SDRAM banks can be interleaved across the per-processor memory controllers. By sizing the SDRAM data bus to be the same as the coherency unit (512 bits), we can minimize the latency to complete a data transfer from memory. This can have a significant performance effect, since misses from large caches tend to cluster, and contention for the memory bus from adjacent misses can impact performance. The memory controller has a peak 3.2-Gbyte/s transfer rate.

We maximized the bandwidth of the processor interface to the system bus by allowing up to 15 outstanding bus transactions from each processor. The outstanding transactions can complete out of order with respect to the original request order. This enables the memory banks in a multiprocessor system to service requests as soon as a bank is available. The processor's bus interface takes care of reordering the data delivery to the pipeline to meet the requirements of the SPARC V9 memory programming model.

The system bus interface architecture was key to meeting the reliability goals we stated earlier. All processor interfaces use error detection and/or correction codes to detect errors as soon as possible. The processor performs error detection on every external chip-to-chip hop to correctly isolate any fault to its source. We also designed an 8-bit-wide "back-door" bus that runs independently from the main system bus. If the system bus has an error, each processor can boot up and run diagnostic programs over the back-door bus to diagnose the problem.

Table 5. UltraSPARC-III (US-III) floating-point units compared to UltraSPARC-II (US-II) latencies.

              600-MHz US-III        US-III            300-MHz US-II
Operation     latency (cycles, ns)  issue rate        latency (cycles, ns)
Add/subtract  4, 6.66               1 per cycle       3, 9.99
Multiply      4, 6.66               1 per cycle       3, 9.99
Divide        20, 33.4              1 per 17 cycles   22, 72.6
Square root   24, 40.0              1 per 21 cycles   23, 76.0


To enable multiprocessor systems based on the UltraSPARC-III to scale up to 1,000 processors, we included on-chip support for in-memory coherency directories. Checking ECC over a 144-bit memory word instead of a 72-bit word freed up 7 bits of each 144-bit memory word for use by the in-memory directory. (A single-error-correcting, double-error-detecting code needs 8 check bits for 64 data bits but only 9 for 128, so pairing two 72-bit words leaves 7 bits spare.) The processor need only examine the directory state on each memory access to see whether an external agent is required to intervene to complete the access. Placing the directory in main memory allows it to expand automatically as memory is added to the system.

Physical design

Physical design grows more challenging with each new processor generation. As logic complexity grows, circuit counts and the related interconnect increase. With interconnect becoming increasingly dominant in design, blocks must buffer their inputs and outputs, much like I/O cells in ASIC designs but without the level translation. As clock rates rise faster than available speed increases in the base CMOS technology, increasing individual gate complexity without increasing gate delay becomes more urgent. With clock rates, gate count, and total wiring increasing, chip power increases, forcing additional changes in thermal management and electrical distribution schemes. A chip plot outlining the major functional units is shown in Figure 6.

Figure 6. The UltraSPARC-III's major functional units: instruction issue unit (IIU), instruction execute unit (IEU), floating-point unit (FGU), data cache unit (DCU), and system interface unit (SIU).

The UltraSPARC-III is flip-chip (solder bump) attached to a multilayered ceramic land grid array package having 750 I/O signals and 450 power bumps. The package has a new cap to mate with an air-cooled heat sink containing a heat-pipe structure to control the die temperature. A continuous double grid system on metal layers 5 and 6 provides all this power. This paired grid reduces the power supply loop inductance on the die and provides return current paths for long signal wiring. The grid elements in the sixth metal layer are strapped with what amount to bus bars on metal layer 7. This evens out the power supply resistance drops that would otherwise be seen by the blocks.

The heavy-duty grid concept extends to clock distribution as well. A single distributed clock tree contains a completely shielded grid to reduce both jitter-inducing injected noise and global skew. Each block also has its own shorted, shielded, and buffered clock, further reducing the blocks' local skews.

The circuit methodology we employed is primarily fully static CMOS, to simplify the verification requirements for much of the design. Only where speed requirements dictate higher performance did we use dynamic, or domino, designs. Also for verification ease, we placed dynamic signals only within fully shielded custom cells. To improve the speed enhancement obtained with the dynamic circuits without further increasing their power, we used an overlapping, multiphased, nonblocking clock scheme similar to that described by Klass [10].

With clock rates continuing to increase, the part of the cycle time allocated to flip-flops comes under great pressure. To improve this, we designed a new edge-triggered flip-flop [11]. This partially static output, dynamic input flip-flop does not require setup time and has one of the lowest D-to-Q delays for its power and area in use today. The flip-flop design's dynamic input stage effectively allows us to tuck in a full logic stage without increasing the D-to-Q delay. The noise immunity is increased by an input shutoff mechanism that reduces the effective sample time, allowing the design to be used as though it were fully static.



To improve our ability to wire the processor globally, we used an area-based router. This enabled reuse of any block area not needed at the design's lower level for additional top-level wiring. Cost functions for global routing based on noise and timing analysis let us define a specific wire group's width and spacing. Similarly, signal routing within the blocks let us improve timing and noise margins. MICRO

The UltraSPARC-III is the newest generation of 64-bit SPARC processors to be used in a wide range of Sun systems. The chip will go into production in the fourth quarter of 1999. The architecture enables multiprocessor systems with more than 1,000 processors to be easily built and still achieve excellent performance. Starting at 600-MHz target clock rates, we plan to extend the UltraSPARC-III architecture to achieve clock rates in excess of 1 GHz with the UltraSPARC-IV.

References

1. D. Weaver and T. Germond, The SPARC Architecture Manual, Version 9, Prentice Hall, Englewood Cliffs, N.J., 1994.
2. N. Wilhelm, "Why Wire Delays Will No Longer Scale for VLSI Chips," Tech. Report TR-95-44, Sun Laboratories, Mountain View, Calif., 1995.
3. K.J. Nowka and M.J. Flynn, Wave Pipelining of High-Performance CMOS Static RAM, Tech. Report CSL-TR-94-615, Stanford University, Stanford, Calif., 1994.
4. S.-T. Pan, K. So, and J. Rahmeh, "Improving the Accuracy of Dynamic Branch Prediction Using Branch Correlation," Proc. Fifth Int'l Conf. Architectural Support for Programming Languages and Operating Systems, ACM Press, New York, 1992, pp. 76-84.
5. J.E. Smith and A.R. Pleszkun, "Implementation of Precise Interrupts in Pipelined Processors," Proc. 12th Ann. Int'l Symp. Computer Architecture, ACM Press, 1985, pp. 36-44.
6. R. Heald et al., "64-KByte Sum-Addressed-Memory Cache with 1.6-ns Cycle and 2.6-ns Latency," IEEE J. Solid-State Circuits, Nov. 1998, pp. 1682-1689.
7. N. Jouppi, "Improving Direct-Mapped Cache Performance by the Addition of a Small Fully-Associative Cache and Prefetch Buffers," Proc. 17th Ann. Int'l Symp. Computer Architecture, IEEE Computer Soc. Press, Los Alamitos, Calif., 1990, pp. 364-373.
8. J.-L. Baer and T.-F. Chen, "An Effective On-Chip Preloading Scheme to Reduce Data Access Penalty," Proc. Supercomputing '91, IEEE Computer Soc. Press, 1991, pp. 176-186.
9. N. Jouppi, "Cache Write Policies and Performance," Proc. 20th Ann. Int'l Symp. Computer Architecture, ACM Press, 1993, pp. 191-201.
10. F. Klass, "A Non-Blocking Multiple-Phase Clocking Scheme for Dynamic Logic," IEEE Int'l Workshop on Clock Distribution Networks Design, Synthesis, and Analysis, IEEE Press, Piscataway, N.J., 1997.
11. F. Klass, "Semi-Dynamic and Dynamic Flip-Flops with Embedded Logic," Digest of Technical Papers, 1998 Symp. VLSI Circuits, IEEE Press, 1998, pp. 108-109.

Tim Horel is Megacell group manager for the UltraSPARC-III development team at Sun Microsystems Inc. He previously held product development and engineering positions at AMCC and IBM. Horel has a BSEE from the State University of New York at Buffalo.

Gary Lauterbach is a distinguished engineer at Sun Microsystems Inc. and chief architect of the UltraSPARC-III. In addition to microprocessor design, he has worked on operating system design, CAE tools, process-control systems, and microwave communication systems. He has a BSEE from the New Jersey Institute of Technology. Lauterbach is a past member of the ACM and the IEEE.

Contact the authors about this article at Sun Microsystems Inc., {tim.horel, gary.lauterbach}@eng.sun.com.

For a comparison of the UltraSPARC-III with its major competitors, see the interview with author Gary Lauterbach in the Computing Practices section of the June 1999 issue of Computer magazine.
