DSP SHARK Processors PART2

CS 433 Prof. Luddy Harrison Copyright 2005 University of Illinois 1

Analog Devices SHARC

CS 433 Processor Presentation Series

Prof. Luddy Harrison



Note on this presentation series

• These slide presentations were prepared by students of CS433 at the University of Illinois at Urbana-Champaign.
• All the drawings and figures in these slides were drawn by the students. Some drawings are based on figures in the manufacturer’s documentation for the processor, but none are electronic copies of such drawings.
• You are free to use these slides provided that you leave the credits and copyright notices intact.



Overview

• Processor History
• Physical packaging
• Data paths, register files, computational units
• Pipelining, timing information
• Memory
• Instruction Set Architecture (ISA)
• Applications targeted
• Systems employing the SHARC



SHARC Features

• Super Harvard ARChitecture
  – Unique CISC architecture allows simultaneous fetch of two operands and an instruction in one cycle
• Combines high performance DSP core with integrated, on-chip system features
  – Dual-ported (processor and I/O) SRAM
  – DMA Controller
• Selective Instruction Cache
  – Caches only those instructions whose fetches conflict with program memory data accesses



SHARC Processor History

• ADSP-2106x (2000)
  – Single computational units based on predecessor ADSP-2100 Family
  – 40 MHz core
• ADSP-2116x (2001)
  – SIMD (Single-Instruction Multiple-Data) dual computational unit architecture added
  – 150-200 MHz core, 1-2 MB RAM
• ADSP-2126x, ADSP-2136x (2003 – Future)
  – Integrated audio-centric peripherals (128-140 dB Sample Rate Conversion) added
  – 333-400 MHz core, 2-3 MB RAM



ADSP-2106x Overview

[Figure: ADSP-2106x block diagram.
• Core processor: timer, instruction cache, program sequencer, DAG1, DAG2, bus connect (PX), 16 x 40-bit data register file, multiplier, barrel shifter, ALU; PM and DM address buses, PM and DM data buses
• Dual-ported SRAM: two independent dual-ported blocks (Block 0, Block 1), each with a processor port and an I/O port
• External port: address bus mux, data bus mux, multiprocessor interface, host port
• I/O processor: IOP registers (control, status and data buffers), DMA controller, serial ports (2), link ports (6)]



ADSP-2106x Core

• Computational Units
  – ALU, Multiplier, and Shifter can all perform independent operations in a single cycle
• Register File
  – Two sets (primary and alternate) of 16 registers, each 40 bits wide
• Program Sequencer and Data Address Generators
  – Allow computational units to operate independently of instruction fetch and program counter increment



ADSP-2106x Packaging

[Figure: ADSP-2106x system connections.
• Clock and boot: CLKIN (1x clock), EBOOT, LBOOT, BMS
• Interrupts, flags, timer: IRQ, FLAG, TIMEXP
• Link ports (LxCLK, LxACK, LxDAT) connected to link devices
• Serial ports 0 and 1 (TCLKn, RCLKn, TFSn, RFSn, DTn, DRn), each connected to a serial device
• External bus (ADDR31-0, DATA47-0, RD, WR, ACK, MS3-0, PAGE, SBTS, SW, ADRCLK) connected over the control, address, and data lines to a boot EPROM, memory and peripherals, and a DMA device (DMAR1-2, DMAG1-2)
• Host processor interface: CS, HBR, HBG, REDY
• Multiprocessing: BR1-6, CPA, RPBA, ID2-0]



ADSP-2106x Key Pins

PIN          FUNCTION                   NOTE
ADDR31-0     External Bus Address
DATA47-0     External Bus Data
MS3-0        Memory Select Lines        Asserted (low) as chip selects for memory banks
PAGE         DRAM Page Boundary         Asserted if a page boundary is crossed
DMAR(1-2)    DMA Request 1 and 2
IRQ2-0       Interrupt Request Lines    Edge-triggered or level-sensitive



ADSP-2106x Registers

• Data Registers
  – R15 – R0 (fixed-point), F15 – F0 (floating-point)
• Program Sequencer
  – PC (program counter), PCSTKP (PC stack pointer), FADDR (fetch address), etc.
• Data Address Generator
  – I7 – I0 (DAG1 index), M7 – M0 (DAG1 modify)
  – L7 – L0 (DAG1 length), B7 – B0 (DAG1 base)
• Bus Exchange, Timer, and System Registers



ADSP-2106x Buses

• Address
  – Program Memory Address: 24 bits wide
  – Data Memory Address: 32 bits wide
• Data
  – Program Memory Data: 48 bits wide
    • Carries instructions, and data for dual fetches
  – Data Memory Data: 40 bits wide
    • Carries data operands
• One PM Data bus and/or one DM Data bus register file access per cycle



ADSP-2106x I/O

• Serial Ports
  – Operate at the clock rate of the processor
• DMA
  – Port data can be automatically transferred to and from on-chip memory



ADSP-2106x DMA

• I/O port block transfers (link/serial)
• External memory block transfers
• DMA channel setup by writing memory buffer parameters to DMA parameter registers
  – Starting Address for Buffer
  – Address Modifier
  – Word Count
• Interrupt generated when transfer completes (i.e. Word Count = 0)
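The channel setup described above can be sketched as a small software model. This is only an illustration in Python, not SHARC code: the class `DmaChannel`, its field names, and the helper `run_dma` are hypothetical, and the real parameter registers are written by the program before the transfer starts.

```python
from dataclasses import dataclass


@dataclass
class DmaChannel:
    """Hypothetical model of one DMA channel's buffer parameters."""
    address: int   # starting address for the buffer
    modifier: int  # address modifier, added after each word
    count: int     # word count still to transfer

    def transfer_word(self, memory, word):
        """Store one word; return True when the word count reaches 0."""
        if self.count == 0:
            raise RuntimeError("channel idle: word count is already 0")
        memory[self.address] = word
        self.address += self.modifier
        self.count -= 1
        return self.count == 0  # point at which the interrupt would be raised


def run_dma(channel, memory, incoming):
    """Feed port words to the channel until it signals completion."""
    for word in incoming:
        if channel.transfer_word(memory, word):
            return True  # transfer complete, interrupt generated
    return False
```

For example, `run_dma(DmaChannel(0x100, 2, 3), memory, [10, 20, 30, 40])` stores three words at addresses 0x100, 0x102, and 0x104 and then reports completion, leaving the fourth word untouched.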



ADSP-2106x DMA Registers

[Figure: 32-bit DMA control register (bits 31-0) with the following fields; exact bit positions are not recoverable from the extracted text.]

FIELD     MEANING
FS        Ext. port FIFO
FLSH      Flush ext. port FIFO
EXTERN    Ext. devices to ext. mem.
INTIO     Single-word interrupts
HSHAKE    DMA handshake
MASTER    DMA master mode
MSWF      Most significant word first
DEN       DMA enable
CHEN      DMA chaining enable
TRAN      DMA channel direction
PS        Packing status
DTYPE     Data type
PMODE     Packing mode



ADSP-2106x Pipelining

• Three phases
  – Fetch
    • Read from cache or program memory
  – Decode
    • Generate conditions for instruction
  – Execute
    • Operations specified by instruction completed



ADSP-2106x Branching and Pipelining

• Branches
  – Delayed
    • Two instructions after branch are executed
  – Non-delayed
    • Program sequencer suppresses instruction execution for next two instructions

CLOCK CYCLES (left to right)
Fetch                    n + 2    j        j + 1    j + 2
Decode                   n + 1    n + 2    j        j + 1
Execute (non-delayed)    n        no-op    no-op    j
Execute (delayed)        n        n + 1    n + 2    j

(n is the branch instruction; j is the instruction at the branch target.)
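The difference between the two branch types can be reproduced with a few lines of Python. This is a minimal sketch of the execute stage only, assuming the three-stage pipeline above; the function name `executed_stream` and the string labels are illustrative choices, not part of the SHARC tools.

```python
def executed_stream(branch="n", delayed=False, length=4):
    """Return what the execute stage sees in consecutive cycles.

    `branch` labels the branch instruction; "j" labels the branch target.
    """
    # When the branch executes, the next two instructions are already
    # in the fetch and decode stages.
    in_flight = [f"{branch}+1", f"{branch}+2"]
    if not delayed:
        # Non-delayed branch: the sequencer suppresses both of them,
        # so the execute stage sees two no-ops instead.
        in_flight = ["no-op", "no-op"]
    stream = [branch] + in_flight + ["j", "j+1", "j+2"]
    return stream[:length]
```

Calling `executed_stream(delayed=False)` gives `['n', 'no-op', 'no-op', 'j']`, and `executed_stream(delayed=True)` gives `['n', 'n+1', 'n+2', 'j']`, matching the two Execute rows of the timing table.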



ADSP-2106x Memory

On-Chip SRAM    ADSP-21060    ADSP-21062    ADSP-21061
Total Size      500KB         250KB         125KB

• On-chip support for:
  – 48-bit instructions (and 40-bit extended precision floating-point data)
  – 32-bit floating-point data
  – 16-bit short word data
• Off-chip memory up to 4 GB



ADSP-2106x Memory (2)

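[Figure: ADSP-2106x memory map.
• 0x0000 0000: IOP registers
• 0x0000 0100 – 0x0001 FFFF: reserved address space
• 0x0002 0000 – 0x0003 FFFF: normal word addressing (Block 0 from 0x0002 0000, Block 1 from 0x0003 0000); 128K x 32-bit words or 80K x 40-bit words
• 0x0004 0000 – 0x0007 FFFF: Block 0 (from 0x0004 0000) and Block 1 (from 0x0006 0000) again
• The two Block 0 / Block 1 ranges represent the same physical memory]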



ADSP-2106x Memory (3)

• Memory divided into blocks
• Dual-ported (PM and DM bus share one port, I/O bus uses the other)
  – Allows independent access by processor core and I/O processor
  – Each block can be accessed by both in every cycle
• Typical DSP applications (digital filters, FFTs, etc.) access two operands at once, such as a filter coefficient and a data sample, so allowing single-cycle execution is a must



ADSP-2106x Shadow Write

• Due to need for high-speed operations, memory writes go to a two-deep FIFO
• On write, data in FIFO from previous write is loaded to memory and new data enters FIFO
• Reads of last two written locations are intercepted and re-routed to the FIFO
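The shadow write behaviour can be modelled in a few lines of Python. This is a sketch under stated assumptions: the class name `ShadowWriteMemory` is hypothetical, the FIFO is assumed to retire its oldest entry only when a new write arrives while it already holds two entries, and unwritten memory is assumed to read as 0.

```python
from collections import deque


class ShadowWriteMemory:
    """Toy model of a memory fronted by a two-deep write FIFO."""

    def __init__(self):
        self.memory = {}     # backing memory: address -> value
        self.fifo = deque()  # up to two pending (address, value) writes

    def write(self, address, value):
        # A new write pushes the oldest pending write out to memory.
        if len(self.fifo) == 2:
            old_address, old_value = self.fifo.popleft()
            self.memory[old_address] = old_value
        self.fifo.append((address, value))

    def read(self, address):
        # Reads of the last two written locations are served from the FIFO,
        # newest entry first, so the reader never sees stale data.
        for pending_address, pending_value in reversed(self.fifo):
            if pending_address == address:
                return pending_value
        return self.memory.get(address, 0)
```

After `write(1, 10)` and `write(2, 20)` nothing has reached the backing memory yet, but `read(1)` still returns 10 because the read is intercepted by the FIFO.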



ADSP-2106x Instruction Cache

• Sequencer checks instruction cache on every program memory data access
• Allows PM bus to be used for data fetches instead of being tied up with an instruction fetch
• When fetch conflict first occurs, instruction is cached to prevent the same delay from happening again



ADSP-2106x Instruction Cache (2)

[Figure: instruction cache organization. 16 sets (Set 0 to Set 15), each with two entries (Entry 0 and Entry 1). Columns: LRU bit, valid bit, instructions, addresses (bits 23-4), addresses (bits 3-0). Address bits 3-0 (0000 to 1111) select the set.]
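The organization in the figure, 16 sets of two entries with an LRU bit, can be sketched as a two-way set-associative lookup. The Python below is an illustrative model only: the class `InstructionCache` and its method names are hypothetical, and the replacement policy is assumed to be plain least-recently-used within each set.

```python
class InstructionCache:
    """Toy model: 16 sets x 2 entries, indexed by address bits 3-0."""

    SETS = 16

    def __init__(self):
        # Each entry is None (invalid) or a (tag, instruction) pair.
        self.entries = [[None, None] for _ in range(self.SETS)]
        # Per set: which of the two entries was least recently used.
        self.lru = [0] * self.SETS

    @staticmethod
    def _split(address):
        # Bits 3-0 select the set; bits 23-4 are stored as the tag.
        return address & 0xF, (address >> 4) & 0xFFFFF

    def lookup(self, address):
        """Return the cached instruction, or None on a miss."""
        index, tag = self._split(address)
        for way, entry in enumerate(self.entries[index]):
            if entry is not None and entry[0] == tag:
                self.lru[index] = 1 - way  # the other entry is now LRU
                return entry[1]
        return None

    def fill(self, address, instruction):
        """Cache an instruction after a fetch conflict, evicting the LRU entry."""
        index, tag = self._split(address)
        way = self.lru[index]
        self.entries[index][way] = (tag, instruction)
        self.lru[index] = 1 - way
```

Addresses 0x123, 0x133, and 0x143 all map to set 3, so filling all three evicts whichever of the first two was used least recently.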



ADSP-2106x ISA Overview

• 24 operations, although some have more than one syntactical form
• Instruction Types
  – Compute and Move
    • Compute operation in parallel with data moves or index register modify
  – Program Flow Control
    • Branch, Call, Return, Loop
  – Immediate Data Move
    • Operand or addressing immediate fields
  – Miscellaneous
    • Bit Modify and Test, No-op, etc.



ADSP-2106x ISA: Compute and Move

• Instructions follow the format

    IF condition op1, op2;

  – IF and condition are optional
  – op1 is an optional compute instruction
  – op2 is an optional data move instruction



ADSP-2106x ISA: Compute Examples

• Single function

    F6 = (F2 + F3);

• Multi-function
  – Distinct parallel operations supported
  – Parallel computations and data transfers

    R1 = R2 * R6, M4 = R0;

  – Simultaneous multiplier and ALU operations

    R1 = R2 * R6, F6 = F2 + F3;



ADSP-2106x ISA: Single function Compute

Compute field, bits 22-0, most significant bit first:

    0 | CU | OPCODE | RN | RX | RY

• CU specifies the computational unit
  – 00: ALU
  – 01: Multiplier
  – 10: Shifter
• OPCODE indicates operation type (add, subtract, etc.)
• RN specifies result register
• RX and RY specify operand registers
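A decoder for this field can be sketched as follows. The field widths are an assumption, not something the slide states: with 16 data registers, RN, RX, and RY are taken to be 4 bits each, CU 2 bits, and the remaining 8 bits of the 23-bit field are taken to be OPCODE. The function name `decode_compute` and the opcode value used in the example are hypothetical.

```python
UNITS = {0b00: "ALU", 0b01: "Multiplier", 0b10: "Shifter"}


def decode_compute(field):
    """Split a 23-bit single-function compute field into its parts.

    Assumed layout (widths are an assumption): bit 22 = 0, CU = bits 21-20,
    OPCODE = bits 19-12, RN = bits 11-8, RX = bits 7-4, RY = bits 3-0.
    """
    if (field >> 22) & 1:
        raise ValueError("bit 22 is set: multi-function compute field")
    return {
        "unit": UNITS.get((field >> 20) & 0b11, "reserved"),
        "opcode": (field >> 12) & 0xFF,
        "rn": (field >> 8) & 0xF,
        "rx": (field >> 4) & 0xF,
        "ry": field & 0xF,
    }
```

Under these assumptions, a field built as `(0b01 << 20) | (0x30 << 12) | (1 << 8) | (2 << 4) | 6` decodes to the multiplier with result register 1 and operand registers 2 and 6; the opcode value 0x30 is an arbitrary placeholder.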



ADSP-2106x ISA: Multi-function Compute

Compute field, bits 22-0, most significant bit first:

    1 | OPCODE | RM | RA | RXM | RYM | RXA | RYA

• Parallel ALU and Multiplier operations
• Registers restricted to particular sets
  – Multiplier X: R3 – R0, Y: R7 – R4
  – ALU X: R11 – R8, Y: R15 – R12
• OPCODE specifies ALU op, for example:

    000100: Rm = R3-0 * R7-4, Ra = R11-8 + R15-12;
    011111: Rm = R3-0 * R7-4, Ra = MIN(R11-8, R15-12);



ADSP-2106x ISA: Program Flow Control

• Instructions follow the format

    IF condition JUMP/CALL, ELSE op2;

  – IF, condition, ELSE are optional
  – JUMP/CALL is a JUMP or CALL instruction
  – op2 is an optional compute instruction



ADSP-2106x ISA: Program Flow Control (2)

• Instructions follow the format

    DO <addr24> UNTIL termination;

  – No optional fields
  – <addr24> is the loop start address
  – termination is the loop ending condition to check after each iteration



ADSP-2106x ISA: Program Flow Examples

• Conditional Execution

    IF GT R1 = R2 * R6;
    IF NE JUMP label2;

• Also used for Call/Return

    main:    CALL routine;

    routine: ...
             RTS; /*return to main*/



ADSP-2106x ISA: Immediate Data Move

• Instructions follow the format

    ureg = <data32>;
    DM(<data32>, Ia) = ureg;
    PM(<data24>, Ia) = ureg;

  – Ia is an optional index register for indirect addressing
  – DM is a 32-bit data memory address
  – PM is a 24-bit program memory address



ADSP-2106x ISA: Addressing Examples

• Direct

    JUMP <data24>;

• Relative to Program Counter

    JUMP (PC, <data24>);

• Register Indirect (using DAG registers)
  – Pre-Modify (modification pre-address calculation)

    JUMP (M0, I0);

  – Post-Modify (modification post-address calculation)

    JUMP (I0, M0);



ADSP-2116x Overview

Extension of 2106x, adding a 150 MHz core and SIMD (Single-Instruction Multiple-Data) support via dual hardware

[Figure: ADSP-2116x SIMD data path. The program sequencer sends the same instruction to both processing elements. Each element has its own multiplier, data register file, barrel shifter, and ALU. Different data goes to each element over the PM data bus and DM data bus, through the bus connect.]



ADSP-2116x SIMD Engine

• Dual hardware allows same instruction to be executed across different data
  – 2 ALUs, multipliers, shifters, register files
  – Two data values transferred with each memory or register file access
  – Very effective for stereo channel processing
• Can effectively double performance over similar algorithms running on ADSP-2106x processors



ADSP-2116x SIMD Engine (2)

• Enabled/disabled via MODE1 bit
  – When disabled, processor simply acts in SISD mode
• Program sequencer must be aware of status flags set by each set of hardware elements
  – Conditional compute operations can be specified on both, either, or neither hardware set
  – Conditional branches and loops executed by AND’ing the condition tests on both hardware sets



ADSP-2116x SIMD Engine (3)

Instruction    Mode    Transfer 1            Transfer 2
Rx = Ry;       SISD    Rx loaded from Ry     n/a
               SIMD    Rx loaded from Ry     Sx loaded from Sy
Sx = Sy;       SISD    Sx loaded from Sy     n/a
               SIMD    Sx loaded from Sy     Rx loaded from Ry
Rx = Sy;       SISD    Rx loaded from Sy     n/a
               SIMD    Rx loaded from Sy     Sx loaded from Ry
Sx = Ry;       SISD    Sx loaded from Ry     n/a
               SIMD    Sx loaded from Ry     Rx loaded from Sy
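The pattern in the transfer table is regular: in SIMD mode the second transfer is the first one with the R and S register files swapped on both sides. The Python sketch below captures that rule; the function name `register_moves` and the string encoding of register names are illustrative assumptions.

```python
def register_moves(dest, src, simd):
    """Return the (destination, source) transfers for `dest = src;`.

    Register names are strings such as "Rx" or "Sy": the first letter
    names the register file (R or S), the rest names the register.
    """
    other_file = {"R": "S", "S": "R"}
    moves = [(dest, src)]  # Transfer 1 is the move as written
    if simd:
        # Transfer 2: the complementary move in the other register file.
        moves.append((other_file[dest[0]] + dest[1:],
                      other_file[src[0]] + src[1:]))
    return moves
```

For instance, `register_moves("Rx", "Sy", simd=True)` returns `[("Rx", "Sy"), ("Sx", "Ry")]`, which is the SIMD row for `Rx = Sy;` in the table.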



ADSP-2126x Overview

• Direct extension of 2116x, instructions are fully backward compatible
• Core increased to 150-200 MHz w/ 1MB SRAM
• Data buses increased from 32 to 64 bits
• Synchronous, independent serial ports increased from 2 to 6
• ROM-based security
  – Prevents piracy of code and algorithms
  – Prevents peripheral devices from reading on-chip memory



ADSP-2136x Overview

[Figure: ADSP-2136x block diagram.
• Core processor: DAG1, DAG2, program sequencer, timer, instruction cache, processing elements PEX and PEY, PX register; PM and DM address buses, PM and DM data buses
• 4 blocks of on-chip memory (Block 0 to Block 3), each with its own ADDR and DATA connections, containing SRAM (two 1M bit and two 0.5M bit arrays) and ROM (two 2M bit arrays)
• I/O processor and peripherals: IOP registers, signal routing unit, SPI, SPORTs, IDP, PCG, timers, SRC, SPDIF]



ADSP-2136x Overview (2)

• Direct extension of 2126x, instructions are fully backward compatible
• On-chip memory expanded from 2 to 4 blocks
• Digital Audio Interface (DAI) set of audio peripherals
  – Interrupt controller, interface data port, signal routing unit, clock generators, and timers
  – Different units contain S/PDIF receiver/transmitter, sample rate converters, or DTCP encrypting engine



SHARC Benchmarks

Algorithm benchmarks supplied by manufacturer:

                           2106x         2116x         2126x         2136x
Clock Cycle                66 MHz        100 MHz       200 MHz       333 MHz
Instruction Cycle Time     15 ns         10 ns         6.67 ns       3 ns
MFLOPS Sustained           132 MFLOPS    400 MFLOPS    600 MFLOPS    1332 MFLOPS
MFLOPS Peak                198 MFLOPS    600 MFLOPS    900 MFLOPS    1998 MFLOPS
FIR Filter (per tap)       15 ns         5 ns          2.5 ns        1.5 ns
IIR Filter (per biquad)    61 ns         20 ns         10 ns         6 ns
Divide (y/x)               91 ns         30 ns         20 ns         9 ns



Applications Targeted

• SHARC designed to
  – Simplify development
  – Speed time to market
  – Reduce product costs
• Products targeted
  – A/V Receivers
    • 7.1 Surround Sound Decoding
  – Mixing Consoles
  – Digital Synthesizers
  – Automobiles



Systems Employing the SHARC

• SRS Circle Surround II
• Melody (w/ Auto Room Tuner)
• Metric Halo's Portable Pro Audio Hub
• Alacron FT-P5



SHARC in SRS Circle Surround II

• Full multi-channel surround sound from simple right/left stereo sound
• Encoding can be transmitted over standard stereo medium (broadcast television, radio, etc.) and maintains full backward compatibility



SHARC in SRS Circle Surround II (2)

• Output from each source is combined in constant phase filter banks and encoded in quadrature to prevent signal cancellation
• “Positional bias generator” analyzes ratios between left and right surround signals which multipliers then apply to the opposing left or right output
• Decoder uses this level imbalance to direct the surround information to the correct output



SHARC Melody

• “Optimized Surround Sound for the Mass Market”
• Core of high-fidelity audio decoders in Denon, Bose, and Kenwood products
• Auto Room Tuner (ART) integrated software simplifies setup of complex audio systems



SHARC Melody ART

• Automatically measures and corrects multi-channel sound system for room’s acoustics
• Corrects system deficiencies
• For each speaker, ART calculates:
  – Sound pressure level (SPL)
  – Distance of each speaker from listener
  – Frequency response



SHARC in Metric Halo's Portable Pro Audio Hub

• Portable FireWire-based recording device, used in live recording applications by motion pictures and major recording artists like “No Doubt” and “Dave Matthews Band”
• Serial ports used to interface to digital and mixed-signal peripheral devices
• Initially implemented on SHARC ADSP-2106x, later upgraded to ADSP-2126x
• Future hybrid implementation will use an ADSP-2106x for FireWire processing coupled with an ADSP-2126x for audio processing



SHARC in Alacron FT-P5

• COTS (Commercial Off-The-Shelf) system for use in “distributed, compute intensive, high data rate applications” in commercial and military industries
• Supports 1 to 96 ADSP-2106x processors
• Makes extensive use of SHARC’s DMA through external PMC interface, supporting full-duplex communication in excess of 1 GB/sec
  – In-cabinet SAN clusters
  – Compute nodes in distributed systems



SHARC vs. RISC Processors

• RISC is...
  – Less costly to design, test, and manufacture, since processors are less specialized
• But...
  – Parallel (stereo) computation requires two or more interconnected processors accessing shared memory
  – Lower performance



Conclusion

• SHARC offers a great deal of computational power, with on-chip SRAM and SIMD architecture
• A variety of applications (especially audio processing) exploit it



Citations

Processor details taken from product manuals and descriptions at http://www.analog.com



December 8, 2003 Other ISA's 1

Other ISAs

• Next, we discuss some alternative instruction set designs.
  – Different ways of specifying memory addresses
  – Different numbers and types of operands in ALU instructions
  – A couple of advanced instruction sets
• VLIW (Very Long Instruction Word)
  – Texas Instruments C64
  – Analog Devices TigerSHARC
• ARM and Thumb


Addressing modes

• The first instruction set design issue we’ll see is addressing modes, which let you specify memory addresses in various ways.
  – Each mode has its own assembly language notation.
  – Different modes may be useful in different situations.
  – The location that is actually used is called the effective address.
• The addressing modes that are available will depend on the datapath.
  – Our simple datapath only supports two forms of addressing.
  – Older processors like the 8086 have zillions of addressing modes.
• We’ll introduce some of the more common ones.


Immediate addressing

• One of the simplest modes is immediate addressing, where the operand itself is accessed.

LD R1, #1999 R1 ← 1999

• This mode is a good way to specify initial values for registers.
• We’ve already used immediate addressing several times.
  – It appears in the string conversion program you just saw.


Direct addressing

• Another possible mode is direct addressing, where the operand is a constant that represents a memory address.

LD R1, 500 R1 ← M[500]

• Here the effective address is 500, the same as the operand.
• This is useful for working with pointers.
  – You can think of the constant as a pointer.
  – The register gets loaded with the data at that address.


Register indirect addressing

• We already saw register indirect mode, where the operand is a register that contains a memory address.

LD R1, (R0) R1 ← M[R0]

• The effective address would be the value in R0.
• This is also useful for working with pointers. In the example above,
  – R0 is a pointer, and R1 is loaded with the data at that address.
  – This is similar to R1 = *R0 in C or C++.
• So what’s the difference between direct mode and this one?
  – In direct mode, the address is a constant that is hard-coded into the program and cannot be changed.
  – Here the contents of R0, and hence the address being accessed, can easily be changed.


Stepping through arrays

• Register indirect mode makes it easy to access contiguous locations in memory, such as elements of an array.
• If R0 is the address of the first element in an array, we can easily access the second element too:

    LD R1, (R0)       // R1 contains the first element
    ADD R0, R0, #1
    LD R2, (R0)       // R2 contains the second element

• This is so common that some instruction sets can automatically increment the register for you:

    LD R1, (R0)+      // R1 contains the first element
    LD R2, (R0)+      // R2 contains the second element

• Such instructions can be used within loops to access an entire array.



Indexed addressing

• Operands with indexed addressing include a constant and a register.

LD R1, 500(R0) R1 ← M[R0 + 500]

• The effective address is the register data plus the constant. For instance, if R0 = 25, the effective address here would be 525.

• We can use this addressing mode to access arrays also.
  – The constant is the array address, while the register contains an index into the array.
  – The example instruction above might be used to load the 25th element of an array that starts at memory location 500.
• It’s possible to use negative constants too, which would let you index arrays backwards.


PC-relative addressing

• We’ve seen PC-relative addressing already. The operand is a constant that is added to the program counter to produce the effective memory address.

200: LD R1, $30 R1 ← M[201 + 30]

• The PC usually points to the address of the next instruction, so the effective address here is 231 (assuming the LD instruction itself uses one word of memory).

• This is similar to indexed addressing, except the PC is used instead of a regular register.

• Relative addressing is often used in jump and branch instructions.
  – For instance, JMP $30 lets you skip the next 30 instructions.
  – A negative constant lets you jump backwards, which is common in writing loops.


Indirect addressing

• The most complicated mode that we’ll look at is indirect addressing.

LD R1, [360] R1 ← M[M[360]]

• The operand is a constant that specifies a memory location which refers to another location, whose contents are then accessed.
• The effective address here is M[360].
• Indirect addressing is useful for working with multi-level pointers, or “handles.”
  – The constant represents a pointer to a pointer.
  – In C, we might write something like R1 = **ptr.


Addressing mode summary

Mode                 Notation             Register transfer equivalent
Immediate            LD R1, #CONST        R1 ← CONST
Direct               LD R1, CONST         R1 ← M[CONST]
Register indirect    LD R1, (R0)          R1 ← M[R0]
Indexed              LD R1, CONST(R0)     R1 ← M[R0 + CONST]
Relative             LD R1, $CONST        R1 ← M[PC + CONST]
Indirect             LD R1, [CONST]       R1 ← M[M[CONST]]
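The summary table translates almost line for line into code. The Python sketch below computes the effective address for each mode and performs the load; the function names `effective_address` and `load`, the mode strings, and the dictionary used as memory are illustrative choices rather than part of any real instruction set.

```python
def effective_address(mode, const=0, reg=0, pc=0, memory=None):
    """Return the effective address for one addressing mode.

    Immediate mode has no effective address (the operand is the data
    itself), so it returns None.
    """
    memory = memory if memory is not None else {}
    if mode == "immediate":
        return None
    if mode == "direct":
        return const               # M[CONST]
    if mode == "register indirect":
        return reg                 # M[R0]
    if mode == "indexed":
        return reg + const         # M[R0 + CONST]
    if mode == "relative":
        return pc + const          # M[PC + CONST]
    if mode == "indirect":
        return memory[const]       # M[M[CONST]]
    raise ValueError(f"unknown addressing mode: {mode}")


def load(mode, const=0, reg=0, pc=0, memory=None):
    """Return the value that LD would place in the destination register."""
    memory = memory if memory is not None else {}
    address = effective_address(mode, const, reg, pc, memory)
    return const if address is None else memory[address]
```

With R0 = 25, `effective_address("indexed", const=500, reg=25)` is 525, and with the PC already advanced to 201, `effective_address("relative", const=30, pc=201)` is 231, the same values worked out on the indexed and PC-relative slides.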


Number of operands

• Another way to classify instruction sets is according to the number of operands that each data manipulation instruction can have.

• Our example instruction set had three-address instructions, because each one had up to three operands: two sources and one destination.
• This provides the most flexibility, but it’s also possible to have fewer than three operands.

    ADD R0, R1, R2
    (operation: ADD; operands: destination R0, sources R1 and R2)

Register transfer instruction:

    R0 ← R1 + R2


Two-address instructions

• In a two-address instruction, the first operand serves as both the destination and one of the source registers.

    ADD R0, R1        R0 ← R0 + R1
    (operation: ADD; destination and source 1: R0; source 2: R1)

• Some other examples and the corresponding C code:

    ADD R3, #1        R3 ← R3 + 1       R3++;
    MUL R1, #5        R1 ← R1 * 5       R1 *= 5;
    NOT R1            R1 ← R1’          R1 = ~R1;




One-address instructions

• Some computers, like this old Apple II, have one-address instructions.
• The CPU has a special register called an accumulator, which implicitly serves as the destination and one of the sources.

    ADD R0            ACC ← ACC + R0
    (operation: ADD; source: R0)

• Here is an example sequence which increments M[R0]:

    LD (R0)           ACC ← M[R0]
    ADD #1            ACC ← ACC + 1
    ST (R0)           M[R0] ← ACC

The ultimate: zero addresses

• If the destination and sources are all implicit, then you don’t have to specify any operands at all!

• This is possible with processors that use a stack architecture.
  – HP calculators and their “reverse Polish notation” use a stack.
  – The Java Virtual Machine is also stack-based.
• How can you do calculations with a stack?
  – Operands are pushed onto a stack. The most recently pushed element is at the “top” of the stack (TOS).
  – Operations use the topmost stack elements as their operands. Those values are then replaced with the operation’s result.


Stack architecture example

• From left to right, here are three stack instructions, and what the stack looks like after each example instruction is executed.

                PUSH R1          PUSH R2          ADD
    (Top)       R1               R2               R1 + R2
                … stuff 1 …      R1               … stuff 1 …
                … stuff 2 …      … stuff 1 …      … stuff 2 …
    (Bottom)                     … stuff 2 …

• This sequence of stack operations corresponds to one register transfer instruction:

    TOS ← R1 + R2
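The push-push-add sequence can be run on a tiny zero-address interpreter. The Python below is a minimal sketch: the function name `run_stack_program`, the textual instruction format, and the inclusion of a MUL operation are assumptions made for illustration.

```python
def run_stack_program(program, registers, stack=None):
    """Execute zero-address instructions and return the final stack.

    The last list element is the top of the stack (TOS).
    """
    stack = list(stack) if stack is not None else []
    for instruction in program:
        op, *args = instruction.split()
        if op == "PUSH":
            stack.append(registers[args[0]])
        elif op in ("ADD", "MUL"):
            # Pop the two topmost elements and push the result in their place.
            right = stack.pop()
            left = stack.pop()
            stack.append(left + right if op == "ADD" else left * right)
        else:
            raise ValueError(f"unknown instruction: {instruction}")
    return stack
```

Running `["PUSH R1", "PUSH R2", "ADD"]` with R1 = 3 and R2 = 4 on a stack that already holds two older values leaves those values untouched and places 7 on top, just as in the diagram.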


Data movement instructions

• Finally, the types of operands allowed in data manipulation instructions are another way of characterizing instruction sets.
  – So far, we’ve assumed that ALU operations can have only register and constant operands.
  – Many real instruction sets allow memory-based operands as well.
• We’ll use the book’s example and illustrate how the following operation can be translated into some different assembly languages.

    X = (A + B)(C + D)

• Assume that A, B, C, D and X are really memory addresses.


Register-to-register architectures

• Our programs so far assume a register-to-register, or load/store, architecture, which matches our datapath from last week nicely.
  – Operands in data manipulation instructions must be registers.
  – Other instructions are needed to move data between memory and the register file.
• With a register-to-register, three-address instruction set, we might translate X = (A + B)(C + D) into:

    LD R1, A          R1 ← M[A]         // Use direct addressing
    LD R2, B          R2 ← M[B]
    ADD R3, R1, R2    R3 ← R1 + R2      // R3 = M[A] + M[B]

    LD R1, C          R1 ← M[C]
    LD R2, D          R2 ← M[D]
    ADD R1, R1, R2    R1 ← R1 + R2      // R1 = M[C] + M[D]

    MUL R1, R1, R3    R1 ← R1 * R3      // R1 has the result
    ST X, R1          M[X] ← R1         // Store that into M[X]


Memory-to-memory architectures

• In memory-to-memory architectures, all data manipulation instructions use memory addresses as operands.
• With a memory-to-memory, three-address instruction set, we might translate X = (A + B)(C + D) into simply:

    ADD X, A, B       M[X] ← M[A] + M[B]
    ADD T, C, D       M[T] ← M[C] + M[D]    // T is temporary storage
    MUL X, X, T       M[X] ← M[X] * M[T]

• How about with a two-address instruction set?

    MOVE X, A         M[X] ← M[A]           // Copy M[A] to M[X] first
    ADD X, B          M[X] ← M[X] + M[B]    // Add M[B]
    MOVE T, C         M[T] ← M[C]           // Copy M[C] to M[T]
    ADD T, D          M[T] ← M[T] + M[D]    // Add M[D]
    MUL X, T          M[X] ← M[X] * M[T]    // Multiply



Register-to-memory architectures

• Finally, register-to-memory architectures let the data manipulation instructions access both registers and memory.

• With two-address instructions, we might do the following:

    LD R1, A          R1 ← M[A]             // Load M[A] into R1 first
    ADD R1, B         R1 ← R1 + M[B]        // Add M[B]
    LD R2, C          R2 ← M[C]             // Load M[C] into R2
    ADD R2, D         R2 ← R2 + M[D]        // Add M[D]
    MUL R1, R2        R1 ← R1 * R2          // Multiply
    ST X, R1          M[X] ← R1             // Store
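To check that the register-to-memory sequence really computes X = (A + B)(C + D), it can be run on a small interpreter. The Python below is an illustrative sketch: the function name `run_register_memory`, the rule that names starting with "R" followed by digits are registers, and the count of data memory accesses are all assumptions made for this example.

```python
def run_register_memory(program, memory):
    """Run two-address register-to-memory code; return data memory accesses."""
    registers = {}
    accesses = 0

    def is_register(name):
        return name.startswith("R") and name[1:].isdigit()

    def value(operand):
        # A source operand is either a register or a memory location.
        nonlocal accesses
        if is_register(operand):
            return registers[operand]
        accesses += 1
        return memory[operand]

    for line in program:
        op, operands = line.split(None, 1)
        first, second = [part.strip() for part in operands.split(",")]
        if op == "LD":
            registers[first] = value(second)
        elif op == "ADD":
            registers[first] = registers[first] + value(second)
        elif op == "MUL":
            registers[first] = registers[first] * value(second)
        elif op == "ST":
            memory[first] = registers[second]
            accesses += 1
        else:
            raise ValueError(f"unknown instruction: {line}")
    return accesses
```

With A = 1, B = 2, C = 3, and D = 4, the six-instruction program above leaves 21 in M[X] and touches data memory five times: four operand reads and one store.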


Size and speed

• There are lots of tradeoffs in deciding how many and what kind of operands and addressing modes to support in a processor.

• These decisions can affect the size of machine language programs.
  – Memory addresses are long compared to register file addresses, so instructions with memory-based operands are typically longer than those with register operands.
  – Permitting more operands also leads to longer instructions.
• There is also an impact on the speed of the program.
  – Memory accesses are much slower than register accesses.
  – Longer programs require more memory accesses, just for loading the instructions!
• Most newer processors use register-to-register designs.
  – Reading from registers is faster than reading from RAM.
  – Using register operands also leads to shorter instructions.


Texas Instruments C64
VLIW signal processor


TI C64: Architecture

[Figure: TMS320C64x CPU block diagram. Program cache/program memory (32-bit addresses, 256-bit data) feeds program fetch, instruction dispatch, and instruction decode. Register file A serves units .L1, .S1, .M1, .D1; register file B serves units .D2, .M2, .S2, .L2. Data cache/data memory uses 32-bit addresses and 8-, 16-, 32-, and 64-bit data.]

Functional units:
• 6 ALUs (L1, L2, S1, S2, D1, D2)
• 2 multipliers (M1, M2)


TMS320C64x Data Paths

The data path of C64x has the following components:
• Two load-from-memory data paths;
• Two store-to-memory data paths;
• Two data address paths;
• Two register file data cross paths;

[Figure: C64x data paths. Data path A: register file A (A0-A31) with units .L1, .S1, .M1, .D1. Data path B: register file B (B0-B31) with units .L2, .S2, .M2, .D2. Load (LD), store (ST1a/ST1b, ST2a/ST2b), and data address (DA1, DA2) paths connect the register files to memory.]


TI C64: Functional Units (Structure)

[Figure: ports of the .L1, .S1, .M1, and .D1 units. Every unit has src1, src2, and dst ports; the .L and .S units additionally have long src and long dst ports.]

• Each functional unit has its own 32-bit write port into a GPR. Each functional unit reads directly from its own data path;
• All units ending in 1 write to register file A, and all units ending in 2 write to register file B;
• Each functional unit has two 32-bit read ports for source operands src1 and src2;
• L and S units have an extra 8-bit-wide port for 40-bit long writes, as well as an 8-bit input for 40-bit long reads;
• Each C64x multiplier can return up to a 64-bit result;



TI C64: .L (.L1 and .L2) Unit Operations Performed

• 32/40-bit arithmetic and compare operations
• 32-bit logical operations
• Leftmost 1 or 0 counting for 32 bits
• Normalization count for 32 and 40 bits
• Byte shifts
• Data packing/unpacking
• 5-bit constant generation
• Vector Operations:
  – Dual 16-bit arithmetic operations
  – Quad 8-bit arithmetic operations
  – Dual 16-bit min/max operations
  – Quad 8-bit min/max operations


.S (.S1 and .S2) Unit Operations Performed

• 32-bit arithmetic operations
• 32/40-bit shifts and 32-bit bit-field operations
• 32-bit logical operations
• Branches
• Constant generation
• Register transfers to/from control register file (.S2 only)
• Byte shifts
• Data packing/unpacking
• Vector Operations
  – Dual 16-bit compare operations
  – Quad 8-bit compare operations
  – Dual 16-bit shift operations
  – Dual 16-bit saturated arithmetic operations
  – Quad 8-bit saturated arithmetic operations


.M (.M1 and .M2) Unit Operations Performed

• 16 x 16 multiply operations
• 16 x 32 multiply operations
• Vector Operations
  – Quad 8 x 8 multiply operations
  – Dual 16 x 16 multiply operations
  – Dual 16 x 16 multiply with add/subtract operations
  – Quad 8 x 8 multiply with add operation
• Bit expansion
• Bit interleaving/de-interleaving
• Variable shift operations
• Rotation
• Galois Field Multiply


.D (.D1 and .D2) Unit Operations Performed

• 32-bit add, subtract, linear and circular address calculation (for circular arrays)
• Loads and stores with 5-bit constant offset
• Loads and stores with 15-bit constant offset (.D2 only)
• Load and store double words with 5-bit constant
• Load and store non-aligned words and double words
• 5-bit constant generation
• 32-bit logical operations


Instruction to Functional Unit Mapping

.D Unit
ADD, ADDAB, ADDAH, ADDAW, LDB, LDBU, LDH, LDHU, LDW, LDB (15-bit offset)‡, LDBU (15-bit offset)‡, LDH (15-bit offset)‡, LDHU (15-bit offset)‡, LDW (15-bit offset)‡, MV, STB, STH, STW, STB (15-bit offset)‡, STH (15-bit offset)‡, STW (15-bit offset)‡, SUB, SUBAB, SUBAH, SUBAW, ZERO

.S Unit
ADD, ADDK, ADD2, AND, B disp, B IRP†, B NRP†, B reg, CLR, EXT, EXTU, MV, MVC†, MVK, MVKH, MVKLH, NEG, NOT, OR, SET, SHL, SHR, SHRU, SSHL, SUB, SUBU, SUB2, XOR, ZERO

.M Unit
MPY, MPYU, MPYUS, MPYSU, MPYH, MPYHU, MPYHUS, MPYHSU, MPYHL, MPYHLU, MPYHULS, MPYHSLU, MPYLH, MPYLHU, MPYLUHS, MPYLSHU, SMPY, SMPYHL, SMPYLH, SMPYH

.L Unit
ABS, ADD, ADDU, AND, CMPEQ, CMPGT, CMPGTU, CMPLT, CMPLTU, LMBD, MV, NEG, NORM, NOT, OR, SADD, SAT, SSUB, SUB, SUBU, SUBC, XOR, ZERO


Instruction Packets

• Instructions are always fetched 8 (256 bits) at a time. This is called a fetch packet.
• If the p-bit of instruction i is set, then instruction i and i+1 are executed in the same cycle in parallel.
• 1 to 8 instructions can be executed in parallel. This is called an execute packet.
• In the C62x, packets could not cross the 8-word boundary, and thus the 8th p-bit was always 0 and padding with NOPs was needed. The C64x did away with that restriction, and execute packets may now span multiple fetch packets.



Fetch Packet Example

Cycle/Execute Packet    Instructions
1                       A B C
2                       D
3                       E F
4                       G H
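The grouping in this example follows directly from the p-bit rule: an instruction whose p-bit is set executes in the same cycle as the one after it. The Python sketch below splits a fetch packet into execute packets; the function name `execute_packets` and the list-of-bits representation are assumptions for illustration.

```python
def execute_packets(instructions, p_bits):
    """Group instructions into execute packets using their p-bits.

    A p-bit of 1 on instruction i means instruction i+1 runs in the same
    cycle; a p-bit of 0 closes the current execute packet.
    """
    packets, current = [], []
    for name, p_bit in zip(instructions, p_bits):
        current.append(name)
        if not p_bit:
            packets.append(current)
            current = []
    if current:
        # The final p-bit was set, so the packet continues into the next
        # fetch packet (allowed on the C64x); return what we have so far.
        packets.append(current)
    return packets
```

For the fetch packet above, `execute_packets(list("ABCDEFGH"), [1, 1, 0, 0, 1, 0, 1, 0])` returns `[['A', 'B', 'C'], ['D'], ['E', 'F'], ['G', 'H']]`, one list per cycle.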


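C64x Opcode Map

[Figures: 32-bit opcode layouts, listed from bit 31 down to bit 0. Some field boundaries are not recoverable from the extracted text.]

• Operations on the .L unit: creg (31-29), z (28), dst (27-23), src2 (22-18), src1/cst (17-13), x (12), op (11-5), fixed bits 1 1 0 (4-2), s (1), p (0)
• Operations on the .M unit (two layouts shown, with fixed bits 00 and 01 in bits 7-6): creg (31-29), z (28), dst (27-23), src2 (22-18), src1/cst (17-13), op, fixed bits 0 0 0 (4-2), s (1), p (0)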


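C64x Opcode Map

[Figures: 32-bit opcode layouts, listed from bit 31 down to bit 0. Some field boundaries are not recoverable from the extracted text.]

• Load/store with 15-bit offset on the .D unit: creg (31-29), z (28), dst/src (27-23), ucst15 (22-8), y (7), ld/st (6-4), fixed bits 1 1 (3-2), s (1), p (0)
• Load/store with baseR + offset/cst on the .D unit: creg (31-29), z (28), dst/src (27-23), baseR (22-18), offset/ucst5 (17-13), mode (12-9), r (8), y (7), ld/st (6-4), fixed bits 1 1 (3-2), s (1), p (0)
• Operations on the .S unit: field labels are missing from the extracted figure; the low bits end in s (1), p (0)
• ADDK on the .S unit: creg (31-29), z (28), dst (27-23), cst (22-7), fixed bits in 6-2, s (1), p (0)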


Analog Devices TigerSHARC
VLIW Vector Signal Processor


ADI TigerSHARC: Core Block Diagram


ADI TigerSHARC: Computation Block Block Diagram



Register Data Formats


Instruction Line Organization


Instruction Encoding


Compute Block


IALU


Load and Store



Sequencer


ARM and Thumb
Low Power General Purpose Microprocessors


ARM Family Overview

• Architecture Versions
  – ARM V3, V4, V5, V6
  – Called “architecture” in their literature, this is the programmer’s view of the machine
    • The externally visible architecture
    • It is primarily a matter of Instruction Set Architecture
• Implementations
  – ARM7, ARM9, ARM10, ARM11
    • With letter extensions – to be explained shortly
  – Called “cores” in their literature


ARM Evolution

28 Jan 2005 Copyright ARM Ltd. 2002

ARM11 MicroArchitecture




ARMv5T (ARM)


ARMv5T (Thumb)


Summary

• Instruction sets can be classified along several lines.
  – Addressing modes let instructions access memory in various ways.
  – Data manipulation instructions can have from 0 to 3 operands.
  – Those operands may be registers, memory addresses, or both.
• Instruction set design is intimately tied to processor datapath design.
• VLIW and compact, low-power instruction sets represent endpoints on a continuum.
  – The VLIW uses enormous instruction fetch bandwidth to keep lots of functional units busy.
  – Thumb mode attempts to pack irregular control code into as few bits as possible to save instruction fetch bandwidth (power).

