
Maurizio Palesi 1

Introduction to DSP

Maurizio Palesi

Maurizio Palesi 2

What is a DSP?

Digital
Operating by the use of discrete signals to represent data in the form of numbers

Signal
A variable parameter by which information is conveyed through an electronic circuit

Processing
Performing operations on data according to programmed instructions

Digital Signal Processing
Changing or analysing information which is measured as discrete sequences of numbers

Maurizio Palesi 3

Main Characteristics

Compared to other embedded computing applications, DSP applications are differentiated by the following:

Computationally demanding
Iterative numeric algorithms
Sensitivity to small numeric errors (audible noise)
Stringent real-time requirements
Streaming data
High data bandwidth
Predictable (though often eccentric) memory access patterns
Predictable program flow (nested loops)

Maurizio Palesi 4

DSP Processors

Around 1970, DSP techniques began to be used in telecommunication equipment

Microprocessors did not offer adequate performance
Custom fixed-function hardware did not offer adequate flexibility and reusability

DSP processors emerged to fill this gap

Maurizio Palesi 5

DSP vs. General Purpose

DSPs adopt a range of specialized features
Single-cycle multiplier
Multiply-accumulate operations
Saturation arithmetic
Separate program and data memories
Dedicated, specialized addressing hardware
Complex, specialized instruction sets

Today, virtually every commercial 32-bit microprocessor architecture (from ARM to 80x86) has been subject to some kind of DSP-oriented enhancement

GP and DSP architectures are converging: VLIW, superscalar, SIMD, multiprocessing, ...

Maurizio Palesi 6

GP Microprocessor (circa 1984: Intel 8088)

~100,000 transistors
Clock speed: ~5 MHz
Address space: 20 bits
Bus width: 8 bits
100+ instructions
2-35 cycles per instruction
Microcoded architecture
Many addressing modes
Relatively inexpensive

Apparent trends
Larger address space
Higher clock speed
More transistors
More instructions
More arithmetic capability
More memory management support

Maurizio Palesi 7

DSP Microprocessors (circa 1984: TMS32010)

Clock speed: 20 MHz
Word/bus width: 16 bits
Address space: 8, 12 bits
~50,000 transistors
~35 instructions
4-cycle execution of most instructions
Harvard architecture: separate program and data memories and buses
16x16 hardware multiplier
Double-length accumulator with saturation
A few special DSP instructions
Relatively expensive

Apparent trends
Higher clock rates
Fewer cycles/instruction
Somewhat expanded address spaces
More specialized DSP instructions
Lower cost

Maurizio Palesi 8

RISC Processors circa 1984

Academic research topic
12-16 instructions
Single-cycle execute
No microcode!
Small, heavily optimized instruction set executable in a single short cycle
All instructions same size
No microcode = faster execution
Extra speed more than offsets increased code size and reduced functionality
Better compiler target

Maurizio Palesi 9

Arguments Advanced for CISC

Fewer instructions per task
Shorter programs
Hardware implementation of complex instructions faster than software
Extra addressing modes help the compiler

Maurizio Palesi 10

The RISC vs CISC Controversy

Lots of argument
Hundreds of papers
Hottest topic in computer architecture
In the mid to late '80s, many RISC microprocessors were introduced: MIPS, SPARC (Sun), MC88000, PowerPC, i960 (Intel), PA-RISC

For a time, RISC looked tough to beat ...

Maurizio Palesi 11

CISC Processors circa 1999

Clock speed: ~400 MHz
Several million transistors
32-bit address space or more
32-bit external buses, 128-bit internally
~100 instructions
Superscalar CPU
Judiciously microcoded
On-chip cache
Very complex memory hierarchy
Single-cycle execute of most instructions!
32-bit floating-point ALU on board!
Multimedia extensions
Harvard architecture (internally)!
Very expensive (100s of dollars)
10s of Watts power consumption

Maurizio Palesi 12

RISC Architectures circa 1999

The same!

Maurizio Palesi 13

DSP Microprocessors circa 1999

Clock speed: 100-200 MHz
16-bit (fixed point) or 32-bit (floating point) buses and word sizes
16-24 bit address space
Some on-chip memory
Single-cycle execution of most instructions
Harvard architecture
Lots of special DSP instructions
50 mW to 2 W power consumption
Cheap!

Maurizio Palesi 14

Question

If CISC and RISC have adopted all the distinguishing features of early DSP microprocessors and more, why didn't they take over the DSP embedded market, too?

Current high-volume DSP applications (e.g., hard disk drive controllers, cell phones) require low cost and low power
DSP microprocessors are stripped of all but the most essential features for DSP applications
The most quoted figures for DSP microprocessors are not MIPS, but MIPS/$ and MIPS/mW
Market needs in DSP embedded systems are sufficiently different that no single architectural family can compete in both the DSP and the general-purpose microprocessor market

Maurizio Palesi 15

Time Domain Processing

Correlation
Autocorrelation to extract a signal from noise
Cross correlation to locate a known signal
Cross correlation to identify a signal
Convolution

Maurizio Palesi 16

Correlation

Correlation is a weighted moving average
It requires a lot of calculation: if one signal is of length M and the other is of length N, then we need N * M multiplications to calculate the whole correlation function
Note that really we want to multiply and then accumulate the result - this is typical of DSP operations and is called a multiply & accumulate (MAC) operation

r_n = ∑_k x_k × y_(k+n)

To correlate signal x with signal y:
Shift y by n
Multiply the two together
Integrate
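A minimal C sketch of the correlation sum above, written as the multiply & accumulate loop the slide describes; the array names, lengths and the zero-padding of out-of-range samples are illustrative assumptions:

/* r[n] = sum_k x[k] * y[k+n]; out-of-range samples of y are treated as zero */
void correlate(const float *x, int M, const float *y, int N, float *r)
{
    for (int n = 0; n < N; n++) {          /* one output value per shift n */
        float acc = 0.0f;
        for (int k = 0; k < M; k++)        /* multiply & accumulate (MAC)  */
            if (k + n < N)
                acc += x[k] * y[k + n];
        r[n] = acc;
    }
}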

Maurizio Palesi 17

Correlation

Correlation is a maximum when two signals are similar in shape
Correlation is a measure of the similarity between two signals as a function of the time shift between them

If two signals are similar and unshifted, their product is all positive
But as the shift increases, parts of the product become negative
The correlation function therefore shows where the signals are similar and unshifted

Maurizio Palesi 18

Detecting Periodicity

Autocorrelation as a way to detect periodicity in signals

[Figure: EEG signal (upper trace) and its autocorrelation (lower trace)]

Maurizio Palesi 19

Detecting Periodicity

Although the rhythm is not even visible in the noisy signal (upper trace), it is detected by autocorrelation (lower trace)

[Figure: EEG signal with noise and its autocorrelation]

Maurizio Palesi 20

Align Signals

[Figure: signal x, signal y, and their correlation corr(x,y)]

Maurizio Palesi 21

Align Signals

[Figure: signals x and y]

Maurizio Palesi 22

Cross Correlation

Cross correlation (correlating a signal with another) can be used to detect and locate a known reference signal in noise

A radar or sonar 'chirp' signal bounced off a target may be buried in noise, but correlating with the 'chirp' reference clearly reveals when the echo comes

Maurizio Palesi 23

Cross Correlation to Identify a Signal

Cross correlation (correlating a signal with another) can be used to identify a signal by comparison with a library of known reference signals

The chirp of a nightingale correlates strongly with another nightingale, but weakly with a dove or a heron

Maurizio Palesi 24

Cross Correlation to Identify a Signal

Cross correlation is one way in which sonar can identify different types of vessel
Each vessel has a unique sonar signature
The sonar system has a library of pre-recorded echoes from different vessels
An unknown sonar echo is correlated with the library of reference echoes
The largest correlation is the most likely match

Maurizio Palesi 25

Convolution

Convolution is a weighted moving average with one signal flipped back to front
It requires a lot of calculation: if one signal is of length M and the other is of length N, then we need N * M multiplications to calculate the whole convolution function
We need to multiply and then accumulate the result - this is typical of DSP operations and is called a multiply & accumulate (MAC) operation

r_n = ∑_k x_k × y_(n−k)

To convolve one signal with another signal:
First flip the second signal
Then shift it
Then multiply the two together
And integrate under the curve
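A minimal C sketch of the convolution sum, with the flip of the second signal expressed through the index n-k; the names, lengths and zero-padding of out-of-range samples are illustrative assumptions:

/* r[n] = sum_k x[k] * y[n-k]; out-of-range samples of y are treated as zero */
void convolve(const float *x, int M, const float *y, int N, float *r)
{
    for (int n = 0; n < M + N - 1; n++) {  /* full-length result                */
        float acc = 0.0f;
        for (int k = 0; k < M; k++) {      /* multiply & accumulate (MAC)       */
            int j = n - k;                 /* index into the flipped, shifted y */
            if (j >= 0 && j < N)
                acc += x[k] * y[j];
        }
        r[n] = acc;
    }
}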

Maurizio Palesi 26

Convolution vs. Correlation

Convolution is used for digital filtering
Convolving two signals is equivalent to multiplying the frequency spectra of the two signals together
This is easily understood, and is what we mean by filtering

Correlation is equivalent to multiplying the complex conjugate of the frequency spectrum of one signal by the frequency spectrum of the other
This is not so easily understood, and so convolution is used for digital filtering

Convolving by multiplying frequency spectra is called fast convolution

Maurizio Palesi 27

Fourier Transform

The Fourier transform is an equation to calculate the frequency, amplitude and phase of each sine needed to make up any given signal

The Fourier Transform (FT) is a mathematical formula using integrals
The Discrete Fourier Transform (DFT) is a discrete numerical equivalent using sums instead of integrals
The Fast Fourier Transform (FFT) is just a computationally fast way to calculate the DFT

The DFT involves a summation; the DFT and the FFT involve a lot of multiplying and accumulating the result
This is typical of DSP operations and is called a multiply & accumulate (MAC) operation

H(f) = ∑_k c[k] × e^(−2πjkfΔ)
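A minimal C sketch of the DFT as the same multiply & accumulate pattern, evaluated at N frequency bins; the function name and normalisation are illustrative assumptions, and a real system would use an FFT routine instead:

#include <math.h>

/* H[f] = sum_k c[k] * e^(-2*pi*j*k*f/N), split into real and imaginary parts */
void dft(const float *c, int N, float *re, float *im)
{
    const double PI = 3.14159265358979323846;
    for (int f = 0; f < N; f++) {          /* one frequency bin at a time */
        float sum_re = 0.0f, sum_im = 0.0f;
        for (int k = 0; k < N; k++) {      /* multiply & accumulate (MAC) */
            double phase = -2.0 * PI * (double)k * f / N;
            sum_re += c[k] * (float)cos(phase);
            sum_im += c[k] * (float)sin(phase);
        }
        re[f] = sum_re;
        im[f] = sum_im;
    }
}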

Maurizio Palesi 28

Filtering

The function of a filter is to
Remove unwanted parts of the signal, such as random noise
Extract useful parts of the signal, such as components lying within a certain frequency range

Filters can be analog or digital

[Diagram: raw signal -> Filter -> filtered signal]

Maurizio Palesi 29

Analog Filters

An analog filter uses analog electronic circuits
It uses components such as resistors, capacitors and op amps

Analog filters are widely used in applications such as
Noise reduction
Video signal enhancement
Graphic equalisers in hi-fi systems
..., and many other areas

Maurizio Palesi 30

Digital Filters

A digital filter uses a digital processor, typically a specialised DSP chip, to perform numerical calculations on sampled values of the signal

[Diagram: unfiltered analog signal -> A/D -> sampled digitised signal -> DSP -> digitally filtered signal -> D/A -> filtered analog signal]

Maurizio Palesi 31

Advantages of Digital Filters

Programmability
The digital filter can easily be changed without affecting the circuitry
Analog filter circuits are subject to drift and are dependent on temperature

Digital filters can handle low frequency signals accurately
As the speed of DSP technology continues to increase, digital filters are being applied to high frequency signals in the RF domain

Versatility
Digital filters can adapt to changes in the characteristics of the signal

Maurizio Palesi 32

DSP Processors

Characteristic features of DSP processors
Addressing modes
Memory architectures

Brief overview of the TMS320C3x family
General features
Addressing modes
ISA overview
Assembly programming
FIR filters
Matrix-Vector multiplication

Maurizio Palesi 33

Characteristics of DSP Processors

DSP processors are mostly designed with the same few basic operations in mind
They share the same set of basic characteristics
Specialised high speed arithmetic
Data transfer to and from the real world
Multiple access memory architectures

Maurizio Palesi 34

Characteristics of DSP Processors

The basic DSP operations

Additions and multiplications
Fetch two operands
Perform the addition or multiplication (usually both)
Store the result or hold it for a repetition

Delays
Hold a value for later use

Array handling
Fetch values from consecutive memory locations
Copy data from memory to memory

[Figure: 3-tap filter structure - the input x passes through delays Z^-1 and Z^-2, the three taps are weighted by c[0], c[1], c[2] and summed to give the output y]
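A minimal C sketch of the three-tap structure in the figure, under illustrative names: the two delays (Z^-1, Z^-2) are kept in static variables and each output sample is a short multiply & accumulate:

static float d1 = 0.0f, d2 = 0.0f;        /* delayed input samples (Z^-1, Z^-2) */

float filter_sample(float x, const float c[3])
{
    float y = c[0] * x + c[1] * d1 + c[2] * d2;   /* parallel multiply/add */
    d2 = d1;                                      /* shift the delay line  */
    d1 = x;
    return y;
}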

Maurizio Palesi 35

Characteristics of DSP Processors

To suit these fundamental operations, DSP processors often have
Parallel multiply and add
Multiple memory accesses (to fetch two operands and store the result)
Lots of registers to hold data temporarily
Efficient address generation for array handling
Special features such as delays or circular addressing

Maurizio Palesi 36

Address Generation

The ability to generate new addresses efficiently is a characteristic feature of DSP processors
Usually, the next needed address can be generated during the data fetch or store operation, with no overhead
DSP processors have rich sets of address generation operations

*rP       register indirect          Read the data pointed to by the address in register rP
*rP++     postincrement              Having read the data, postincrement the address pointer to point to the next value in the array
*rP--     postdecrement              Having read the data, postdecrement the address pointer to point to the previous value in the array
*rP++rI   register postincrement     Having read the data, postincrement the address pointer by the amount held in register rI, to point to rI values further down the array
*rP++rIr  bit reversed               Having read the data, postincrement the address pointer to point to the next value in the array, as if the address bits were in bit-reversed order

Maurizio Palesi 37

Bit Reversed Addressing

DSPs are tightly targeted at a small number of algorithms
It is striking that an addressing mode has been specifically defined for just one application (the FFT)

Addresses generated by a radix-2 FFT (natural order -> bit-reversed order):
0 (000) -> 0 (000)
1 (001) -> 4 (100)
2 (010) -> 2 (010)
3 (011) -> 6 (110)
4 (100) -> 1 (001)
5 (101) -> 5 (101)
6 (110) -> 3 (011)
7 (111) -> 7 (111)

Without special support, such address transformations would
Take an extra memory access to get the new address
Involve a fair amount of logical instructions
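A minimal C sketch of what the bit-reversed addressing mode computes in hardware, for the 3-bit (8-point) table above; without such support, this loop of logical instructions would be paid on every access:

unsigned bit_reverse(unsigned index, unsigned bits)
{
    unsigned r = 0;
    for (unsigned b = 0; b < bits; b++) {
        r = (r << 1) | (index & 1);   /* move the LSB of index into r */
        index >>= 1;
    }
    return r;                         /* e.g. bit_reverse(3, 3) == 6  */
}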

Maurizio Palesi 38

Memory Addressing

As DSP programmers migrate toward larger programs, they are more attracted to compilers
Such compilers are not able to fully exploit these specific addressing modes
The DSP community routinely uses library routines, so programmers may benefit even if they write at a high level

Addressing mode                                                              Percent
Immediate                                                                    30.02%
Displacement                                                                 10.82%
Register indirect                                                            17.42%
Direct                                                                       11.99%
Autoincrement, postincrement                                                 18.84%
Autoincrement, preincrement with 16-bit immediate                             0.77%
Autoincrement, preincrement with circular addressing                          0.08%
Autoincrement, postincrement by contents of AR0                               1.54%
Autoincrement, postincrement by contents of AR0, with circular addressing     2.15%
Autodecrement, postdecrement                                                  6.08%

The first five simple modes account for ~90% of all uses

Maurizio Palesi 39

DSP Processors: Input/Output

DSP is mostly dealing with the real world
Communication with an overall system controller
Signals coming in and going out
Communication with other DSP processors

[Diagram: a DSP connected to a system controller, to other DSPs, and to Signal In / Signal Out]

Maurizio Palesi 40

DSP Evolution

When DSP processors first came out, they were rather fast processors
The first floating point DSP, the AT&T DSP32, ran at 16 MHz at a time when PC clocks were 5 MHz
A fashionable demonstration at the time was to plug a DSP board into a PC and run a fractal (Mandelbrot) calculation on the DSP and on the PC side by side
The DSP fractal was of course faster

Today...
The fastest DSP processor is the Texas TMS320C6201, which runs at 200 MHz
This is no longer very fast compared with an entry-level PC
...and the same fractal today will actually run faster on the PC than on the DSP!

But...
Try feeding eight channels of high quality audio data in and out of a Pentium simultaneously in real time, without impacting the processor performance

Maurizio Palesi 41

Signals

Signals are usually handled by high speed synchronous serial ports
Serial ports are inexpensive, having only two or three wires
They are well suited to audio or telecommunications data rates, up to 10 Mbit/s
They usually operate under DMA: data presented at the port is automatically written into DSP memory without stopping the DSP

Maurizio Palesi 42

Host Communications

Many systems will have another, general purpose, processor to supervise the DSP
For example, the DSP might be on a PC plug-in card
Whereas signals tend to be continuous, host communication tends to require data transfer in batches, for instance to download a new program or to update filter coefficients

Some DSP processors have dedicated host ports
The Lucent DSP32C has a host port which is effectively an 8-bit or 16-bit ISA bus
The Motorola DSP56301 and the Analog Devices ADSP21060 have host ports which implement the PCI bus

Maurizio Palesi 43

Interprocessor Communications

Interprocessor communication is needed when a DSP application is too much for a single processor
The Texas TMS320C40 and the Analog Devices ADSP21060 both have six link ports
Ideally these would be parallel ports at the word length of the processor, but this would use up too many pins
A hybrid called serial/parallel is used instead
On the 'C40, comm ports are 8 bits wide and it takes four transfers to move one 32-bit word
On the 21060, link ports are 4 bits wide and it takes 8 transfers to move one 32-bit word

Maurizio Palesi 44

Memory Architectures

Additions and multiplications require us to
Fetch two operands
Perform the addition or multiplication (usually both)
Store the result or hold it for a repetition

To fetch the two operands in a single instruction cycle, we need to be able to make two memory accesses simultaneously
Plus one access to write back the result
Plus one access to fetch the instruction itself

Maurizio Palesi 45

Memory Architectures

There are two common methods to achieve multiple memory accesses per instruction cycle
Harvard architecture
Modified von Neumann architecture

Maurizio Palesi 46

Harvard Architecture

DSP operations usually involve at least two operands
DSP Harvard architectures usually permit the program bus to be used also for access of operands
It is often necessary to fetch the instruction too, and the plain Harvard architecture is inadequate to support this

Super Harvard architecture (SHARC)
DSP Harvard architectures often also include an instruction cache, leaving both Harvard buses free for fetching operands

[Diagram: DSP with separate Program and Data memories and buses]

Maurizio Palesi 47

Modified von Neumann Architectures

The Harvard architecture requires two memory buses
This makes it expensive to bring off the chip

Even the simplest DSP operation requires four memory accesses (three to fetch the two operands and the instruction, plus a fourth to write the result)
This exceeds the capabilities of a Harvard architecture

Some processors get around this by using a modified von Neumann architecture

Maurizio Palesi 48

Modified von Neumann Architectures

The modified von Neumann architecture allows multiple memory accesses per instruction by running the memory clock faster than the instruction cycle

The Lucent DSP32C runs with an 80 MHz clock
This is divided by four to give 20 MIPS
The memory clock runs at the full 80 MHz
Each instruction cycle is divided into four 'machine states' and a memory access can be made in each machine state

[Diagram: DSP with a single memory holding Program & Data]

Maurizio Palesi 49

Example Processor

[Diagram: a generic DSP highlighting address generation, lots of registers, efficient I/O, parallel multiply/add and multiple memories, connected to a system controller, other DSPs, and Signal In / Signal Out]

Maurizio Palesi 50

Example Processor: Lucent DSP32C

Address bus: 24 bits
Data bus: 32 bits
40-bit accumulators
22 x 24-bit registers, which also serve for integer arithmetic
Modified von Neumann architecture

Maurizio Palesi 51

Example Processor: Analog Devices ADSP21060

Data address bus: 32 bits
Data bus: 40 bits
Program address bus: 24 bits
Program data bus: 48 bits
Two serial ports
Six link ports
Super Harvard architecture

Maurizio Palesi 52

Data Formats

DSP processors store data in fixed or floating point formats
With fixed point, the programmer has to make some decisions
If a fixed point number becomes too large for the available word length, he has to scale the number down, by shifting it to the right
If a fixed point number is small, he has to scale the number up, in order to use more of the available word length

Integer (8-bit example)
Bits:    0     1     0     1     0     0     1     1
Weights: -2^7  2^6   2^5   2^4   2^3   2^2   2^1   2^0
Value = 2^6 + 2^4 + 2^1 + 2^0 = 83

Fixed point (8-bit example)
Bits:    0     1     0     1     0     0     0     0
Weights: -2^0  2^-1  2^-2  2^-3  2^-4  2^-5  2^-6  2^-7
Value = 2^-1 + 2^-3 = 0.5 + 0.125 = 0.625
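A minimal C sketch of the 8-bit fixed point format in the example above (one sign bit, seven fractional bits); the helper names are illustrative assumptions, and it is the programmer, not the hardware, who keeps track of the binary point:

#include <stdint.h>

#define Q7_ONE 128                    /* 2^7: position of the binary point */

int8_t float_to_q7(float x)  { return (int8_t)(x * Q7_ONE); }   /* 0.625 -> 0x50 */
float  q7_to_float(int8_t q) { return (float)q / Q7_ONE; }

int8_t q7_mul(int8_t a, int8_t b)     /* the raw product has 14 fraction bits, */
{                                     /* so shift right by 7 to rescale        */
    return (int8_t)(((int16_t)a * b) >> 7);
}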

Maurizio Palesi 53

Fixed Point

Fixed point can be thought of as just low-cost floating point
It does not include an exponent in every word
There is no hardware that automatically aligns and normalizes operands
The DSP programmer takes care to keep the exponent in a separate variable
Often this variable is shared by a set of fixed-point variables (block floating point)

Maurizio Palesi 54

Floating Point

The floating point format has the remarkable property of automatically scaling all numbers by moving, and keeping track of, the binary point, so that all numbers use the full word length available but never overflow

Example (mantissa plus 4-bit exponent):
Mantissa bits: 0 1 1 0 1 0 0 0 0   with weights -2^1, 2^0, 2^-1, 2^-2, ..., 2^-7
Mantissa = 2^0 + 2^-1 + 2^-3 = 1 + 0.5 + 0.125 = 1.625
Exponent bits: 0 1 1 0   with weights -2^3, 2^2, 2^1, 2^0
Exponent = 2^2 + 2^1 = 6
Decimal value = 1.625 × 2^6

Maurizio Palesi 55

Data Formats

In floating point, the hardware automatically scales and normalises every number
Errors due to truncation and rounding depend on the size of the number
These errors can be seen as a source of quantisation noise
The noise is then modulated by the size of the signal
The signal-dependent modulation of the noise is undesirable because it is audible
This is why the audio industry prefers fixed point DSP processors over floating point

Maurizio Palesi 56

Saturating Arithmetic

DSPs are often used in real-time applications
There is no exception on arithmetic overflow - the processor could miss an event
To support such an environment, DSP architectures use saturating arithmetic
If the result is too large to be represented, it is set to the largest representable number, instead of wrapping around as in normal two's complement arithmetic
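A minimal C sketch of saturating addition on 32-bit values, contrasting it with the wrap-around of normal two's complement arithmetic; the function name is illustrative:

#include <stdint.h>

int32_t sat_add32(int32_t a, int32_t b)
{
    int64_t wide = (int64_t)a + b;           /* compute in a wider type */
    if (wide > INT32_MAX) return INT32_MAX;  /* clamp positive overflow */
    if (wide < INT32_MIN) return INT32_MIN;  /* clamp negative overflow */
    return (int32_t)wide;                    /* in range: normal result */
}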

Maurizio Palesi 57

Programming a DSP Processor

A simple FIR filter program
Using pointers
Avoiding memory bottlenecks
Assembler programming

Maurizio Palesi 58

A Simple FIR Filter

The simple FIR filter equation is

y[n] = ∑_k c[k] × x[n−k]

which can be implemented quite directly in C:

y[n] = 0.0;
for (k = 0; k < N; k++)
    y[n] = y[n] + c[k] * x[n-k];

y[n] is accessed repeatedly inside the loop
Accessing by array index is inefficient
Arithmetic is needed to calculate the array index n-k

Maurizio Palesi 59

Problem in Addressing

Five operations are needed to calculate the address of the element x[n-k]:
Load the start address of the array in memory
Load the value of the index n
Load the value of the index k
Calculate the offset [n - k]
Add the offset to the start address of the array

Only after all five operations can the compiler actually read the array element

Maurizio Palesi 60

Using Pointers

y[n] = 0.0;
for (k = 0; k < N; k++)
    y[n] = y[n] + c[k] * x[n-k];

becomes

float *y_ptr, *c_ptr, *x_ptr;
y_ptr = &y[n];
/* as the figure implies, c_ptr starts at c[0] and x_ptr at x[n] */
for (k = 0; k < N; k++)
    *y_ptr = *y_ptr + *c_ptr++ * *x_ptr--;

[Figure: c_ptr, x_ptr and y_ptr pointing into the arrays c, x and y]

Maurizio Palesi 61

Using Pointers

float *y_ptr, *c_ptr, *x_ptr;
y_ptr = &y[n];
for (k = 0; k < N; k++)
    *y_ptr = *y_ptr + *c_ptr++ * *x_ptr--;

Each pointer still has to be initialised, but only once, before the loop
No arithmetic is required to calculate offsets
Using pointers is more efficient than array indices on any processor
It is especially efficient for DSP processors, where address increments often come for free

Maurizio Palesi 62

Using Pointers

*rP       register indirect          Read the data pointed to by the address in register rP
*rP++     postincrement              Having read the data, postincrement the address pointer to point to the next value in the array
*rP--     postdecrement              Having read the data, postdecrement the address pointer to point to the previous value in the array
*rP++rI   register postincrement     Having read the data, postincrement the address pointer by the amount held in register rI, to point to rI values further down the array

The address increments are performed in the same instruction as the data access to which they refer
They incur no overhead at all
Most DSP processors can perform two or three address increments for free in each instruction
So the use of pointers is crucially important for DSP processors

Maurizio Palesi 63

Limiting Memory Accesses

Each iteration of the loop below makes four memory accesses (one store and three loads)
Even without counting the need to load the instruction, this exceeds the capacity of a DSP processor
Fortunately, DSP processors have lots of registers

float *y_ptr, *c_ptr, *x_ptr;
y_ptr = &y[n];
for (k = 0; k < N; k++)
    *y_ptr = *y_ptr + *c_ptr++ * *x_ptr--;   /* store = load + load * load */

Maurizio Palesi 64

Limiting Memory Accesses

register float temp;

temp = 0.0;
for (k = 0; k < N; k++)
    temp = temp + *c_ptr++ * *x_ptr--;

The initialization of temp to 0.0 is wasted; fold the first product into the initialization instead:

register float temp;

temp = *c_ptr++ * *x_ptr--;
for (k = 1; k < N; k++)
    temp = temp + *c_ptr++ * *x_ptr--;

Maurizio Palesi 65

Compiler for DSPs

Despite the well documented advantages in programmer productivity and software maintenance, compiled code still lags far behind hand-written assembly on DSPs

TMS320C54 (C54) on DSPstone kernels - ratio of compiled code to assembly (>1 means slower / bigger):

Kernel               Execution time   Code space
Convolution               11.8            16.5
FIR                       11.5             8.7
Matrix 1x3                 7.7             8.1
FIR2dim                    5.3             6.5
Dot product                5.2            14.1
LMS                        5.1             0.7
N real update              4.7            14.1
IIR n biquad               2.4             8.6
N complex update           2.4             9.8
Matrix                     1.2             5.1
Complex update             1.2             8.7
IIR one biquad             1.0             6.4
Real update                0.8            15.6
C54 geometric mean         3.2             7.8

TMS320C6203 (C62) on EEMBC Telecom kernels - ratio of compiled code to assembly (>1 means slower / bigger):

Kernel                        Execution time   Code space
Convolution encoder               44.0             0.5
Fixed-point complex FFT           13.5             1.0
Viterbi GSM decoder               13.0             0.7
Fixed-point bit allocation         7.0             1.4
Autocorrelation                    1.8             0.7
C62 geometric mean                10.0             0.8

Maurizio Palesi 66

Introduction to the TMS320C3x

The TMS320C3x generation of DSPs are high performance 32-bit floating-point devices in the TMS320 family

Extensive internal busing
Powerful DSP instruction set
60 MFLOPS
High degree of on-chip parallelism
Up to 11 operations in a single instruction

Maurizio Palesi 67

General Features

General-purpose register file
Program cache
Dedicated auxiliary register arithmetic units (ARAUs)
Internal dual-access memories
Direct memory access (DMA)
Short machine-cycle time

Maurizio Palesi 68

Block Diagram

Maurizio Palesi 69

C3x Family

TMS320C30
4K ROM
2K RAM
Second serial port
Second external bus

TMS320C31
Low cost version
Boot loader program
2K RAM
Single serial port
Single external bus

TMS320C32
Enhanced version of the 'C3x family
Variable-width memory interface
Two-channel DMA coprocessor with configurable priorities
Relocatable interrupt vector table

TMS320C33
Low power
Boot loader program
34K RAM
Single serial port
Single external bus

Maurizio Palesi 70

Typical Applications (1/2)

Maurizio Palesi 71

Typical Applications (2/2)

Maurizio Palesi 72

Central Processing Unit (CPU)

The 'C3x devices have a register-based CPU architecture
The CPU consists of the following components
Floating-point/integer multiplier
Arithmetic logic unit (ALU)
32-bit barrel shifter
Internal buses (CPU1/CPU2 and REG1/REG2)
Auxiliary register arithmetic units (ARAUs)
CPU register file

Maurizio Palesi 73

Block diagram of the CPU

Maurizio Palesi 74

Single-cycle multiplications (33 ns at 30 MHz)
24-bit integer operands, 32-bit result
32-bit floating-point operands, 40-bit result

Maurizio Palesi 75

The ALU performs single-cycle operations on 32-bit integer, 32-bit logical, and 40-bit floating-point data
It also performs single-cycle integer and floating point conversions

The barrel shifter is used to shift up to 32 bits left or right in a single cycle

Maurizio Palesi 76

Four internal buses (CPU1, CPU2, REG1, and REG2) carry
Two operands from memory
Two operands from the register file

This allows parallel multiplies and adds/subtracts on four integer or floating-point operands in a single cycle

Maurizio Palesi 77

Two auxiliary register arithmetic units (ARAU0 and ARAU1) can generate two addresses in a single cycle
The ARAUs operate in parallel with the multiplier and ALU

They support addressing with
Displacements
Index registers (IR0 and IR1)
Circular addressing
Bit-reversed addressing

Maurizio Palesi 78

28 registers in a multiport register file

All of the primary registers can be
Operated upon by the multiplier and ALU
Used as general-purpose registers

The registers also have some special functions
The eight extended-precision registers are especially suited for maintaining extended-precision floating-point results
The eight auxiliary registers support a variety of indirect addressing modes and can be used as general-purpose 32-bit integer and logical registers
The remaining registers provide such system functions as addressing, stack management, processor status, interrupts, and block repeat

Maurizio Palesi 79

RAM, ROM, and Cache

Each RAM and ROM block is capable of supporting two CPU accesses in a single cycle
E.g., in a single cycle
The CPU can access two data values in one RAM block
The CPU can perform an external program fetch
A DMA transfer can load the other RAM block

Maurizio Palesi 80

Peripherals

Timers
The two timer modules are general-purpose 32-bit timer/event counters

Serial ports
The bidirectional serial ports are totally independent
Each serial port can be configured to transfer 8, 16, 24, or 32 bits of data per word

Maurizio Palesi 81

Direct Memory Access (DMA)

The DMA controller can read/write any location in the memory map without interfering with the CPU operation
Dedicated DMA address and data buses minimize conflicts between the CPU and the DMA controller
When the CPU and DMA access the same resources, priorities must be established
CPU priority
DMA priority
Rotating priority

Maurizio Palesi 82

Extended Precision Registers (R7-R0)

They can store and support operations on 32-bit integer and 40-bit floating-point numbers

Maurizio Palesi 83

Auxiliary Registers (AR7-AR0)

The CPU can access the
Eight 32-bit auxiliary registers (AR7-AR0)
Two auxiliary register arithmetic units (ARAUs)

The primary function of the auxiliary registers is the generation of 24-bit addresses
They can also operate as loop counters in indirect addressing, or as 32-bit general purpose registers that can be modified by the multiplier and ALU

Maurizio Palesi 84

Other Registers

Data-page pointer (DP)
The eight LSBs of the data-page pointer are used by the direct addressing mode as a pointer to the page of data being addressed
Data pages are 64K words long, with a total of 256 pages

Index registers (IR1, IR0)
Used by the ARAU for indexing the address

Block size register (BK)
Used by the ARAU in circular addressing to specify the data block size

System-stack pointer (SP)
Contains the address of the top of the system stack
SP always points to the last element pushed onto the stack
SP is manipulated by interrupts, traps, calls, returns, and the PUSH, PUSHF, POP, and POPF instructions

Maurizio Palesi 85

Status Register (ST)

Contains global information about the state of the CPU
Operations usually set the condition flags of the status register according to whether the result is 0, negative, etc.

Fields include:
Global interrupt enable
Clear cache
Cache enable
Cache freeze
Repeat mode
Overflow mode
Latched floating-point underflow
Latched overflow
Floating-point underflow
Negative
Zero
Overflow
Carry

Maurizio Palesi 86

Repeat Counter (RC) and Block Repeat (RS, RE)

RC is a 32-bit register that specifies the number of times a block of code is to be repeated when a block repeat is performed
If RC = n, the loop is executed n+1 times

The RS register contains the starting address of the program-memory block to be repeated when the CPU is operating in repeat mode
The RE register contains the ending address of the program-memory block to be repeated when the CPU is operating in repeat mode

Instruction Cache (1/2)

64 x 32-bit instruction cache
2-way set associative
LRU replacement policy

It allows the use of slow external memories while still achieving single-cycle access performance
The cache also frees the external buses from program fetches, so that they can be used by the DMA or other system elements

Maurizio Palesi 88

Instruction Cache (2/2)

Maurizio Palesi 89

Addressing Modes

Five types of addressing
Register addressing
Direct addressing
Indirect addressing
Immediate addressing
PC-relative addressing

Plus two specialized addressing modes
Circular addressing
Bit-reversed addressing

Maurizio Palesi 90

Register Addressing

A CPU register contains the operand
Any CPU register can be used (R0-R7, AR0-AR7, DP, IR0, IR1, ...)

ABSF R1    ; R1 = |R1|

Maurizio Palesi 91

Direct Addressing

The data address is formed by the concatenation of the eight LSBs of the data-page pointer (DP) with the 16 LSBs of the instruction word (expr)
This results in 256 pages of 64K words each

Maurizio Palesi 92

Direct Addressing

ADDI @0BCDEh, R7

Before instruction:
DP = 8Ah
R7 = 00 0000 0000h
Data memory at 8ABCDEh = 1234 5678h

After instruction:
DP = 8Ah
R7 = 00 1234 5678h
Data memory at 8ABCDEh = 1234 5678h
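A minimal C sketch of how the example above forms its 24-bit data address, concatenating the 8 LSBs of DP with the 16-bit expr field of the instruction word; the variable names are illustrative:

unsigned long direct_address(unsigned dp, unsigned expr)
{
    return ((unsigned long)(dp & 0xFF) << 16) | (expr & 0xFFFF);
}

/* direct_address(0x8A, 0xBCDE) == 0x8ABCDE, as in the ADDI example */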

Maurizio Palesi 93

Indirect Addressing

Specifies the address of an operand in memory through the contents of
An auxiliary register
Optional displacements
Index registers

The auxiliary register arithmetic units (ARAUs) perform the unsigned arithmetic

Maurizio Palesi 94

Indirect Addressing

Indirect addressing with displacement

Indirect Addressing

Indirect addressing with index register IR0

Indirect Addressing

Indirect addressing (special cases)

Indirect Addressing - Example

Indirect addressing with predisplacement add
*+ARn(disp)

Maurizio Palesi 98

Indirect Addressing - Example

Indirect addressing with predisplacement add and modify
*++ARn(disp)

Maurizio Palesi 99

Immediate Addressing

The operand is a 16-bit (short) or 24-bit (long) immediate value contained in the instruction word
Depending on the data type assumed by the instruction, the immediate operand can be a 2's-complement integer, an unsigned integer, or a floating-point number

SUB 1, R0

Before instruction: R0 = 00 0000 0000
After instruction:  R0 = 00 FFFF FFFF

Maurizio Palesi 100

PC-relative Addressing

It adds the contents of the 16 or 24 LSBs of the instruction word to the PC register
The assembler takes the src (a label or address) specified by the user and generates a displacement
The displacement is equal to [src - (instruction address + 1)]

BU Label    ; pc = 1001h, Label = 1005h --> displacement = 3

Before instruction (decode phase): PC = 1002h
After instruction (execution phase): PC = 1005h

Maurizio Palesi 101

Circular Addressing

Many DSP algorithms, such as convolution and correlation, require a circular buffer in memory
In convolution and correlation, the circular buffer acts as a sliding window that contains the most recent data to process
As new data is brought in, the new data overwrites the oldest data

[Figure: logical (ring) and physical (linear, with Start and End) representations of a circular buffer]

Maurizio Palesi 102

Circular Addressing

[Figure: logical and physical representations of the circular buffer after writing value0, value1, value2]

Maurizio Palesi 103

Circular Addressing

[Figure: logical and physical representations of the circular buffer after writing value0 through value7]

Maurizio Palesi 104

Implementation

BK holds the length of the circular buffer (16 bits, < 64K)
The K LSBs of the start address of the buffer must be 0, where K is the smallest integer such that 2^K > buffer length

Length of buffer   BK register value   Starting address of buffer (24 bits)
31                 31                  XXXXXXXXXXXXXXXXXXX00000
32                 32                  XXXXXXXXXXXXXXXXXX000000
1024               1024                XXXXXXXXXXXXX00000000000
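A minimal C sketch of the alignment rule in the table, under an illustrative helper name: K is the smallest integer with 2^K > buffer length, and the K LSBs of the start address must be zero:

unsigned circular_align_bits(unsigned buffer_length)
{
    unsigned k = 0;
    while ((1u << k) <= buffer_length)   /* stop at the first 2^K > length */
        k++;
    return k;                            /* 31 -> 5, 32 -> 6, 1024 -> 11   */
}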

Maurizio Palesi 105

Algorithm for Circular Addressing

/* index is the offset within the buffer, step the signed address update,
   BK the buffer length */
if (0 <= index + step && index + step < BK)
    index = index + step;
else if (index + step >= BK)
    index = index + step - BK;
else
    index = index + step + BK;

[Figure: circular buffer with Start, End, buffer length (BK) and the current index]

Maurizio Palesi 106

Circular Addressing - Example

*ARn++(disp)%    ; addr = ARn; ARn = circ(ARn + disp)

; AR0 is 0; BK is 6
*AR0%
*AR0++(5)%       ; Now AR0 is circ(0+5) = 5
*AR0++(2)%       ; Now AR0 is circ(5+2) = 1
*AR0--(3)%       ; Now AR0 is circ(1-3) = 4
*AR0++(6)%       ; Now AR0 is circ(4+6) = 4
*AR0--%          ; Now AR0 is circ(4-1) = 3

[Figure: memory locations at addresses 0 to 8; the circular buffer of length 6 occupies addresses 0 to 5]
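A minimal C sketch of the circular index update applied in this example (BK = 6, AR0 starting at 0); the function name is illustrative, and it reproduces the sequence 5, 1, 4, 4, 3 shown above:

int circ(int index, int step, int bk)
{
    int next = index + step;
    if (next >= bk)     next -= bk;   /* wrapped past the end of the buffer */
    else if (next < 0)  next += bk;   /* wrapped before the start           */
    return next;
}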

Maurizio Palesi 107

ISA Overview

The instruction set contains 113 instructions
Load and store
2-operand arithmetic/logical
3-operand arithmetic/logical
Program control
Interlocked operations
Parallel operations

Maurizio Palesi 108

Load & Store

The 'C3x supports 13 load and store instructions
Load a word from memory into a register
Store a word from a register into memory
Manipulate data on the system stack

Maurizio Palesi 109

2-Operand Instructions

The 'C3x supports 35 2-operand arithmetic and logical instructions
The two operands are the source and destination
The source operand can be
A memory word
A register
Part of the instruction word
The destination operand is always a register

Maurizio Palesi 110

2-Operand Instructions

Maurizio Palesi 111

3-Operand Instructions

3-operand instructions have two source operands and a destination operand
A source operand can be
A memory word
A register
The destination is always a register

Maurizio Palesi 112

Program-Control Instructions

The program-control instruction group consists of all of the instructions that affect program flow

Maurizio Palesi 113

Low-Power Control Instructions

The low-power control instruction group consists of 3 instructions that affect the low-power modes

Maurizio Palesi 114

Interlocked-Operations Instructions

The five interlocked-operations instructions support multiprocessor communication and use external signals to provide powerful synchronization mechanisms
They also ensure the integrity of the communication while keeping the operation fast

Maurizio Palesi 115

Parallel Operations

The 13 parallel-operations instructions make a high degree of parallelism possible
Some of the 'C3x instructions can occur in pairs that are executed in parallel
Parallel loading of registers
Parallel arithmetic operations
Arithmetic/logical instructions used in parallel with a store instruction

Maurizio Palesi 116

Parallel Operations

Parallel arithmetic with store instructions (and many others)

Maurizio Palesi 117

Parallel Operations

Parallel load instructions
Parallel multiply and add/subtract instructions

Maurizio Palesi 118

Repeat Modes

The repeat modes can implement zero-overhead looping
For many algorithms, most execution time is spent in an inner kernel of code
Two instructions are provided

RPTB repeats a block of code
Repeats execution of a block of code a specified number of times

RPTS repeats a single instruction
Fetches a single instruction once and then repeats its execution a number of times
Since the instruction is fetched only once, bus traffic is minimized

Maurizio Palesi 119

Repeat Mode Registers

RS - Repeat start-address register
Holds the address of the first instruction of the block of code to be repeated

RE - Repeat end-address register
Holds the address of the last instruction of the block of code to be repeated (RE >= RS)

RC - Repeat-counter register
Contains 1 less than the number of times the block remains to be repeated
For example, to execute a block n times, load n-1 into RC

Maurizio Palesi 120

Branches

Standard branches
Delayed branches
Conditional delayed branches

Maurizio Palesi 121

Standard Branches

Standard branches empty the pipeline before performing the branch
This results in a 'C3x branch taking four cycles
Included in this class are repeats, calls, returns, and traps

Maurizio Palesi 122

Delayed Branches

Delayed branches do not empty the pipeline
They execute the next three instructions before the program counter is modified by the branch
This results in a branch that requires only a single cycle

Maurizio Palesi 123

Conditional Delayed Branches

Use the conditions that exist at the end of the instruction immediately preceding the delayed branch

They do not depend on the instructions following the delayed branch

Maurizio Palesi 124

Calls, Traps, and Returns

Calls and traps provide a means of executing a subroutine or function while providing a return to the calling routine
Call and trap instructions store the value of the PC on the stack before changing the PC's contents
Return instructions use the value on the stack to return execution from traps and calls
Functionally, calls and traps accomplish the same task: a subfunction is called and executed, and control is then returned to the calling function

In traps
Interrupts are automatically disabled when a trap is executed
This allows critical code to execute without the risk of being interrupted
Traps are generally terminated with a RETI instruction to re-enable interrupts

Maurizio Palesi 125

Examples

FIR Filter
Matrix-Vector Multiplication

Maurizio Palesi 126

Data Structure for FIR Filters

Circular addressing is especially useful for the implementation of FIR filters

[Figure: two circular buffers - AR0 points into the impulse response buffer h(0), h(1), h(2), ..., h(N-3), h(N-2), h(N-1), and AR1 points into the input sample buffer x(0), x(1), x(2), ..., x(N-3), x(N-2), x(N-1)]

Maurizio Palesi 127

FIR Filter Code

* Impulse response
        .sect   "Impulse_Resp"
H       .float  1.0
        .float  0.99
        .float  0.95
        ...
        .float  0.1

* Input buffer
X       .usect  "Input_Buf",128

        .data
HADDR   .word   H
XADDR   .word   X
N       .word   128

[Figure: memory map - the Impulse_Resp section holds the coefficients 1.0, 0.99, ..., 0.1 starting at H; the Input_Buf section reserves 128 uninitialised words starting at X; the .data section holds HADDR, XADDR and N]

Maurizio Palesi 128

FIR Filter Code (cont'd)

* Initialization
        LDP     HADDR
        LDI     @N,BK           ; Load block size
        LDI     @HADDR,AR0      ; Load pointer to impulse response
        LDI     @XADDR,AR1      ; Load pointer to input samples
TOP     LDF     IN,R3           ; Read input sample
        STF     R3,*AR1++%      ; Store the sample
        LDF     0,R0            ; Initialize R0
        LDF     0,R2            ; Initialize R2
* Filter
        RPTS    N-1             ; Repeat next instruction
        MPYF3   *AR0++%,*AR1++%,R0
||      ADDF3   R0,R2,R2        ; MAC
        ADDF    R0,R2           ; Last product accumulated
        STF     R2,Y            ; Save result
        B       TOP             ; Repeat

Maurizio Palesi 129

Matrix-Vector Multiplication

[P] (Kx1) = [M] (KxN) x [V] (Nx1)

for (i = 0; i < K; i++) {
    p[i] = 0;
    for (j = 0; j < N; j++)
        p[i] = p[i] + m[i][j] * v[j];
}

Maurizio Palesi 130

Matrix-Vector Multiplication

Data memory organization

Maurizio Palesi 131

Matrix-Vector Multiplication

* AR0 : address of M(0,0)
* AR1 : address of V(0)
* AR2 : address of P(0)
* AR3 : number of rows - 1 (K-1)
* R1  : number of columns - 2 (N-2)

MAT     LDI     R1,IR0                    ; Number of columns - 2 -> IR0
        ADDI    2,IR0                     ; Number of columns -> IR0
ROWS    LDF     0.0,R2                    ; Initialize R2
        MPYF3   *AR0++(1),*AR1++(1),R0    ; m(i,0) * v(0) -> R0
        RPTS    R1                        ; Multiply a row by a column
        MPYF3   *AR0++(1),*AR1++(1),R0    ; m(i,j) * v(j) -> R0
||      ADDF3   R0,R2,R2                  ; m(i,j-1) * v(j-1) + R2 -> R2
        SUBI    1,AR3
        BNZD    ROWS                      ; Counts the no. of rows left (delayed branch)
        ADDF    R0,R2                     ; Last accumulate           (delay slot)
        STF     R2,*AR2++(1)              ; Result -> p(i)            (delay slot)
        NOP     *--AR1(IR0)               ; Set AR1 to point to v(0)  (delay slot)

Maurizio Palesi 132

C Programming Tips

After writing your application in C, debug the program and determine whether it runs efficiently
If the program does not run efficiently
Use the optimizer with the -o2 or -o3 options when compiling
Use registers to pass parameters (-ms compiling option)
Use inlining (-x compiling option)
Remove the -g option when compiling
Follow some of the efficient code generation tips
Use register variables for often-used variables
Precompute subexpressions
Use *++ to step through arrays
Use structure assignments to copy blocks of data

Maurizio Palesi 133

Use Register Variables

Exchange one object in memory with another

register float *src, *dest, temp;

do {
    temp  = *++src;
    *src  = *++dest;
    *dest = temp;
} while (--n);

Maurizio Palesi 134

Precompute Subexpressions and Use *++

main() {                               /* 19 cycles */
    float a[10], b[10];
    int i;
    for (i = 0; i < 10; ++i)
        a[i] = (a[i] * 20) + b[i];
}

main() {                               /* 12 cycles */
    float a[10], b[10];
    int i;
    register float *p = a, *q = b;
    for (i = 0; i < 10; ++i)
        *p++ = (*p * 20) + *q++;
}

Maurizio Palesi 135

Structure Assignments

The compiler generates very efficient code for structure assignments
Nest objects within structures and use simple assignments to copy them

int x1, y1, c1;
int x2, y2, c2;

x1 = x2;
y1 = y2;
c1 = c2;

struct Pixel {
    int x, y, c;
};

struct Pixel p1, p2;

p1 = p2;

Maurizio Palesi 136

Hints for Assembly Coding

Use delayed branches
Delayed branches execute in a single cycle, while regular branches execute in four cycles
The next three instructions are executed whether the branch is taken or not
If fewer than three instructions are required, use the delayed branch and append NOPs - a reduction in machine cycles still occurs

Maurizio Palesi 137

Hints for Assembly Coding

Apply the repeat single/block constructs
In this way, loops are achieved with no overhead
Note that with the RPTS instruction the executed instruction is not refetched for execution, which frees the buses for operand fetches

Maurizio Palesi 138

Hints for Assembly Coding

Use parallel instructions
Maximize the use of registers
Use the cache
Use internal memory instead of external memory
Avoid pipeline conflicts