Date posted: 28-Dec-2015
Uploaded by: octavia-cannon
Chapter 2: TMS320C6000 Architectural Overview
DSP Lecture 02
2
Learning Objectives
• Describe the C6000 CPU architecture.
• Introduce some basic instructions.
• Describe the C6000 memory map.
• Provide an overview of the peripherals.
• Performance
• Software Development
3
General DSP System Block Diagram
[Block diagram showing: Peripherals, Central Processing Unit, Internal Memory, Internal Buses, External Memory]
4
Implementation of Sum of Products (SOP)
It has been shown in Chapter 1 that SOP is the key element for most DSP algorithms.
So let’s write the code for this algorithm and at the same time discover the C6000 architecture.

$$Y = \sum_{n=1}^{N} a_n * x_n = a_1 * x_1 + a_2 * x_2 + \dots + a_N * x_N$$

Two basic operations are required for this algorithm:
(1) Multiplication
(2) Addition
Therefore two basic instructions are required.
5
Implementation of Sum of Products (SOP)
So let’s implement the SOP algorithm!
The implementation in this module will be done in assembly.
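Before moving to assembly, it can help to have a reference result to compare against. The following is a minimal Python sketch of the SOP equation, not part of the original lecture; the function name and the sample values are hypothetical.

```python
def sum_of_products(a, x):
    """Return Y = a[0]*x[0] + a[1]*x[1] + ... for two equal-length sequences."""
    if len(a) != len(x):
        raise ValueError("a and x must have the same number of taps")
    y = 0                      # accumulator, plays the role of Y
    for a_n, x_n in zip(a, x):
        prod = a_n * x_n       # (1) multiplication
        y = y + prod           # (2) addition
    return y


coefficients = [1, 2, 3]       # hypothetical a_n values
samples = [4, 5, 6]            # hypothetical x_n values
print(sum_of_products(coefficients, samples))   # 1*4 + 2*5 + 3*6 = 32
```

The two statements inside the loop correspond to the two basic operations listed above, which is why two instructions are enough for the core of the algorithm.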
6
Multiply (MPY)
The multiplication of a1 by x1 is done in assembly by the following instruction:
        MPY   a1, x1, Y
This instruction is performed by a multiplier unit that is called “.M”.
7
Multiply (.M unit)
$$Y = \sum_{n=1}^{40} a_n * x_n$$

The .M unit performs multiplications in hardware.
        MPY   .M   a1, x1, Y
Note: a 16-bit by 16-bit multiplier provides a 32-bit result.
A 32-bit by 32-bit multiplier provides a 64-bit result.
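The note about result widths can be checked numerically. The sketch below is an illustration only: it assumes signed operands taken from the low 16 bits of each source, and the helper names `sign_extend` and `mpy16` are hypothetical.

```python
def sign_extend(value, bits):
    """Interpret the low `bits` bits of value as a two's-complement number."""
    value &= (1 << bits) - 1
    sign_bit = 1 << (bits - 1)
    return value - (1 << bits) if value & sign_bit else value


def mpy16(src1, src2):
    """Signed 16-bit x 16-bit multiply, returned as a 32-bit register image."""
    product = sign_extend(src1, 16) * sign_extend(src2, 16)
    return product & 0xFFFFFFFF


# The two extreme operand pairs: both products still fit in 32 bits.
print(hex(mpy16(0x7FFF, 0x7FFF)))   # 0x3fff0001
print(hex(mpy16(0x8000, 0x8000)))   # 0x40000000
```

The largest magnitude a 16-bit by 16-bit product can reach is 2^30, so 32 bits always suffice.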
8
Addition (.?)
        MPY   .M   a1, x1, prod
        ADD   .?   Y, prod, Y
9
Add (.L unit)
        MPY   .M   a1, x1, prod
        ADD   .L   Y, prod, Y
RISC processors such as the C6000 use registers to hold the operands, so let’s change this code.
10
Register File - A
[Diagram: Register File A (registers A0 to A15, each 32 bits wide) holding a1, x1, prod and Y, next to the .M and .L units]
        MPY   .M   a1, x1, prod
        ADD   .L   Y, prod, Y
Let us correct this by replacing a, x, prod and Y by the registers as shown above.
11
Specifying Register Names
        MPY   .M   A0, A1, A3
        ADD   .L   A4, A3, A4
The registers A0, A1, A3 and A4 contain the values to be used by the instructions.
Register File A contains 16 registers (A0 - A15) which are 32 bits wide.
13
Data loading
Q: How do we load the operands into the registers?
14
Load Unit “.D”
A: The operands are loaded into the registers by loading them from the memory using the .D unit.
It is worth noting at this stage that the only way to access memory is through the .D unit.
[Diagram: the .D unit sits between Data Memory and Register File A]
16
Load Instruction
Q: Which instruction(s) can be used for loading operands from the memory to the registers?
17
Load Instructions (LDB, LDH, LDW, LDDW)
A: The load instructions.
18
Using the Load Instructions
Before using the load unit you have to be aware that this processor is byte addressable, which means that each byte is represented by a unique address.
Also the addresses are 32 bits wide.
[Memory diagram: a 16-bit wide data column with addresses 00000000, 00000002, 00000004, 00000006, 00000008, ..., FFFFFFFF]
19
Using the Load Instructions
The syntax for the load instruction is:
        LD *Rn,Rm
Where:
Rn is a register that contains the address of the operand to be loaded
and
Rm is the destination register.
The question now is how many bytes are going to be loaded into the destination register?
The answer is that it depends on the instruction you choose:
• LDB: loads one byte (8-bit)
• LDH: loads a half word (16-bit)
• LDW: loads a word (32-bit)
• LDDW: loads a double word (64-bit)
Note: LD on its own does not exist.
22
Using the Load Instructions
[Memory diagram: 16-bit wide data memory. The bytes at addresses 0x4 through 0xB hold 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8; the two half words below address 0x4 hold the values 0xA, 0xB and 0xC, 0xD]
Example:
If we assume that A5 = 0x4 then:
(1) LDB  *A5, A7      ; gives A7 = 0x00000001
(2) LDH  *A5, A7      ; gives A7 = 0x00000201
(3) LDW  *A5, A7      ; gives A7 = 0x04030201
(4) LDDW *A5, A7:A6   ; gives A7:A6 = 0x0807060504030201
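The four results in the example can be reproduced with a small model of byte-addressable memory. This is a sketch under stated assumptions rather than a description of the hardware: the little-endian layout is inferred from the quoted results, sign extension is ignored (it does not affect these positive values), and the names `MEMORY`, `load`, `ldb`, `ldh`, `ldw` and `lddw` are hypothetical.

```python
# Bytes 0x01..0x08 stored at addresses 0x4..0xB, one byte per address.
MEMORY = {0x4 + offset: value for offset, value in enumerate(range(0x01, 0x09))}


def load(address, num_bytes):
    """Assemble num_bytes consecutive bytes; lowest address is least significant."""
    result = 0
    for offset in range(num_bytes):
        result |= MEMORY[address + offset] << (8 * offset)
    return result


def ldb(address):
    return load(address, 1)    # byte (8-bit)


def ldh(address):
    return load(address, 2)    # half word (16-bit)


def ldw(address):
    return load(address, 4)    # word (32-bit)


def lddw(address):
    return load(address, 8)    # double word (64-bit), i.e. the A7:A6 pair


a5 = 0x4
for name, instruction in (("LDB", ldb), ("LDH", ldh), ("LDW", ldw), ("LDDW", lddw)):
    print(name, hex(instruction(a5)))
```

The same pointer value yields four different register contents, which is why the size has to be part of the instruction mnemonic.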
27
Using the Load Instructions
Question:
If data can only be accessed by the load instruction and the .D unit, how can we load the register pointer Rn in the first place?
28
Loading the Pointer Rn
• The instruction MVKL will allow a move of a 16-bit constant into a register as shown below:
        MVKL   .?   a, A5
        (‘a’ is a constant or label)
• How many bits represent a full address?
        32 bits
• So why does the instruction not allow a 32-bit move?
        All instructions are 32 bits wide (see instruction opcode), so a 32-bit constant cannot fit alongside the opcode and register fields.
29
Loading the Pointer Rn
• To solve this problem another instruction is available:
        MVKH
  e.g.  MVKH   .?   a, A5
        (‘a’ is a constant or label)
[Diagram: the constant a is split into a high half ah and a low half al; MVKH places ah in the upper half of A5]
• Finally, to move the 32-bit address to a register we can use:
        MVKL   a, A5
        MVKH   a, A5
30
Loading the Pointer Rn
• Always use MVKL then MVKH, look at the following examples:

Example 1
        A5 = 0x87654321
        MVKL   0x1234FABC, A5    ; A5 = 0xFFFFFABC (sign extension)
        MVKH   0x1234FABC, A5    ; A5 = 0x1234FABC ; OK

Example 2 (starting again from A5 = 0x87654321)
        MVKH   0x1234FABC, A5    ; A5 = 0x12344321
        MVKL   0x1234FABC, A5    ; A5 = 0xFFFFFABC ; Wrong
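The ordering rule follows directly from what each instruction writes. The sketch below models the two instructions as described in the examples (MVKL sign-extends the low half into the whole register, MVKH replaces only the upper half); the function names and signatures are hypothetical.

```python
def mvkl(constant, register):
    """Write the low 16 bits of constant, sign-extended to 32 bits.

    The previous register value is overwritten entirely.
    """
    low = constant & 0xFFFF
    if low & 0x8000:
        low |= 0xFFFF0000
    return low


def mvkh(constant, register):
    """Replace the upper 16 bits of register, keeping its lower 16 bits."""
    return (constant & 0xFFFF0000) | (register & 0x0000FFFF)


ADDRESS = 0x1234FABC
START = 0x87654321

right_order = mvkh(ADDRESS, mvkl(ADDRESS, START))   # MVKL then MVKH
wrong_order = mvkl(ADDRESS, mvkh(ADDRESS, START))   # MVKH then MVKL

print(hex(right_order))   # 0x1234fabc
print(hex(wrong_order))   # 0xfffffabc
```

In the wrong order the sign extension performed by MVKL destroys the upper half that MVKH has just written.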
31
LDH, MVKL and MVKH
        MVKL         pt1, A5
        MVKH         pt1, A5
        MVKL         pt2, A6
        MVKH         pt2, A6
        LDH    .D    *A5, A0
        LDH    .D    *A6, A1
        MPY    .M    A0, A1, A3
        ADD    .L    A4, A3, A4
pt1 and pt2 point to some locations in the data memory.
32
Creating a loop
So far we have implemented the SOP for one tap only, i.e.
        Y = a1 * x1
So let’s create a loop so that we can implement the SOP for N taps.
With the C6000 processors there are no dedicated instructions such as block repeat. The loop is created using the B instruction.
34
What are the steps for creating a loop?
1. Create a label to branch to.
2. Add a branch instruction, B.
3. Create a loop counter.
4. Add an instruction to decrement the loop counter.
5. Make the branch conditional based on the value in the loop counter.
35
1. Create a label to branch to
        MVKL         pt1, A5
        MVKH         pt1, A5
        MVKL         pt2, A6
        MVKH         pt2, A6
loop    LDH    .D    *A5, A0
        LDH    .D    *A6, A1
        MPY    .M    A0, A1, A3
        ADD    .L    A4, A3, A4
36
2. Add a branch instruction, B.
        MVKL         pt1, A5
        MVKH         pt1, A5
        MVKL         pt2, A6
        MVKH         pt2, A6
loop    LDH    .D    *A5, A0
        LDH    .D    *A6, A1
        MPY    .M    A0, A1, A3
        ADD    .L    A4, A3, A4
        B      .?    loop
37
Which unit is used by the B instruction?
[Diagram: Register File A with the .M, .L and .D units and Data Memory; an .S unit is added]
        MVKL   .S    pt1, A5
        MVKH   .S    pt1, A5
        MVKL   .S    pt2, A6
        MVKH   .S    pt2, A6
loop    LDH    .D    *A5, A0
        LDH    .D    *A6, A1
        MPY    .M    A0, A1, A3
        ADD    .L    A4, A3, A4
        B      .S    loop
39
3. Create a loop counter.
        MVKL   .S    pt1, A5
        MVKH   .S    pt1, A5
        MVKL   .S    pt2, A6
        MVKH   .S    pt2, A6
        MVKL   .S    count, B0
loop    LDH    .D    *A5, A0
        LDH    .D    *A6, A1
        MPY    .M    A0, A1, A3
        ADD    .L    A4, A3, A4
        B      .S    loop
B registers will be introduced later.
40
4. Decrement the loop counter
        MVKL   .S    pt1, A5
        MVKH   .S    pt1, A5
        MVKL   .S    pt2, A6
        MVKH   .S    pt2, A6
        MVKL   .S    count, B0
loop    LDH    .D    *A5, A0
        LDH    .D    *A6, A1
        MPY    .M    A0, A1, A3
        ADD    .L    A4, A3, A4
        SUB    .S    B0, 1, B0
        B      .S    loop
41
5. Make the branch conditional based on the value in the loop counter
• What is the syntax for making an instruction conditional?
        [condition]   Instruction   Label
  e.g.
        [B1]   B   loop
(1) The condition can be one of the following registers: A1, A2, B0, B1, B2.
(2) Any instruction can be conditional.
42
5. Make the branch conditional based on the value in the loop counter
• The condition can be inverted by adding the exclamation symbol “!” as follows:
        [!condition]   Instruction   Label
  e.g.
        [!B0]   B   loop    ; branch if B0 = 0
        [B0]    B   loop    ; branch if B0 != 0
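The two condition forms can be summarised as a predicate on the register value. The following sketch is illustrative only; `should_execute` is a hypothetical helper, and the loop simply counts how many times a body runs when the counter starts at 3 and the branch is guarded by [B0].

```python
def should_execute(register_value, inverted=False):
    """[reg] executes when reg != 0; [!reg] executes when reg == 0."""
    is_zero = register_value == 0
    return is_zero if inverted else not is_zero


b0 = 3                              # loop counter, as loaded by MVKL count, B0
iterations = 0
while True:
    iterations += 1                 # loop body
    b0 -= 1                         # SUB B0, 1, B0
    if not should_execute(b0):      # [B0] B loop: branch only while B0 != 0
        break

print(iterations)                   # 3
```

Because the test happens after the decrement, the body always runs at least once, so the counter should hold the number of taps and never zero on entry.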
43
5. Make the branch conditional
        MVKL   .S1   pt1, A5
        MVKH   .S1   pt1, A5
        MVKL   .S1   pt2, A6
        MVKH   .S1   pt2, A6
        MVKL   .S2   count, B0
loop    LDH    .D    *A5, A0
        LDH    .D    *A6, A1
        MPY    .M    A0, A1, A3
        ADD    .L    A4, A3, A4
        SUB    .S    B0, 1, B0
 [B0]   B      .S    loop
44
More on the Branch Instruction (1)
With this processor all the instructions are encoded in 32 bits.
Therefore the label must have a dynamic range of less than 32 bits, as the instruction B itself has to be coded.
Case 1:   B .S1 label     Relative branch.
          Label limited to +/- 2^20 offset.
[Instruction format: 32-bit instruction = B opcode + 21-bit relative address]
45
More on the Branch Instruction (2)
By specifying a register as an operand instead of a label, it is possible to have an absolute branch.
This will allow a dynamic range of 2^32.
Case 2:   B .S2 register     Absolute branch.
          Operates on .S2 ONLY!
[Instruction format: 32-bit instruction = B opcode + 5-bit register code]
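The choice between the two branch forms comes down to whether the offset fits in a signed 21-bit field. The sketch below only checks that range; it deliberately leaves out the unit in which the offset is counted, since the slides do not state it, and both function names are hypothetical.

```python
def fits_signed(value, bits):
    """True if value is representable as a two's-complement number of `bits` bits."""
    low = -(1 << (bits - 1))
    high = (1 << (bits - 1)) - 1
    return low <= value <= high


def branch_kind(offset):
    """Pick a branch form for a hypothetical offset between B and its target."""
    if fits_signed(offset, 21):
        return "B label (relative)"
    return "B register (absolute)"


print(branch_kind(1000))        # B label (relative)
print(branch_kind(2 ** 20))     # B register (absolute)
```

A signed 21-bit field covers -2^20 to 2^20 - 1, which matches the +/- 2^20 limit quoted for the relative branch.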
46
Testing the code
This code performs the following operations:
        a0*x0 + a0*x0 + a0*x0 + … + a0*x0
However, we would like to perform:
        a0*x0 + a1*x1 + a2*x2 + … + aN*xN

        MVKL   .S1   pt1, A5
        MVKH   .S1   pt1, A5
        MVKL   .S1   pt2, A6
        MVKH   .S1   pt2, A6
        MVKL   .S2   count, B0
loop    LDH    .D    *A5, A0
        LDH    .D    *A6, A1
        MPY    .M    A0, A1, A3
        ADD    .L    A4, A3, A4
        SUB    .S    B0, 1, B0
 [B0]   B      .S    loop
47
Modifying the pointers
The solution is to modify the pointers A5 and A6.
48
Indexing Pointers

Syntax         Description       Pointer Modified
*R             Pointer           No
*+R[disp]      + Pre-offset      No
*-R[disp]      - Pre-offset      No
*++R[disp]     Pre-increment     Yes
*--R[disp]     Pre-decrement     Yes
*R++[disp]     Post-increment    Yes
*R--[disp]     Post-decrement    Yes

• Pointer (*R): the pointer is used but not modified. R can be any register.
• Pre-offset: the pointer is modified BEFORE being used and RESTORED to its previous value.
• Pre-increment and pre-decrement: the pointer is modified BEFORE being used and NOT restored to its previous value.
• Post-increment and post-decrement: the pointer is modified AFTER being used and NOT restored to its previous value.
• [disp] specifies the number of elements; the element size is DW (64-bit), W (32-bit), H (16-bit), or B (8-bit).
• disp = R or 5-bit constant. R can be any register.
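The table can be read as a rule that returns two values for every access: the address actually used and the pointer value left behind. The sketch below encodes that reading in Python. It is a simplified model (no 32-bit wrap-around, and a default disp of 1 is assumed), and the names `ELEMENT_BYTES` and `access` are hypothetical.

```python
ELEMENT_BYTES = {"B": 1, "H": 2, "W": 4, "DW": 8}


def access(mode, pointer, disp=1, size="H"):
    """Return (address_used, pointer_afterwards) for one load or store."""
    step = disp * ELEMENT_BYTES[size]   # disp counts elements, not bytes
    if mode == "*R":
        return pointer, pointer
    if mode == "*+R[disp]":
        return pointer + step, pointer
    if mode == "*-R[disp]":
        return pointer - step, pointer
    if mode == "*++R[disp]":
        return pointer + step, pointer + step
    if mode == "*--R[disp]":
        return pointer - step, pointer - step
    if mode == "*R++[disp]":
        return pointer, pointer + step
    if mode == "*R--[disp]":
        return pointer, pointer - step
    raise ValueError("unknown addressing mode: " + mode)


# Post-increment over half words walks an array two bytes at a time.
address, a5 = access("*R++[disp]", 0x100, disp=1, size="H")
print(hex(address), hex(a5))    # 0x100 0x102
```

Post-increment is the natural fit for stepping through the coefficient and sample arrays, because each access uses the current element and leaves the pointer on the next one.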
53
Modifying and testing the code
This code now performs the following operations:
        a0*x0 + a1*x1 + a2*x2 + ... + aN*xN

        MVKL   .S1   pt1, A5
        MVKH   .S1   pt1, A5
        MVKL   .S1   pt2, A6
        MVKH   .S1   pt2, A6
        MVKL   .S2   count, B0
loop    LDH    .D    *A5++, A0
        LDH    .D    *A6++, A1
        MPY    .M    A0, A1, A3
        ADD    .L    A4, A3, A4
        SUB    .S    B0, 1, B0
 [B0]   B      .S    loop
54
Store the final result
        MVKL   .S1   pt1, A5
        MVKH   .S1   pt1, A5
        MVKL   .S1   pt2, A6
        MVKH   .S1   pt2, A6
        MVKL   .S2   count, B0
loop    LDH    .D    *A5++, A0
        LDH    .D    *A6++, A1
        MPY    .M    A0, A1, A3
        ADD    .L    A4, A3, A4
        SUB    .S    B0, 1, B0
 [B0]   B      .S    loop
        STH    .D    A4, *A7
55
Store the final result
The pointer A7 has not been initialized.
56
Store the final result
The pointer A7 is now initialized.
        MVKL   .S1   pt1, A5
        MVKH   .S1   pt1, A5
        MVKL   .S1   pt2, A6
        MVKH   .S1   pt2, A6
        MVKL   .S1   pt3, A7
        MVKH   .S1   pt3, A7
        MVKL   .S2   count, B0
loop    LDH    .D    *A5++, A0
        LDH    .D    *A6++, A1
        MPY    .M    A0, A1, A3
        ADD    .L    A4, A3, A4
        SUB    .S    B0, 1, B0
 [B0]   B      .S    loop
        STH    .D    A4, *A7
57
What is the initial value of A4?
A4 is used as an accumulator, so it needs to be reset to zero.
        MVKL   .S1   pt1, A5
        MVKH   .S1   pt1, A5
        MVKL   .S1   pt2, A6
        MVKH   .S1   pt2, A6
        MVKL   .S1   pt3, A7
        MVKH   .S1   pt3, A7
        MVKL   .S2   count, B0
        ZERO   .L    A4
loop    LDH    .D    *A5++, A0
        LDH    .D    *A6++, A1
        MPY    .M    A0, A1, A3
        ADD    .L    A4, A3, A4
        SUB    .S    B0, 1, B0
 [B0]   B      .S    loop
        STH    .D    A4, *A7
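To check the listing end to end, it can be stepped through on a host machine. The Python sketch below mirrors the listing one instruction per line. It is a behavioural model under stated assumptions: pipeline effects such as delay slots are ignored, the addresses chosen for pt1, pt2 and pt3 are hypothetical, and memory is modelled as a dictionary keyed by half-word address.

```python
def run_sop(a, x):
    """Step through the SOP listing and return the half word stored via A7."""
    if len(a) != len(x) or len(a) == 0:
        raise ValueError("need two equal-length, non-empty arrays")

    pt1, pt2, pt3 = 0x1000, 0x2000, 0x3000   # hypothetical addresses
    memory = {}                               # half-word address -> value
    for i in range(len(a)):
        memory[pt1 + 2 * i] = a[i]
        memory[pt2 + 2 * i] = x[i]

    a5, a6, a7 = pt1, pt2, pt3    # MVKL/MVKH pairs for pt1, pt2, pt3
    b0 = len(a)                   # MVKL count, B0
    a4 = 0                        # ZERO A4
    while True:                   # loop:
        a0 = memory[a5]           # LDH *A5++, A0
        a5 += 2
        a1 = memory[a6]           # LDH *A6++, A1
        a6 += 2
        a3 = a0 * a1              # MPY A0, A1, A3
        a4 = a4 + a3              # ADD A4, A3, A4
        b0 = b0 - 1               # SUB B0, 1, B0
        if b0 == 0:               # [B0] B loop: branch while B0 != 0
            break
    memory[a7] = a4 & 0xFFFF      # STH A4, *A7 keeps the low half word only
    return memory[a7]


print(run_sop([1, 2, 3], [4, 5, 6]))        # 32
print(run_sop([100, 200], [300, 400]))      # 44464, i.e. 110000 truncated to 16 bits
```

The second call shows a practical consequence of storing with STH: the accumulator is 32 bits wide, but only its lower 16 bits reach memory.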
58
Increasing the processing power!
How can we add more processing power to this processor?
[Diagram: Register File A (A0 to A15, 32 bits wide) with the .S1, .M1, .L1 and .D1 units and Data Memory]
59
Increasing the processing power!
(1) Increase the clock frequency.
(2) Increase the number of processing units.
60
To increase the Processing Power, this processor has two sides (A and B or 1 and 2)
[Diagram: side A (Register File A, A0 to A15, with units .S1, .M1, .L1, .D1) and side B (Register File B, B0 to B15, with units .S2, .M2, .L2, .D2), with Data Memory; all registers are 32 bits wide]
61
Can the two sides exchange operands in order to increase performance?
62
The answer is YES but there are limitations.
To exchange operands between the two sides, some cross paths or links are required.
What is a cross path?
• A cross path links one side of the CPU to the other.
• There are two types of cross paths:
    • Data cross paths.
    • Address cross paths.
63
Data Cross Paths
Data cross paths can also be referred to as register file cross paths.
These cross paths allow operands from one side to be used by the other side.
There are only two cross paths:
• one path which conveys data from side B to side A, 1X.
• one path which conveys data from side A to side B, 2X.
64
TMS320C67x Data-Path
65
Data Cross Paths
Data cross paths only apply to the .L, .S and .M units.
The data cross paths are very useful, however there are some limitations in their use.
66
Data Cross Path Limitations
[Diagram: units .L1, .M1 and .S1 with two <src> inputs and one <dst> output, register files A and B, and the 1X and 2X cross paths]
(1) The destination register must be on the same side as the unit.
(2) Source registers: up to one cross path per execute packet per side.
Execute packet: group of instructions that execute simultaneously.
67
Data Cross Path Limitations
AA
22xx
.L1.L1
.M1.M1
.S1.S1
BB
11xx
<src><src>
<src><src><dst><dst>
eg:ADD .L1x A0,A1,B2MPY .M1x A0,B6,A9SUB .S1x A8,B2,A8
|| ADD .L1x A0,B0,A2
|| || Means that the SUB and ADD Means that the SUB and ADD belong to the same fetch packet, belong to the same fetch packet, therefore execute therefore execute simultaneously.simultaneously.
Data Cross Path Limitations
eg:
        ADD  .L1x  A0,A1,B2     ; NOT VALID! Destination B2 is not on the same side as unit .L1.
        MPY  .M1x  A0,B6,A9     ; valid
        SUB  .S1x  A8,B2,A8
   ||   ADD  .L1x  A0,B0,A2     ; NOT VALID! Both instructions use cross path 1x in the same execute packet.
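The two data cross path limitations can be sketched as a small checker. This is only an illustrative model, not a TI tool: the `Instr` structure and `packet_is_valid` function are hypothetical names, each operand is reduced to the register file it lives in, and constant operands are not modeled.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of one instruction in an execute packet. */
typedef struct {
    char unit_side;  /* 'A' for .L1/.S1/.M1, 'B' for .L2/.S2/.M2 */
    char src1_side;  /* register file of the first source  */
    char src2_side;  /* register file of the second source */
    char dst_side;   /* register file of the destination   */
} Instr;

/* Returns true if the packet obeys both limitations:
 * (1) every destination is on the same side as its unit;
 * (2) at most one cross-path source read per side per execute packet. */
bool packet_is_valid(const Instr *packet, size_t n)
{
    int uses_1x = 0;  /* side-A units reading register file B */
    int uses_2x = 0;  /* side-B units reading register file A */

    for (size_t i = 0; i < n; i++) {
        const Instr *in = &packet[i];
        if (in->dst_side != in->unit_side)
            return false;                       /* limitation (1) */
        int crossed = (in->src1_side != in->unit_side)
                    + (in->src2_side != in->unit_side);
        if (in->unit_side == 'A')
            uses_1x += crossed;
        else
            uses_2x += crossed;
    }
    return uses_1x <= 1 && uses_2x <= 1;        /* limitation (2) */
}
```

With this model, `MPY .M1x A0,B6,A9` on its own passes, while the `SUB .S1x` / `ADD .L1x` pair fails because both read register file B through 1x.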
Data Cross Paths for both sides
[Figure: side A units .L1, .M1 and .S1 and side B units .L2, .M2 and .S2, each with two <src> operands and one <dst>, linked by cross paths 1x and 2x]
Address cross paths
[Figure: .D1 unit on side A with its Addr and Data connections]
        LDW  .D1T1  *A0,A5
        STW  .D1T1  A5,*A0
(1) The pointer must be on the same side of the unit.
Load or store to either side
[Figure: .D1 takes its address from *A0 on side A; Data1 goes to A5 and Data2 goes to B5]
DA1 = T1
DA2 = T2
        LDW  .D1T1  *A0,A5
        LDW  .D1T2  *A0,B5
Standard Parallel Loads
[Figure: .D1 loads from *A0 into A5 while .D2 loads from *B0 into B5]
        LDW  .D1T1  *A0,A5
   ||   LDW  .D2T2  *B0,B5
DA1 = T1
DA2 = T2
Parallel Load/Store using address cross paths
[Figure: .D1 loads from *A0 into B5 while .D2 stores A5 to *B0]
        LDW  .D1T2  *A0,B5
   ||   STW  .D2T1  A5,*B0
DA1 = T1
DA2 = T2
Fill the blanks ... Does this work?
[Figure: .D1 with pointer *A0 on side A and .D2 with pointer *B0 on side B]
        LDW  .D1__  *A0,B5
   ||   STW  .D2__  B6,*B0
DA1 = T1
DA2 = T2
Not Allowed! Parallel accesses: both cross or neither cross
[Figure: .D1 with pointer *A0 and .D2 with pointer *B0, both moving data to or from side B registers B5 and B6]
        LDW  .D1T2  *A0,B5
   ||   STW  .D2T2  B6,*B0
DA2 = T2
B5 and B6 are both in register file B, so both accesses would need the single data path T2.
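The address cross path rules for two parallel loads or stores can be sketched the same way. This is an illustrative model only: `MemAccess` and `parallel_access_ok` are hypothetical names, and an access "crosses" when its data path (T1 or T2) is on the opposite side from its .D unit.

```c
#include <stdbool.h>

/* Hypothetical description of one load or store. */
typedef struct {
    int  unit;       /* 1 for .D1, 2 for .D2 */
    int  data_path;  /* 1 for T1 (register file A), 2 for T2 (register file B) */
    char ptr_side;   /* register file holding the pointer: 'A' or 'B' */
} MemAccess;

/* The pointer must be on the same side as the .D unit. */
static bool pointer_ok(MemAccess m)
{
    return m.ptr_side == (m.unit == 1 ? 'A' : 'B');
}

/* An access crosses when its data goes to or from the other side. */
static bool crosses(MemAccess m)
{
    return m.unit != m.data_path;
}

/* Two accesses in one execute packet: both cross or neither cross. */
bool parallel_access_ok(MemAccess a, MemAccess b)
{
    if (a.unit == b.unit)
        return false;  /* each .D unit issues at most one access per packet */
    if (!pointer_ok(a) || !pointer_ok(b))
        return false;
    return crosses(a) == crosses(b);
}
```

The "both cross or neither cross" test is equivalent to saying that the two accesses must use different data paths, which is why the `.D1T2` / `.D2T2` pair is rejected.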
Conditions Don’t Use Cross Paths
• If a conditional register comes from the opposite side, it does NOT use a data or address cross-path.
• Examples:
 [B2]   ADD  .L1  A2,A0,A4
 [A1]   LDW  .D2  *B0,B5
‘C62x Data-Path Summary
[Figure: full CPU datapath; see CPU Ref Guide, Pg 2-2]
‘C67x Data-Path Summary
[Figure: ‘C67x data-path diagram]
Cross Paths - Summary
• Data
– Destination register on same side as unit.
– Source registers - up to one cross path per execute packet per side.
– Use “x” to indicate cross-path.
• Address
– Pointer must be on same side as unit.
– Data can be transferred to/from either side.
– Parallel accesses: both cross or neither cross.
• Conditionals Don’t Use Cross Paths.
Code Review (using side A only)
        MVK  .S1  40, A2        ; A2 = 40, loop count
loop:   LDH  .D1  *A5++, A0     ; A0 = a(n)
        LDH  .D1  *A6++, A1     ; A1 = x(n)
        MPY  .M1  A0, A1, A3    ; A3 = a(n) * x(n)
        ADD  .L1  A3, A4, A4    ; Y = Y + A3
        SUB  .L1  A2, 1, A2     ; decrement loop count
 [A2]   B    .S1  loop          ; if A2 ≠ 0, branch
        STH  .D1  A4, *A7       ; *A7 = Y
$Y = \sum_{n=1}^{40} a_n \cdot x_n$
Note: Assume that A4 was previously cleared and the pointers are initialised.
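For comparison, the same loop can be written as a short C function. This is a sketch rather than the course's code: the name `sum_of_products_40` is hypothetical, the comments map each variable to the register used above, and the inputs are assumed small enough that the 32-bit accumulator does not overflow.

```c
#include <stdint.h>

/* C model of the side-A loop: 16-bit loads (LDH), a 16x16 multiply (MPY),
 * a 32-bit accumulate (ADD) and a down-counting loop (SUB / [A2] B). */
int32_t sum_of_products_40(const int16_t *a, const int16_t *x)
{
    int32_t y = 0;                            /* A4, assumed cleared */
    for (int cnt = 40; cnt != 0; cnt--) {     /* A2 = loop count     */
        int32_t a_n = *a++;                   /* A0 = a(n)           */
        int32_t x_n = *x++;                   /* A1 = x(n)           */
        int32_t prod = a_n * x_n;             /* A3 = a(n) * x(n)    */
        y += prod;                            /* Y = Y + A3          */
    }
    return y;  /* the assembly's STH stores only the low 16 bits of Y */
}
```

Both arrays must hold at least 40 elements, matching the loop count loaded by MVK.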
Dual Resources : Twice as Nice
[Figure: Register File A (A0 to A15 or A31, 32-bits) holding cn, xn, prd, sum, cnt and the pointers *c, *x, *y, with units .M1, .L1, .S1, .D1; Register File B (B0 to B15 or B31, 32-bits) with units .M2, .L2, .S2, .D2]
Our final view of the sum of products example...
Optional - Resource Specific Coding
        MVK  .S1  40, A2
loop:   LDH  .D1  *A5++, A0
        LDH  .D1  *A6++, A1
        MPY  .M1  A0, A1, A3
        ADD  .L1  A4, A3, A4
        SUB  .S1  A2, 1, A2
 [A2]   B    .S1  loop
        STW  .D1  A4, *A7
$y = \sum_{n=1}^{40} c_n \cdot x_n$
[Figure: Register File A (A0 to A15 or A31, 32-bits) holding cn, xn, prd, sum, cnt and the pointers *c, *x, *y, with units .M1, .L1, .S1, .D1]
It’s easier to use symbols rather than register names, but you can use either method.
TMS320C6000 Instruction Set
'C6000 System Block Diagram
[Figure: CPU with units .D1, .M1, .L1, .S1 on Register Set A and units .D2, .M2, .L2, .S2 on Register Set B, connected by Internal Buses to On-chip Memory, External Memory and PERIPHERALS]
To summarize each unit’s instructions ...
‘C62x RISC-like instruction set
.S Unit: ADD, ADDK, ADD2, AND, B, CLR, EXT, MV, MVC, MVK, MVKH, NEG, NOT, OR, SET, SHL, SHR, SSHL, SUB, SUB2, XOR, ZERO
.L Unit: ABS, ADD, AND, CMPEQ, CMPGT, CMPLT, LMBD, MV, NEG, NORM, NOT, OR, SADD, SAT, SSUB, SUB, SUBC, XOR, ZERO
.M Unit: MPY, MPYH, MPYLH, MPYHL, SMPY, SMPYH
.D Unit: ADD, ADDAB (B/H/W), LDB (B/H/W), MV, NEG, STB (B/H/W), SUB, SUBAB (B/H/W), ZERO
No Unit Used: IDLE, NOP
'C62x RISC-Like Instruction Set (by category)
Arithmetic: ABS, ADD, ADDA, ADDK, ADD2, MPY, MPYH, NEG, SMPY, SMPYH, SADD, SAT, SSUB, SUB, SUBA, SUBC, SUB2, ZERO
Program Ctrl: B, IDLE, NOP
Logical: AND, CMPEQ, CMPGT, CMPLT, NOT, OR, SHL, SHR, SSHL, XOR
Data Mgmt: LDB/H/W, MV, MVC, MVK, MVKL, MVKH, MVKLH, STB/H/W
Bit Mgmt: CLR, EXT, LMBD, NORM, SET
Note: Refer to the 'C6000 CPU Reference Guide for more details.
‘C67x: Superset of Fixed-Point (by units)
.S Unit: ADD, ADDK, ADD2, AND, B, CLR, EXT, MV, MVC, MVK, MVKH, NEG, NOT, OR, SET, SHL, SHR, SSHL, SUB, SUB2, XOR, ZERO
.S Unit floating-point additions: ABSSP, ABSDP, CMPGTSP, CMPEQSP, CMPLTSP, CMPGTDP, CMPEQDP, CMPLTDP, RCPSP, RCPDP, RSQRSP, RSQRDP, SPDP
.L Unit: ABS, ADD, AND, CMPEQ, CMPGT, CMPLT, LMBD, MV, NEG, NORM, NOT, OR, SADD, SAT, SSUB, SUB, SUBC, XOR, ZERO
.L Unit floating-point additions: ADDSP, ADDDP, SUBSP, SUBDP, INTSP, INTDP, SPINT, DPINT, SPTRUNC, DPTRUNC, DPSP
.M Unit: MPY, MPYH, MPYLH, MPYHL, SMPY, SMPYH
.M Unit additions: MPYSP, MPYDP, MPYI, MPYID
.D Unit: ADD, ADDAB (B/H/W), LDB (B/H/W), LDDW, MV, NEG, STB (B/H/W), SUB, SUBAB (B/H/W), ZERO
No Unit Required: IDLE, NOP
'C64x: Superset of ‘C62x Instruction Set
.L Unit
  Data Pack/Un: PACK2, PACKH2, PACKLH2, PACKHL2, PACKH4, PACKL4, UNPKHU4, UNPKLU4, SWAP2/4
  Dual/Quad Arith: ABS2, ADD2, ADD4, MAX, MIN, SUB2, SUB4, SUBABS4
  Bitwise Logical: ANDN
  Shift & Merge: SHLMB, SHRMB
  Load Constant: MVK (5-bit)
.M Unit
  Bit Operations: BITC4, BITR, DEAL, SHFL
  Move: MVD
  Average: AVG2, AVG4
  Shifts: ROTL, SSHVL, SSHVR
  Multiplies: MPYHI, MPYLI, MPYHIR, MPYLIR, MPY2, SMPY2, DOTP2, DOTPN2, DOTPRSU2, DOTPNRSU2, DOTPU4, DOTPSU4, GMPY4, XPND2/4
.D Unit
  Mem Access: LDDW, LDNW, LDNDW, STDW, STNW, STNDW
  Load Constant: MVK (5-bit)
  Dual Arithmetic: ADD2, SUB2
  Bitwise Logical: AND, ANDN, OR, XOR
  Address Calc.: ADDAD
.S Unit
  Data Pack/Un: PACK2, PACKH2, PACKLH2, PACKHL2, UNPKHU4, UNPKLU4, SWAP2, SPACK2, SPACKU4
  Dual/Quad Arith: SADD2, SADDUS2, SADD4
  Bitwise Logical: ANDN
  Shifts & Merge: SHR2, SHRU2, SHLMB, SHRMB
  Compares: CMPEQ2, CMPEQ4, CMPGT2, CMPGT4
  Branches/PC: BDEC, BPOS, BNOP, ADDKPC
Sample C62x Compiler Benchmarks
TI C62x™ Compiler Performance Release 4.0: Execution Time in µs @ 300 MHz Versus hand-coded assembly based on cycle count
Algorithm | Used In | Asm Cycles | Assembly Time (µs) | C Cycles (Rel 4.0) | C Time (µs) | % Efficiency vs Hand Coded
Block Mean Square Error (MSE of a 20 column image matrix) | For motion compensation of image data | 348 | 1.16 | 402 | 1.34 | 87%
Codebook Search | CELP based voice coders | 977 | 3.26 | 961 | 3.20 | 100%
Vector Max (40 element input vector) | Search Algorithms | 61 | 0.20 | 59 | 0.20 | 100%
All-zero FIR Filter (40 samples, 10 coefficients) | VSELP based voice coders | 238 | 0.79 | 280 | 0.93 | 85%
Minimum Error Search (Table Size = 2304) | Search Algorithms | 1185 | 3.95 | 1318 | 4.39 | 90%
IIR Filter (16 coefficients) | Filter | 43 | 0.14 | 38 | 0.13 | 100%
IIR – cascaded biquads (10 Cascaded biquads, Direct Form II) | Filter | 70 | 0.23 | 75 | 0.25 | 93%
MAC (Two 40 sample vectors) | VSELP based voice coders | 61 | 0.20 | 58 | 0.19 | 100%
Vector Sum (Two 44 sample vectors) | | 51 | 0.17 | 47 | 0.16 | 100%
MSE (MSE between two 256 element vectors) | Mean Sq. Error Computation in Vector Quantizer | 279 | 0.93 | 274 | 0.91 | 100%
Completely natural C code (non ’C6000 specific). Code available at: http://www.ti.com/sc/c6000compiler
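The time and efficiency columns follow from the cycle counts. The sketch below shows the arithmetic; the function names are hypothetical, and capping the efficiency at 100% is an inference from the rows where the C version needs fewer cycles than the hand-coded assembly, not something the slide states.

```c
/* Execution time in microseconds: cycles divided by the clock in MHz. */
double cycles_to_us(unsigned cycles, double clock_mhz)
{
    return cycles / clock_mhz;
}

/* Efficiency of compiled C versus hand-coded assembly, in percent.
 * Assumption: values above 100 are reported as 100, as the table appears to do. */
double efficiency_percent(unsigned asm_cycles, unsigned c_cycles)
{
    double e = 100.0 * asm_cycles / c_cycles;
    return e > 100.0 ? 100.0 : e;
}
```

For the Block Mean Square Error row, 348 cycles at 300 MHz gives about 1.16 µs, and 348 / 402 gives about 87%.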
Sample Imaging & Telecom Benchmarks
DSP & Image Processing Kernels | Units | C62x Cycle Count | C64x Cycle Count | Cycle Improvement C64:C62 | 720MHz C64x vs 300MHz C62x
Reed Solomon Decode: Syndrome Accumulation (204,188,8) Packet | cycles/packet | 1680 | 470 | 3.5x | 8.4x
Viterbi Decode (GSM) (16 states) | cycles/output | 38.25 | 14* | 2.7x | 6.5x
FFT - Radix 4 - Complex (size = N log (N)) (16-bit) | cycles/data | 12.7 | 6.0 | 2.1x | 5x
Polyphase Filter - Image Scaling (8-bit) | cycles/output/filter tap | 0.77 | 0.33 | 2.3x | 5.5x
Correlation - 3x3 (8-bit) | cycles/pixel | 4.5 | 1.28 | 3.5x | 8.4x
Median Filter - 3x3 (8-bit) | cycles/pixel | 9.0 | 2.1 | 4.3x | 10.3x
Motion Estimation - 8x8 MAD (8-bit) | cycles/pixel | 0.953 | 0.126 | 7.6x | 18.2x
* Includes traceback
TMS320C6000 Memory
Memory size per device
Devices | Internal | EMIFA | EMIFB
C6201, C6701, C6204, C6205 | P = 64 kB, D = 64 kB | 52M Bytes (32-bits wide) | N/A
C6202 | P = 256 kB, D = 128 kB | 52M Bytes (32-bits wide) | N/A
C6203 | P = 384 kB, D = 512 kB | 52M Bytes (32-bits wide) | N/A
C6211, C6711 | L1P = 4 kB, L1D = 4 kB, L2 = 64 kB | 128M Bytes (32-bits wide) | N/A
C6712 | L1P = 4 kB, L1D = 4 kB, L2 = 64 kB | 64M Bytes (16-bits wide) | N/A
C6713 | L1P = 4 kB, L1D = 4 kB, L2 = 256 kB | 128M Bytes (32-bits wide) | N/A
C6411, DM642 | L1P = 16 kB, L1D = 16 kB, L2 = 256 kB | 128M Bytes (32-bits wide) | N/A
C6414, C6415, C6416 | L1P = 16 kB, L1D = 16 kB, L2 = 1 MB | 256M Bytes (64-bits wide) | 64M Bytes (16-bits wide)
Internal Memory Summary
Devices | Internal (L2) | External
C6211, C6711, C6713 | 64 kB | 512M (32-bit wide)
C6712 | 256 kB | 512M (16-bit wide)
LINK: TMS320C6000 DSP Generation
Devices | Internal (L2) | External
C6414, C6415, C6416 | 1 MB | A: 1GB (64-bit), B: 256kB (16-bit)
DM642 | 256 kB | 1GB (64-bit)
C6411 | 256 kB | 256MB (32-bit)
Performance
Making use of Parallelism
Given this simple loop …
$y = \sum_{n=1}^{40} c_n \cdot x_n$
        MVK  .S1  40, cnt
loop:
        LDH  .D1  *cp++, c
        LDH  .D1  *xp++, x
        MPY  .M1  c, x, prod
        ADD  .L1  y, prod, y
        SUB  .L1  cnt, 1, cnt
 [cnt]  B    .S1  loop
        STW  .D1  y, *yp
[Figure: register file A holding c, x, prod, y, cnt and the pointers *cp, *xp, *yp, with units .M1, .L1, .S1, .D1]
How many of these instructions can we get in parallel?
C62x Intense Parallelism
Given this C code
short mac(short *c, short *x, int count) {
  for (i=0; i < count; i++) {
    sum += c[i] * x[i]; } …
The C62x compiler can achieve Two Sum-of-Products per cycle
L2: ; PIPED LOOP PROLOG
           LDW   .D1  *A4++,A3
||         LDW   .D2  *B6++,B7

           LDW   .D1  *A4++,A3
||         LDW   .D2  *B6++,B7

   [B0]    B     .S1  L3
||         LDW   .D1  *A4++,A3
||         LDW   .D2  *B6++,B7

   [B0]    B     .S1  L3
||         LDW   .D1  *A4++,A3
||         LDW   .D2  *B6++,B7

   [B0]    B     .S1  L3
||         LDW   .D1  *A4++,A3
||         LDW   .D2  *B6++,B7

           MPY   .M2  B7,A3,B4
||         MPYH  .M1  B7,A3,A5
|| [B0]    B     .S1  L3
||         LDW   .D1  *A4++,A3
||         LDW   .D2  *B6++,B7

           MPY   .M2  B7,A3,B4
||         MPYH  .M1  B7,A3,A5
|| [B0]    B     .S1  L3
||         LDW   .D1  *A4++,A3
||         LDW   .D2  *B6++,B7
;** -----------------------*
L3: ; PIPED LOOP KERNEL
           ADD   .L2  B4,B5,B5
||         ADD   .L1  A5,A0,A0
||         MPY   .M2  B7,A3,B4
||         MPYH  .M1  B7,A3,A5
|| [B0]    B     .S1  L3
|| [B0]    SUB   .S2  B0,1,B0
||         LDW   .D1  *A4++,A3
||         LDW   .D2  *B6++,B7
;** -----------------------*
What about the ‘C67x?
C67x MAC using Natural C
float mac(float *c, float *x, int count)
{ int i; float sum = 0;
  for (i=0; i < count; i++) {
    sum += c[i] * x[i]; } …
;** --------------------------------------------------*
LOOP: ; PIPED LOOP KERNEL
           LDDW   .D1   *A4++,A7:A6
||         LDDW   .D2   *B4++,B7:B6
||         MPYSP  .M1X  A6,B6,A5
||         MPYSP  .M2X  A7,B7,B5
||         ADDSP  .L1   A5,A8,A8
||         ADDSP  .L2   B5,B8,B8
|| [A1]    B      .S2   LOOP
|| [A1]    SUB    .S1   A1,1,A1
;** --------------------------------------------------*
[Figure: register files A0 to A15 and B0 to B15 with units .M1, .L1, .D1, .S1 and .M2, .L2, .D2, .S2, the Controller/Decoder and Memory]
The C67x compiler gets two 32-bit floating-point Sum-of-Products per iteration
Can the 'C64x do better?
C64x gets four MAC’s using DOTP2
short mac(short *c, short *x, int count)
{ int i; short sum = 0;
  for (i=0; i < count; i++) {
    sum += c[i] * x[i]; } …
;** --------------------------------------------------*
; PIPED LOOP KERNEL
LOOP:
           ADD    .L2    B8,B6,B6
||         ADD    .L1    A6,A7,A7
||         DOTP2  .M2X   B4,A4,B8
||         DOTP2  .M1X   B5,A5,A6
|| [ B0]   B      .S1    LOOP
|| [ B0]   SUB    .S2    B0,-1,B0
||         LDDW   .D2T2  *B7++,B5:B4
||         LDDW   .D1T1  *A3++,A5:A4
;** --------------------------------------------------*
DOTP2:
  A5 = m1 m0
  B5 = n1 n0
  A6 = m1*n1 + m0*n0
  A7 = running sum
How many multiplies can the ‘C6x perform?
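The DOTP2 operation shown above (m1*n1 + m0*n0 on two packed 16-bit halves) can be modeled in portable C. This is a sketch for illustration, not the compiler's implementation: `sext16`, `dotp2_model` and `mac_dotp2` are hypothetical names, the halves are treated as signed, and the result is returned as 64 bits so that the model itself cannot overflow (the hardware result register is 32 bits).

```c
#include <stdint.h>

/* Sign-extend the low 16 bits of v. */
static int32_t sext16(uint32_t v)
{
    v &= 0xFFFFu;
    return v >= 0x8000u ? (int32_t)v - 0x10000 : (int32_t)v;
}

/* Model of DOTP2: a = m1:m0, b = n1:n0, result = m1*n1 + m0*n0. */
int64_t dotp2_model(uint32_t a, uint32_t b)
{
    int64_t hi = (int64_t)sext16(a >> 16) * sext16(b >> 16);
    int64_t lo = (int64_t)sext16(a) * sext16(b);
    return hi + lo;
}

/* Sum of products using two MACs per dotp2_model call.
 * Assumption: count is even; an odd trailing element is ignored. */
int64_t mac_dotp2(const int16_t *c, const int16_t *x, int count)
{
    int64_t sum = 0;  /* running sum */
    for (int i = 0; i + 1 < count; i += 2) {
        uint32_t cw = ((uint32_t)(uint16_t)c[i + 1] << 16) | (uint16_t)c[i];
        uint32_t xw = ((uint32_t)(uint16_t)x[i + 1] << 16) | (uint16_t)x[i];
        sum += dotp2_model(cw, xw);
    }
    return sum;
}
```

The kernel on the slide issues two DOTP2 instructions per cycle, one on each .M unit, which is where the four MACs per cycle come from.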
MMAC’s
How many 16-bit MMACs (millions of MACs per second) can the 'C6201 perform?
  400 MMACs (two .M units x 200 MHz)
How about 16x16 MMAC’s on the ‘C64x devices?
  2 .M units x 2 16-bit MACs (per .M unit / per cycle) x 720 MHz = 2880 MMACs
How many 8-bit MMACs on the ‘C64x?
  5760 MMACs (on 8-bit data)
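The MMAC figures are a product of three numbers, which a one-line function makes explicit. The function name is hypothetical, and the value of four 8-bit MACs per .M unit per cycle is inferred from 5760 = 2 x 4 x 720 rather than stated on the slide.

```c
/* Millions of MACs per second =
 * number of .M units x MACs per unit per cycle x clock in MHz. */
unsigned mmacs(unsigned m_units, unsigned macs_per_unit_per_cycle,
               unsigned clock_mhz)
{
    return m_units * macs_per_unit_per_cycle * clock_mhz;
}
```

For example, `mmacs(2, 1, 200)` reproduces the 'C6201 figure and `mmacs(2, 2, 720)` the 16-bit ‘C64x figure.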
How Do We Get Such High Parallelism?
• Compiler and Assembly Optimizer use a technique called Software Pipelining
• Software pipelining enables high performance (esp. on DSP-type loops)
• Key point: Tools do all the work!
What is software pipelining? Let's look at a simple example ...
Tools Use Software Pipelining
Here’s a simple example to demonstrate ...
        LDH
   ||   LDH
        MPY
        ADD
How many cycles would it take to perform this loop 5 times?
  5 x 3 = 15 cycles
Our functional units could be used like ...
Without Software Pipelining
Cycle | Instructions issued
1 | ldh, ldh
2 | mpy
3 | add
4 | ldh, ldh
5 | mpy
6 | add
7 | ldh, ldh
(Functional unit columns on the slide: .M1 .M2 .L1 .L2 .S1 .S2 .D1 .D2)
In seven cycles, we’re almost half-way done ...
With Software Pipelining
Cycle | Instructions issued
1 | ldh, ldh
2 | mpy, ldh, ldh
3 | add, mpy, ldh, ldh
4 | add, mpy, ldh, ldh
5 | add, mpy, ldh, ldh
6 | add, mpy
7 | add
(Functional unit columns on the slide: .M1 .M2 .L1 .L2 .S1 .S2 .D1 .D2)
Completes in only 7 cycles
It takes 1/2 the time! How does this translate to code?
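The two cycle counts can be expressed as formulas. The sketch below uses hypothetical function names and keeps the slide's simplified picture: each of the three stages (load, multiply, add) is assumed to take one cycle, and with software pipelining a new iteration is assumed to start every cycle.

```c
/* Each iteration runs all stages before the next one starts. */
unsigned cycles_without_pipelining(unsigned iterations, unsigned stages)
{
    return iterations * stages;
}

/* A new iteration starts every cycle; the last one still needs
 * (stages - 1) further cycles to drain. */
unsigned cycles_with_pipelining(unsigned iterations, unsigned stages)
{
    if (iterations == 0 || stages == 0)
        return 0;
    return iterations + stages - 1;
}
```

For 5 iterations of 3 stages this gives 15 cycles without pipelining and 7 with it, matching the two charts.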
S/W Pipelining Translated to Code
[Figure: the software-pipelined schedule for cycles 1 to 7, as on the previous slide]
c1:     LDH
   ||   LDH
c2:     MPY
   ||   LDH
   ||   LDH
c3:     ADD
   ||   MPY
   ||   LDH
   ||   LDH
DSK
Code Composer Studio
C6416 DSK
Diagnostic Utility included with DSK ...
DSK’s Diagnostic Utility
• Test/Diagnose DSK hardware
• Verify USB emulation link
• Use Advanced tests to facilitate debugging
• Reset DSK hardware
CCS Overview ...
Code Composer Studio
DSK’s Code Composer Studio Includes:
• Integrated Edit / Debug GUI
• Code Generation Tools
• BIOS: Real-time kernel, Real-time analysis
[Figure: Edit feeds the Compiler, Asm Opto and Asm; Link combines their output with the Standard Runtime Libraries, DSP/BIOS Libraries and DSP/BIOS Config Tool into a .out file; Debug runs it on SIM (Simulator), DSK, EVM, Third Party, or XDS with a DSP Board]
CCS is Project centric ...
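Code Generation
[Figure: the Editor produces .sa, .c / .cpp and .asm files; the Asm Optimizer (.sa) and Compiler (.c / .cpp) feed Asm (.asm), which produces .obj; the Linker (Link.cmd) produces .out and .map]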
What is a Project?
Project (.PJT) file contains:
• References to files: Source, Libraries, Linker, etc …
• Project settings: Compiler Options, DSP/BIOS, Linking, etc …
The project menu ...
Project Menu
Access via pull-down menu or by right-clicking .pjt file in project explorer window
Hint: Create and open projects from the Project menu, not the File menu.
Build Options... Next slide
Build Options
-g -q -fr"c:\modem\Debug" -mv6700
Eight Categories of Compiler options
The most common Compiler Options are ...
Compiler’s Build Options
Nearly one-hundred compiler options available to tune your code's performance, size, etc.
Following table lists the most common options:
Options | Description
-mv6700 | Generate ‘C67x code (‘C62x is default)
-mv6400 | Generate 'C64x code
-fr <dir> | Directory for object/output files
-fs <dir> | Directory for assembly files
-q | Quiet mode (display less info while compiling)
-g | Enables src-level symbolic debugging
-s | Interlist C statements into assembly listing
debug options
In Chapter 4 we will examine the options which enable the compiler’s optimizer
And, the Config Tool ...
DSP/BIOS Configuration Tool
Simplifies system design by:
• Automatically includes the appropriate runtime support libraries
• Automatically handles interrupt vectors and system reset
• Handles system memory configuration (builds CMD file)
• Generates 5 files when CDB file is saved: C file, Asm file, 2 header files and a linker command (.cmd) file
More to be discussed later …
‘C6000 C Data Types
Type | Size | Representation
char, signed char | 8 bits | ASCII
unsigned char | 8 bits | ASCII
short | 16 bits | 2’s complement
unsigned short | 16 bits | binary
int, signed int | 32 bits | 2’s complement
unsigned int | 32 bits | binary
long, signed long | 40 bits | 2’s complement
unsigned long | 40 bits | binary
enum | 32 bits | 2’s complement
float | 32 bits | IEEE 32-bit
double | 64 bits | IEEE 64-bit
long double | 64 bits | IEEE 64-bit
pointers | 32 bits | binary
GEL Scripting
GEL: General Extension Language
• C style syntax
• Large number of debugger commands as GEL functions
• Write your own functions
• Create GEL menu items
CCS Scripting
Debug using VB Script or Perl
Using CCS Scripting, a simple script can:
• Start CCS
• Load a file
• Read/write memory
• Set/clear breakpoints
• Run, and perform other basic testing functions
TCONF Scripting (CDB)
Tconf Script (.tcf): hello_dsk62cfg.tcf
utils.loadPlatform("dsk6211");    /* load DSK6211 platform into TCOM */
utils.getProgObjs(prog);          /* make all prog objects JavaScript global vars */
LOG_system.bufLen = 128;          /* set buffer length of LOG_system to 128 */
utils.importFile("hello");        /* import portable application script */
prog.gen();                       /* generate cfg files (and CDB file) */

Tconf Include File (.tci): hello.tci
var trace = LOG.create("trace");  /* create a new user log, named trace */
trace.bufLen = 32;                /* initialize its length to 32 (words) */

Your Application: hello.c
#include <log.h>
extern LOG_Obj trace;             /* created in hello.tci */
int main() {
  LOG_printf(&trace, "Hello World!\n");
  return (0);
}

• A textual way to configure CDB files
• Runs on both PC and Unix
• Create #include type files (.tci)
• More flexible than Config Tool
Chapter 2
TMS320C6000 Architectural Overview
- End -