Multicore Training
C6614/6612 Memory System
MPBU Application Team
Multicore Training
Agenda
1. Overview of the 6614/6612 TeraNet 2. Memory System – DSP CorePac Point of View
1. Overview of Memory Map2. MSMC and External Memory
3. Memory System – ARM Point of View1. Overview of Memory Map2. ARM Subsystem Access to Memory
4. ARM-DSP CorePac Communication
Multicore Training
Agenda
1. Overview of the 6614/6612 TeraNet 2. Memory System – DSP CorePac Point of View
1. Overview of Memory Map2. MSMC and External Memory
3. Memory System – ARM Point of View1. Overview of Memory Map2. ARM Subsystem Access to Memory
4. ARM-DSP CorePac Communication1. SysLib and its libraries2. MSGCOM 3. Pktlib4. Resource Manager
Multicore Training
Cores @ 1.0 GHz / 1.2 GHz
C66x™CorePac
TCI6614
MSMC
2MBMSM
SRAM
64-Bit DDR3 EMIF
BCP
x2
x2
Coprocessors
VCP2x4
PowerManagement
Debug & Trace
Boot ROM
Semaphore
MemorySubsystem
SR
I O
x4
PC
I e
x2
UA
RT
x2
AIF
2x
6
SP
I
IC
2
PacketDMA
Multicore Navigator
QueueManager
EM
IF 1
6
x3 32KB L1P-Cache
32KB L1D-Cache
1024KB L2 Cache
RSA RSA
x2
PLL
EDMA
x3
HyperLink TeraNet
Network CoprocessorS
wit c
h
Eth
ern
et
Sw
it ch
SG
MII
x2Packet
Accelerator
SecurityAccelerator
FFTC
TCP3d
TAC
x2RAC
ARMCortex-A832KB L1P-Cache
32KB L1D-Cache
256KB L2 Cache
US
I M
TCI6614 Functional Architecture
Multicore Training
QMSS
C6614 TeraNet Data Connections
MSMCDDR3
Shared L2 S
S
CoreS
PCIe
S
TAC_BES
SRIO
PCIe
QM_SS
M
M
M
TPCC16ch QDMA
MTC0MTC1
M
MD
DR3
XMC
M
DebugSS M
TPCC64ch
QDMA
MTC2MTC3MTC4MTC5
TPCC64ch
QDMA
MTC6MTC7MTC8MTC9
Network Coprocessor
M
HyperLink M
HyperLinkS
AIF / PktDMA M
FFTC / PktDMA M
RAC_BE0,1 M
TAC_FE M
SRIOS
S
RAC_FES
TCP3dS
TCP3e_W/RS
VCP2 (x4)S
M
EDMA_0
EDMA_1,2
CoreS MCoreS ML2 0-3S M
CPUCLK/2
256bit TeraNet 2A
FFTC / PktDMA M
TCP3dS
RAC_FES
VCP2 (x4)S VCP2 (x4)S VCP2 (x4)S
RAC_BE0,1 M
CPUCLK/3
128bit TeraNet 3A
S S S S
CPUCLK/2256bit TeraNet
2B
MPU
DD
R3
ARM
ToTeraNet
2B
From ARM
Multicore Training
Agenda
1. Overview of the 6614/6612 TeraNet 2. Memory System – DSP CorePac Point of View
1. Overview of Memory Map2. MSMC and External Memory
3. Memory System – ARM Point of View1. Overview of Memory Map2. ARM Subsystem Access to Memory
4. ARM-DSP CorePac Communication1. SysLib and its libraries2. MSGCOM 3. Pktlib4. Resource Manager
Multicore Training
SoC Memory Map 1/2??? ??? Size?? Function??
00800 0000 0087 FFFF 512K L2 SRAM
00E0 0000 00E0 7FFF 32K L1P
00F0 0000 00F0 7FFF 32K L1D
0220 0000 0220 007F 128K Timer 0
0264 0000 0264 07FF 2K Semaphores
0270 0000 0270 7FFF 32K EDMA CC
027D 0000 027d 3FFF 16K TETB Core 0
0c00 0000 0c3f FFFF 4M Shared L2
1080 0000 1087 FFFF 512K L2 Core 0 Global
12E0 0000 12e0 7FFF 32K Core 2 L1P Global
Multicore Training
SoC Memory Map 2/2 ??? ??? Size?? Function??
2000 0000 200F FFFF 1M System Trace Management Configuration
2180 0000 33FF FFFF 296M+32K Reserved
3400 0000 341F FFFF 2M QMSS data
3420 0000 3FFF FFFF 190M Reserved
4000 0000 4FFF FFFF 256M HyperLink Data
5000 0000 5FFF FFFF 256K Reserved
6000 0000 6FFF FFFF 256K PCIe Data
7000 0000 73FF FFFF 64M EMIF16 data NAND Memory (CS2)
8000 0000 FFFF FFFF 2G DDR3 Data
Multicore Training
MSMC Block DiagramCorePac 2
Shared RAM2048 KB
CorePac Slave Port
CorePac Slave Port
SystemSlave Port
forShared SRAM
(SMS)
System Slave Port
for External Memory
(SES)
MSMC System Master Port
MSMC EMIF Master Port
MSMC Datapath
Arbitration256
256
256
MemoryProtection &
ExtensionUnit
(MPAX)
256 256
Events
MemoryProtection &
ExtensionUnit
(MPAX)
MSMC Core
To SCR_2_Band the DDR
Tera
Net
TeraNet
256
Error Detection & Correction (EDC)
256
256
256
CorePac Slave Port
CorePac Slave Port
256 256
XMCMPAX
CorePac 3
XMCMPAX
CorePac 0
XMCMPAX
CorePac 1
XMCMPAX
Multicore Training
XMC – External Memory Controller
The XMC is responsible for the following:
1. Address extension/translation2. Memory protection for addresses outside C66x3. Shared memory access path4. Cache and pre-fetch support
User Control of XMC:
5. MPAX (Memory Protection and Extension) Registers6. MAR (Memory Attributes) Registers
Each core has its own set of MPAX and MAR registers!
Multicore Training
The MPAX Registers
MPAX (Memory Protection and Extension) Registers • Translate between physical and logical address• 16 registers (64 bits each) control (up to) 16 memory
segments.• Each register translates logical memory into physical memory
for the segment.
Multicore Training
The MAR RegistersMAR (Memory Attributes) Registers:• 256 registers (32 bits each) control 256 memory segment.
– Each segment size is 16MBytes, from logical address 0x0000 0000 to address 0xFFFF FFFF.
– The first 16 registers are read only. They control the internal memory of the core.
• Each register controls the cacheability of the segment (bit 0) and the pre-fetch-ability (bit 3). All other bits are reserved and set to 0.
• All MAR bits are set to zero after reset.
Multicore Training
• Speeds up processing by making shared L2 cached by private L2 (L3 shared).
• Uses the same logical address in all cores; Each one points to a different physical memory.
• Uses part of shared L2 to communicate between cores. So makes part of shared L2 non-cacheable, but leaves the rest of shared L2 cacheable.
• Utilizes 8G of external memory; 2G for each core.
XMC: Typical Use Cases
Multicore Training
Agenda
1. Overview of the 6614/6612 TeraNet 2. Memory System – DSP CorePac Point of View
1. Overview of Memory Map2. MSMC and External Memory
3. Memory System – ARM Point of View1. Overview of Memory Map2. ARM Subsystem Access to Memory
4. ARM-DSP CorePac Communication1. SysLib and its libraries2. MSGCOM 3. Pktlib4. Resource Manager
Multicore Training
ARM Core
AXI2VBUS Bridge
(CPU/2)
SSMCPU/2
AINTCCPU/2
Clk Div
Sec/PublicROM 176KB
ublic
ICE Crusher
System Interrupts
Debug Bus
L1D 32KB
L2 Cache256 KB
Integer Core
ger
Neon Core
ARM A8 Core
1.2GHz
L1L 32KB
128
/32
Sec/Public RAM 64KB
OCP2ATB
CoreSight Embedded
Trace Macrocell
ARM Corepac
/32
/64
256b VBUSM running at CPU/2Connecting to ARM_128 switch
for DDR_EMIF
128b VBUSM running at CPU/3Connecting to ARM_64 switch
Master 0 Master 1
/32
Multicore Training
ARM Subsystem Memory Map
Multicore Training
ARM Subsystem Ports
• 32-bit ARM addressing (MMU or Kernel)• 31 bits addressing into the external memory
– ARM can address ONLY 2GB of external DDR (No MPAX translation) 0x8000 0000 to 0xFFFF FFFF
• 31 bits are used to access SOC memory or to address internal memory (ROM)
Multicore Training
ARM Visibility Through the TeraNet Connection
• It can see the QMSS data at address 0x3400 0000• It can see HyperLink data at address 0x4000 0000• It can see PCIe data at address 0x6000 0000• It can see shared L2 at address 0x0c00 0000 • It can see EMIF 16 data at address 0x7000 0000
– NAND– NOR– Asynchronous SRAM
Multicore Training
ARM Access SOC Memory
• Do you see a problem with HyperLink access?– Addresses in the 0x4 range are part of the internal ARM
memory map
• What about the cache and data from the Shared Memory and the Async EMIF16?– The next slide presents a page from the device errata
Multicore Training
Errata User’s Note Number 10
Multicore Training
Additional Comments About the ARM
• ARM uses only Little Endian.• DSP CorePac can use Little Endian or Big
Endian.• The User’s Guide shows how to mix ARM core
Little Endian code with DSP CorePac Big Endian.
Multicore Training
Agenda
1. Overview of the 6614/6612 TeraNet 2. Memory System – DSP CorePac Point of View
1. Overview of Memory Map2. MSMC and External Memory
3. Memory System – ARM Point of View1. Overview of Memory Map2. ARM Subsystem Access to Memory
4. ARM-DSP CorePac Communication1. SysLib and its libraries2. MSGCOM 3. Pktlib4. Resource Manager
Multicore Training
MCSDK Software Layers
Hardware
SYS/BIOSRTOS
Software Framework Components
InterprocessorCommunication
Instrumentation(MCSA)
Communication Protocols
TCP/IPNetworking
(NDK)
Algorithm Libraries
DSPLIB IMGLIB MATHLIB
Demonstration Applications
HUA/OOB IO BmarksImage
Processing
Low-Level Drivers (LLDs)
Chip Support Library
EDMA3
PCIe
PA
QMSS
SRIO
CPPI
FFTC
HyperLink
TSIP
…
Platform/EVM Software
Bootloader
PlatformLibrary
POST
OSAL
ResourceManager
Transports- IPC- NDK
Multicore Training
SysLib Library – An IPC element
Multicore Training
MSGCOM Library
• Purpose - Exchange messages between a reader and writer
• Read/write applications can reside on the same DSP core, different DSP cores or ARM and DSP core.
• Channel based communication. A channel is defined by a reader (message destination) side. It can support multiple writers (message sources)
Multicore Training
Channels Types
• Simple queue channels – messages are places directly into a destination queue that is associated with a reader.
• Virtual Channels – multiple virtual channels are associated with the same hardware queue
• Queue DMA channels – messages are transferred between the writer and the reader
• Proxy Queue Channels – Indirect channels works over BSD sockets, enable communications between writer and reader that are not connected to the same Navigator
Multicore Training
Interrupt Types
• No interrupt; reader poll until a message arrive• Direct Interrupt; low-delay system. Special queues
must be used• Accumulated interrupts; Special queues are used.
Reader gets an interrupt when the number of messages crosses threshold
Multicore Training
Blocking and Non-Blocking
• The reader can be blocked until message is available
• The reader polls for message and if there is no message it continues execution
Multicore Training
Case 1 – Generic Channel communication
Zero Copy based Constructions Core to Core
RE
AD
ER
WR
ITE
R
MyCh1
Put(hCh,msg);Tibuf *msg = PktLibAlloc(hHeap);
PktLibFree(msg);Tibuf *msg =Get(hCh);
hCh=Find(“MyCh1”); hCh = Create(“MyCh1”);
Delete(hCh);
Note – logical function only
1. Reader create a channel ahead of time with a given name
2. When writer has information to write it looks for the channel (find)
3. The write asks for buffer and writes the message into the buffer
4. The writer put the buffer. The navigator does it magic5. When the reader calls get, it gets the message6. The reader responsibility is to free the message after it
is done reading
Multicore Training
Case 2 – Low-Latency Channel communicationSingle and Virtual Channel
Zero Copy based Constructions Core to Core
RE
AD
ER
WR
ITE
R
Note – logical function only
1. Reader create a channel based on one of the pending queues ahead of time with a given name. 2. The reader waits for the message by pending on a (software) semaphore3. When writer has information to write it looks for the channel (find)4. The write asks for buffer and writes the message into the buffer5. The writer put the buffer. The navigator generate an interrupt . The ISR post the semaphore to the
correct channel6. The reader start processing the message7. Virtual channel structure enables usage of a single interrupt to post semaphore to one of many
channels
MyCh3
MyCh2hCh = Create(“MyCh2”);
Posts internal Sem and/or callback posts MySem;chRx(driver)
Put(hCh,msg);Tibuf *msg = PktLibAlloc(hHeap);
PktLibFree(msg);
hCh=Find(“MyCh2”); Get(hCh); or Pend(MySem);
hCh = Create(“MyCh3”);Get(hCh); or Pend(MySem);
PktLibFree(msg);Put(hCh,msg);Tibuf *msg = PktLibAlloc(hHeap);hCh=Find(“MyCh3”);
Multicore Training
Case 3 – Reduce context Switching
Zero Copy based Constructions Core to Core
RE
AD
ER
WR
ITE
R
Note – logical function only
1. Reader create a channel based on one of the accumulator queues ahead of time with a given name. 2. When writer has information to write it looks for the channel (find)3. The write asks for buffer and writes the message into the buffer4. The writer put the buffer. The Navigator adds the message to an accumulator queue5. When the number of messages reaches a water mark, or after a pre-defined time out, the
accumulator sends an interrupt to the core6. The reader start processing the message and free after it is done
MyCh4
Accumulator
chRx(driver)
PktLibFree(msg);
Tibuf *msg =Get(hCh);
Delete(hCh);
Put(hCh,msg);Tibuf *msg = PktLibAlloc(hHeap);
hCh=Find(“MyCh4”);
hCh = Create(“MyCh4”);
Multicore Training
ARM to Core Communication
• For protection, User’s space does not involved with physical memory. All queues and descriptors manipulations are done by Kernel Space
• A set of user’s space to Kernel space APIs hides the kernel space operation and the hardware from application code (part of the User’s space)
• Kernel’s virtual queue module (VirtQueue) provides the application with pointers to buffers
Multicore Training
Case 4 – Generic Channel Communication
ARM to DSP communications via Linux Kernel VirtQueue
RE
AD
ER
WR
ITE
R
Note – logical function only
1. Reader create a channel ahead of time with a given name2. When writer has information to write it looks for the channel (find). The kernel is aware of the user’s space
handle3. The write asks for buffer. The kernel dedicate a descriptor to the channel and gives the write a pointer to a
buffer that is associated with the descriptor. The write writes the message into the buffer. 4. The writer put the buffer. The kernel push the descriptor into the right queue. The navigator does loopback
(copy the descriptor data) and free the Kernel queue. Then the navigator load the data into another descriptor and sends it to the appropriate core.
5. When the reader calls get, it gets the message6. The reader responsibility is to free the message after it is done reading
MyCh5
Put(hCh,msg);msg = PktLibAlloc(hHeap);
PktLibFree(msg);
Tibuf *msg =Get(hCh);hCh=Find(“MyCh5”);
hCh = Create(“MyCh5”);
Delete(hCh);
RxPKTDMA
TxPKTDMA
Multicore Training
Case 5 – Low-Latency Channel communication
ARM to DSP communications via Linux Kernel VirtQueue
RE
AD
ER
WR
ITE
R
Note – logical function only
1. Reader create a channel based on one of the pending queues ahead of time with a given name. 2. The reader waits for the message by pending on a (software) semaphore3. When writer has information to write it looks for the channel (find). The Kernel space is aware of the handle4. The write asks for buffer. The kernel dedicate a descriptor to the channel and gives the write a pointer to a buffer that
is associated with the descriptor. The write writes the message into the buffer. 5. The writer put the buffer. The kernel push the descriptor into the right queue. The navigator does loopback (copy the
descriptor data) and free the Kernel queue. Then the navigator load the data into another descriptor , move it to the right queue and generate an interrupt . The ISR post the semaphore to the correct channel
6. The reader start processing the message7. Virtual channel structure enables usage of a single interrupt to post semaphore to one of many channels
PktLibFree(msg);
MyCh6
PktLibFree(msg);
hCh = Create(“MyCh6”);
RxPKTDMA
chIRx(driver) Get(hCh); or Pend(MySem);
TxPKTDMA
Put(hCh,msg);msg = PktLibAlloc(hHeap);
hCh=Find(“MyCh6”);
Delete(hCh);
Multicore Training
Case 6 – Reduce Context Switching
ARM-to-DSP communications via Linux Kernel VirtQueue
RE
AD
ER
WR
ITE
R
Note – logical function only
1. Reader creates a channel based on one of the accumulator queues. The channel is created ahead of time with a given name.
2. When Writer has information to write, it looks for the channel (find). The Kernel space is aware of the handle.
3. The Writer asks for a buffer. The kernel dedicates a descriptor to the channel and gives the Write a pointer to a buffer that is associated with the descriptor. The Writer writes the message into the buffer.
4. The Writer puts the buffer. The Kernel pushes the descriptor into the right queue. The Navigator does a loopback (copies the descriptor data) and frees the Kernel queue. Then the Navigator loads the data into another descriptor. Then the Navigator adds the message to an accumulator queue.
5. When the number of messages reaches a watermark, or after a pre-defined time out, the accumulator sends an interrupt to the core.
6. The Reader starts processing the message and frees it after it is complete.
MyCh7
PktLibFree(msg);
Msg = Get(hCh);
hCh = Create(“MyCh7”);
RxPKTDMA Accumulator
chRx(driver)
TxPKTDMA
Put(hCh,msg);msg = PktLibAlloc(hHeap);
hCh=Find(“MyCh7”);
Delete(hCh);
Multicore Training
Code Example
ReaderhCh = Create(“MyChannel”, ChannelType, struct *ChannelConfig); // Reader specifies what channel it wants to create
// For each messageGet(hCh, &msg) // Either Blocking or Non-blocking call,pktLibFreeMsg(msg); // Not part of IPC API, the way reader frees the message can be application specific
Delete(hCh);
Writer:hHeap = pktLibCreateHeap(“MyHeap); // Not part of IPC API, the way writer allocates the message can be application
specifichCh = Find(“MyChannel”);
//For each messagemsg = pktLibAlloc(hHeap); // Not part of IPC API, the way reader frees the message can be application specificPut(hCh, msg); // Note: if Copy=PacketDMA, msg is freed my Tx DMA.…msg = pktLibAlloc(hHeap); // Not part of IPC API, the way reader frees the message can be application specificPut(hCh, msg);
Multicore Training
pktlib Library
• Purpose –High level library to allocate packets and manipulate packets used by different types of channels
• Enhance capabilities of packets manipulation
Multicore Training
Heap Allocation
• Heap creation – support shared Heaps and private heaps
• Heap is identified by name. It contains Data buffer Packets or Zero Buffer Packets
• Heap size is determined by application• Typical pktlib functions:
– Pktlib_createHeap– Pktlib_findHeapbyName– Pktlib_allocPacket
Multicore Training
Packets Manipulations
• Merge multiple packets into one (linked) packet
• Clone packet• Split Packet into multiple packets• Typical pktlib functions:
– Pktlib_packetMerge– Pktlib_clonePacket– Pktlib_splitPacket
Multicore Training
Pktlib additional features
• Clean up and garbage collection (especially for clone packets and split packets)
• Heap statistics• Cache coherency
Multicore Training
RESMGR Library
• Purpose – set of utilities to manage and distribute system resources between multiple users and applications
• The application asks for a resource. If the resource is available it get it. Otherwise, and error is return
Multicore Training
RESMGR Controls
• General purpose queues• Accumulator Channels• Hardware semaphores• Direct Interrupt queues• Memory region request