C6614/6612 Memory System

Multicore Training


MPBU Application Team


Agenda

1. Overview of the 6614/6612 TeraNet
2. Memory System – DSP CorePac Point of View
   1. Overview of Memory Map
   2. MSMC and External Memory
3. Memory System – ARM Point of View
   1. Overview of Memory Map
   2. ARM Subsystem Access to Memory
4. ARM-DSP CorePac Communication

Agenda (continued)

4. ARM-DSP CorePac Communication
   1. SysLib and its libraries
   2. MSGCOM
   3. Pktlib
   4. Resource Manager

TCI6614 Functional Architecture

[Figure: TCI6614 block diagram. C66x CorePacs at 1.0 GHz / 1.2 GHz, each with 32KB L1P cache, 32KB L1D cache, and 1024KB L2 cache. ARM Cortex-A8 with 32KB L1P cache, 32KB L1D cache, and 256KB L2 cache. Memory subsystem: MSMC, 2MB MSM SRAM, 64-bit DDR3 EMIF. Multicore Navigator: Queue Manager and Packet DMA. Network Coprocessor: Ethernet switch, SGMII x2, Packet Accelerator, Security Accelerator. Coprocessors: BCP, VCP2 x4, FFTC, TCP3d, TAC, RAC x2, RSA. Peripherals: SRIO x4, PCIe x2, UART x2, AIF2 x6, SPI, I2C, EMIF16, USIM. Also shown: HyperLink, TeraNet, EDMA x3, PLL, Boot ROM, Semaphore, Power Management, Debug & Trace.]

C6614 TeraNet Data Connections

[Figure: TeraNet data connections between masters (M) and slaves (S). Switch fabrics: TeraNet 2A (256 bit, CPUCLK/2), TeraNet 2B (256 bit, CPUCLK/2), and TeraNet 3A (128 bit, CPUCLK/3). Masters shown: EDMA_0 (TPCC, 16 ch QDMA, TC0 and TC1), EDMA_1,2 (two TPCCs, 64 ch QDMA each, TC2 to TC5 and TC6 to TC9), Network Coprocessor, HyperLink, SRIO, PCIe, QM_SS, DebugSS, AIF / PktDMA, FFTC / PktDMA, RAC_BE0,1, TAC_FE, and the cores through XMC. Slaves shown: MSMC (shared L2 and DDR3), Core L2 0-3, HyperLink, SRIO, PCIe, QMSS, TAC_BE, RAC_FE, TCP3d, TCP3e_W/R, and VCP2 (x4). The ARM path to TeraNet 2B and to DDR3 is shown together with an MPU.]


SoC Memory Map 1/2

Start Address   End Address   Size    Function
0080 0000       0087 FFFF     512K    L2 SRAM
00E0 0000       00E0 7FFF     32K     L1P
00F0 0000       00F0 7FFF     32K     L1D
0220 0000       0220 007F     128B    Timer 0
0264 0000       0264 07FF     2K      Semaphores
0270 0000       0270 7FFF     32K     EDMA CC
027D 0000       027D 3FFF     16K     TETB Core 0
0C00 0000       0C3F FFFF     4M      Shared L2
1080 0000       1087 FFFF     512K    L2 Core 0 Global
12E0 0000       12E0 7FFF     32K     Core 2 L1P Global
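
The table lists the same kind of memory twice: at a core-local address (L2 SRAM at 0080 0000) and at a per-core global address (1080 0000 for Core 0 L2, 12E0 0000 for Core 2 L1P). The sketch below shows one way to convert between them. It assumes the global alias of core n is 0x1000 0000 + n × 0x0100 0000 + local address, which matches both global rows above; the function name is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: convert a core-local L1/L2 address into the
 * global alias that other masters can use to reach the same memory.
 * Assumed pattern: 0x10000000 + (core << 24) + local address. */
static uint32_t local_to_global(uint32_t local_addr, unsigned core)
{
    assert(core < 4u);                       /* CorePac 0..3 */
    assert(local_addr >= 0x00800000u &&      /* local L2/L1P/L1D window */
           local_addr <  0x01000000u);
    return 0x10000000u + ((uint32_t)core << 24) + local_addr;
}
```

For example, local L1P address 0x00E0 0000 on core 2 becomes 0x12E0 0000, the value in the last row of the table.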

SoC Memory Map 2/2

Start Address   End Address   Size       Function
2000 0000       200F FFFF     1M         System Trace Management Configuration
2180 0000       33FF FFFF     296M+32K   Reserved
3400 0000       341F FFFF     2M         QMSS data
3420 0000       3FFF FFFF     190M       Reserved
4000 0000       4FFF FFFF     256M       HyperLink Data
5000 0000       5FFF FFFF     256M       Reserved
6000 0000       6FFF FFFF     256M       PCIe Data
7000 0000       73FF FFFF     64M        EMIF16 data NAND Memory (CS2)
8000 0000       FFFF FFFF     2G         DDR3 Data

MSMC Block Diagram

[Figure: MSMC block diagram. Four CorePac slave ports (256 bits each) connect CorePac 0 to CorePac 3 to the MSMC; each CorePac reaches its port through its own XMC with MPAX. Two system slave ports come from the TeraNet (256 bits each): one for shared SRAM (SMS) and one for external memory (SES), each with a Memory Protection & Extension Unit (MPAX). Inside the MSMC: MSMC core, MSMC datapath, arbitration, 2048 KB shared RAM, Error Detection & Correction (EDC), and events. Outputs: MSMC system master port (to the TeraNet) and MSMC EMIF master port (to SCR_2_B and the DDR), 256 bits each.]

XMC – External Memory Controller

The XMC is responsible for the following:

1. Address extension/translation
2. Memory protection for addresses outside C66x
3. Shared memory access path
4. Cache and pre-fetch support

User Control of XMC:

1. MPAX (Memory Protection and Extension) Registers
2. MAR (Memory Attributes) Registers

Each core has its own set of MPAX and MAR registers!

The MPAX Registers

MPAX (Memory Protection and Extension) Registers:
• Translate between logical and physical addresses.
• 16 registers (64 bits each) control up to 16 memory segments.
• Each register translates logical memory into physical memory for its segment.
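
The translation described above can be sketched as a small software model. This is not the MPAX register layout: the struct fields, the function name, and the rule that the highest-numbered matching segment wins are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified model of one MPAX segment (not the hardware bit fields). */
typedef struct {
    uint32_t logical_base;   /* segment start in the 32-bit logical space */
    uint32_t size;           /* segment size in bytes */
    uint64_t physical_base;  /* segment start in physical memory */
} mpax_segment_t;

/* Translate a logical address. Returns 1 and stores the physical address
 * on a match, 0 when no segment covers the address. */
static int mpax_translate(const mpax_segment_t *segs, size_t count,
                          uint32_t logical, uint64_t *physical)
{
    assert(segs != NULL && physical != NULL);
    /* Assumption: when segments overlap, the highest-numbered one wins. */
    for (size_t i = count; i-- > 0; ) {
        uint32_t offset = logical - segs[i].logical_base;
        if (logical >= segs[i].logical_base && offset < segs[i].size) {
            *physical = segs[i].physical_base + offset;
            return 1;
        }
    }
    return 0;
}
```

Because each core has its own segment table, two cores can present the same logical address (say 0x8000 1000) and land in different physical memory, simply by loading different `physical_base` values.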

The MAR Registers

MAR (Memory Attributes) Registers:
• 256 registers (32 bits each) control 256 memory segments.
  – Each segment is 16 MB; together the segments cover logical addresses 0x0000 0000 to 0xFFFF FFFF.
  – The first 16 registers are read only. They control the internal memory of the core.
• Each register controls the cacheability (bit 0) and the pre-fetchability (bit 3) of its segment. All other bits are reserved and set to 0.
• All MAR bits are set to zero after reset.
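
The arithmetic behind these bullets is short enough to write down. The sketch below only computes which MAR register covers an address and what value the two attribute bits produce; it does not touch hardware, and all names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define MAR_CACHEABLE    (1u << 0)  /* bit 0: segment is cacheable */
#define MAR_PREFETCHABLE (1u << 3)  /* bit 3: segment is pre-fetchable */

/* 256 segments of 16 MB each: the top 8 address bits select the register. */
static unsigned mar_index(uint32_t logical_addr)
{
    return (unsigned)(logical_addr >> 24);
}

/* Build a MAR value from the two documented attribute bits. */
static uint32_t mar_value(int cacheable, int prefetchable)
{
    uint32_t v = 0;
    if (cacheable)    v |= MAR_CACHEABLE;
    if (prefetchable) v |= MAR_PREFETCHABLE;
    return v;
}

/* Per the slide, the first 16 registers are read only. */
static int mar_is_writable(unsigned index)
{
    assert(index < 256u);
    return index >= 16u;
}
```

With this model, the first DDR3 segment at 0x8000 0000 is covered by MAR 128, and enabling both caching and pre-fetch gives the value 0x9.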

XMC: Typical Use Cases

• Speed up processing by caching shared L2 in each core's private L2 (shared memory then acts as L3).
• Use the same logical address in all cores, with each core's address pointing to a different physical memory.
• Use part of shared L2 for communication between cores: make that part non-cacheable and leave the rest of shared L2 cacheable.
• Use 8 GB of external memory, 2 GB for each core.


ARM CorePac

[Figure: ARM CorePac block diagram. ARM A8 core at 1.2 GHz with Integer Core, Neon Core, L1I 32KB, L1D 32KB, and L2 Cache 256 KB. Sec/Public ROM 176KB and Sec/Public RAM 64KB. AXI2VBUS Bridge (CPU/2), SSM (CPU/2), AINTC (CPU/2) receiving system interrupts, and a clock divider. Debug: ICE Crusher, OCP2ATB, CoreSight Embedded Trace Macrocell, debug bus. Two master ports (Master 0 and Master 1): a 256b VBUSM running at CPU/2, connecting to the ARM_128 switch for DDR_EMIF, and a 128b VBUSM running at CPU/3, connecting to the ARM_64 switch.]


ARM Subsystem Memory Map

ARM Subsystem Ports

• 32-bit ARM addressing (MMU or Kernel)
• 31 bits address the external memory
  – The ARM can address ONLY 2 GB of external DDR (no MPAX translation): 0x8000 0000 to 0xFFFF FFFF
• 31 bits are used to access SoC memory or to address internal memory (ROM)

ARM Visibility Through the TeraNet Connection

• It can see the QMSS data at address 0x3400 0000
• It can see HyperLink data at address 0x4000 0000
• It can see PCIe data at address 0x6000 0000
• It can see shared L2 at address 0x0C00 0000
• It can see EMIF16 data at address 0x7000 0000
  – NAND
  – NOR
  – Asynchronous SRAM
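
As a quick reference, the regions listed above can be put in a lookup table. The start and end addresses are the ones given in the SoC memory map slides; the type and function names are hypothetical, and the table covers only the SoC-level ranges, not the ARM's internal memory map.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    uint32_t    start;
    uint32_t    end;    /* inclusive */
    const char *name;
} region_t;

/* Ranges taken from the SoC memory map slides. */
static const region_t arm_visible[] = {
    { 0x0C000000u, 0x0C3FFFFFu, "Shared L2" },
    { 0x34000000u, 0x341FFFFFu, "QMSS data" },
    { 0x40000000u, 0x4FFFFFFFu, "HyperLink data" },
    { 0x60000000u, 0x6FFFFFFFu, "PCIe data" },
    { 0x70000000u, 0x73FFFFFFu, "EMIF16 data" },
};

/* Return the name of the region containing addr, or NULL if none does. */
static const char *arm_region_name(uint32_t addr)
{
    for (size_t i = 0; i < sizeof arm_visible / sizeof arm_visible[0]; i++) {
        assert(arm_visible[i].start <= arm_visible[i].end);
        if (addr >= arm_visible[i].start && addr <= arm_visible[i].end)
            return arm_visible[i].name;
    }
    return NULL;
}
```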

ARM Access SOC Memory

• Do you see a problem with HyperLink access?
  – Addresses in the 0x4xxx xxxx range are part of the internal ARM memory map
• What about the cache and data from the Shared Memory and the Async EMIF16?
  – The next slide presents a page from the device errata


Errata User’s Note Number 10

Additional Comments About the ARM

• ARM uses only Little Endian.
• DSP CorePac can use Little Endian or Big Endian.
• The User's Guide shows how to mix ARM core Little Endian code with DSP CorePac Big Endian.
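
When the two sides run with different endianness, multi-byte fields in shared data have to be byte-swapped by one side. The helper below is a generic illustration of that swap, not the procedure from the User's Guide; the function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Reverse the byte order of one 32-bit word. */
static uint32_t swap32(uint32_t v)
{
    return ((v & 0x000000FFu) << 24) |
           ((v & 0x0000FF00u) << 8)  |
           ((v & 0x00FF0000u) >> 8)  |
           ((v & 0xFF000000u) >> 24);
}

/* Swap every word of a buffer in place, for example before handing
 * a little-endian message to a big-endian reader. */
static void swap32_buffer(uint32_t *words, size_t count)
{
    assert(words != NULL || count == 0);
    for (size_t i = 0; i < count; i++)
        words[i] = swap32(words[i]);
}
```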


MCSDK Software Layers

[Figure: MCSDK software stack, from top to bottom.]

• Demonstration Applications: HUA/OOB, IO Bmarks, Image Processing
• Software Framework Components: Interprocessor Communication, Instrumentation (MCSA)
• Communication Protocols: TCP/IP Networking (NDK)
• Algorithm Libraries: DSPLIB, IMGLIB, MATHLIB
• Low-Level Drivers (LLDs): EDMA3, PCIe, PA, QMSS, SRIO, CPPI, FFTC, HyperLink, TSIP, …
• Chip Support Library
• Platform/EVM Software: Bootloader, Platform Library, POST, OSAL, Resource Manager, Transports (IPC, NDK)
• SYS/BIOS RTOS
• Hardware


SysLib Library – An IPC element

MSGCOM Library

• Purpose: exchange messages between a reader and a writer.
• Reader and writer applications can reside on the same DSP core, on different DSP cores, or on the ARM and a DSP core.
• Channel-based communication: a channel is defined by the reader (message destination) side. It can support multiple writers (message sources).

Channel Types

• Simple queue channels: messages are placed directly into a destination queue that is associated with a reader.
• Virtual channels: multiple virtual channels are associated with the same hardware queue.
• Queue DMA channels: messages are transferred between the writer and the reader.
• Proxy queue channels: indirect channels that work over BSD sockets and enable communication between a writer and a reader that are not connected to the same Navigator.

Interrupt Types

• No interrupt: the reader polls until a message arrives.
• Direct interrupt: for low-delay systems. Special queues must be used.
• Accumulated interrupts: special queues are used. The reader gets an interrupt when the number of messages crosses a threshold.

Blocking and Non-Blocking

• Blocking: the reader is blocked until a message is available.
• Non-blocking: the reader polls for a message and, if there is no message, continues execution.

Case 1 – Generic Channel Communication

Zero-copy-based constructions, core to core

Reader:
    hCh = Create("MyCh1");
    Tibuf *msg = Get(hCh);
    PktLibFree(msg);
    Delete(hCh);

Writer:
    hCh = Find("MyCh1");
    Tibuf *msg = PktLibAlloc(hHeap);
    Put(hCh, msg);

Note – logical functions only

1. The reader creates a channel ahead of time with a given name.
2. When the writer has information to write, it looks for the channel (find).
3. The writer asks for a buffer and writes the message into the buffer.
4. The writer puts the buffer. The Navigator moves it to the reader's queue.
5. When the reader calls get, it gets the message.
6. The reader is responsible for freeing the message after it is done reading.
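
The six steps above can be sketched as a toy, single-process model. This is not the MSGCOM API: the `ch_*` names, the fixed table sizes, and the use of a plain pointer ring in place of a Navigator hardware queue are all assumptions. Only pointers travel through the queue, which is the sense in which the scheme is zero copy.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_CHANNELS 4
#define QUEUE_DEPTH  8

typedef struct {
    char   name[16];
    int    in_use;
    void  *queue[QUEUE_DEPTH];   /* message pointers only, no payload copy */
    size_t head, count;
} channel_t;

static channel_t channels[MAX_CHANNELS];

/* Reader: create a named channel. Returns NULL if the table is full. */
static channel_t *ch_create(const char *name)
{
    assert(name != NULL && strlen(name) < sizeof channels[0].name);
    for (size_t i = 0; i < MAX_CHANNELS; i++) {
        if (!channels[i].in_use) {
            memset(&channels[i], 0, sizeof channels[i]);
            strcpy(channels[i].name, name);
            channels[i].in_use = 1;
            return &channels[i];
        }
    }
    return NULL;
}

/* Writer: look up a channel by name. Returns NULL if it does not exist. */
static channel_t *ch_find(const char *name)
{
    assert(name != NULL);
    for (size_t i = 0; i < MAX_CHANNELS; i++)
        if (channels[i].in_use && strcmp(channels[i].name, name) == 0)
            return &channels[i];
    return NULL;
}

/* Writer: enqueue a message pointer. Returns 0 on success, -1 if full. */
static int ch_put(channel_t *ch, void *msg)
{
    assert(ch != NULL);
    if (ch->count == QUEUE_DEPTH)
        return -1;
    ch->queue[(ch->head + ch->count) % QUEUE_DEPTH] = msg;
    ch->count++;
    return 0;
}

/* Reader: non-blocking get. Returns NULL when the channel is empty. */
static void *ch_get(channel_t *ch)
{
    assert(ch != NULL);
    if (ch->count == 0)
        return NULL;
    void *msg = ch->queue[ch->head];
    ch->head = (ch->head + 1) % QUEUE_DEPTH;
    ch->count--;
    return msg;
}

/* Reader: release the channel so the name can no longer be found. */
static void ch_delete(channel_t *ch)
{
    assert(ch != NULL);
    ch->in_use = 0;
}
```

Buffer allocation and freeing are left out on purpose; in the slides they belong to the packet library, not to the channel.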

Case 2 – Low-Latency Channel Communication

Single and Virtual Channel

Reader (MyCh2):
    hCh = Create("MyCh2");
    Get(hCh); or Pend(MySem);
    PktLibFree(msg);

Reader (MyCh3):
    hCh = Create("MyCh3");
    Get(hCh); or Pend(MySem);
    PktLibFree(msg);

Writer (MyCh2):
    hCh = Find("MyCh2");

Writer (MyCh3):
    hCh = Find("MyCh3");
    Tibuf *msg = PktLibAlloc(hHeap);
    Put(hCh, msg);

chRx (driver): posts the internal semaphore, and/or a callback posts MySem.

1. The reader creates a channel ahead of time, with a given name, based on one of the pending queues.
2. The reader waits for the message by pending on a (software) semaphore.
3. When the writer has information to write, it looks for the channel (find).
4. The writer asks for a buffer and writes the message into the buffer.
5. The writer puts the buffer. The Navigator generates an interrupt. The ISR posts the semaphore to the correct channel.
6. The reader starts processing the message.
7. The virtual channel structure enables use of a single interrupt to post the semaphore to one of many channels.
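
Step 7 is the part that is easiest to miss, so here is a small model of it: one interrupt handler serves several virtual channels and posts the semaphore of whichever channel the message names. The structures, the counter standing in for a semaphore, and the function names are all hypothetical; a real system would use RTOS semaphores and a hardware queue.

```c
#include <assert.h>
#include <stddef.h>

typedef struct {
    int   vch;        /* virtual channel the message belongs to */
    void *payload;
} hw_msg_t;

typedef struct {
    unsigned sem_count;   /* counting semaphore the reader pends on */
    void    *last_msg;    /* message handed to the reader */
} vchannel_t;

/* One ISR serves every virtual channel that shares the hardware queue. */
static void shared_isr(vchannel_t *vch, size_t num_vch, const hw_msg_t *msg)
{
    assert(vch != NULL && msg != NULL);
    assert(msg->vch >= 0 && (size_t)msg->vch < num_vch);
    vch[msg->vch].last_msg = msg->payload;
    vch[msg->vch].sem_count++;           /* "post" the semaphore */
}

/* Non-blocking stand-in for Pend(): returns 1 if a post was consumed. */
static int try_pend(vchannel_t *c)
{
    assert(c != NULL);
    if (c->sem_count == 0)
        return 0;
    c->sem_count--;
    return 1;
}
```

A message tagged for channel 1 wakes only the reader of channel 1; the reader of channel 0 stays pended even though both share the same interrupt.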

Case 3 – Reduce Context Switching

Reader:
    hCh = Create("MyCh4");
    Tibuf *msg = Get(hCh);
    PktLibFree(msg);
    Delete(hCh);

Writer:
    hCh = Find("MyCh4");

Between writer and reader: Accumulator, chRx (driver)

1. The reader creates a channel ahead of time, with a given name, based on one of the accumulator queues.
2. When the writer has information to write, it looks for the channel (find).
3. The writer asks for a buffer and writes the message into the buffer.
4. The writer puts the buffer. The Navigator adds the message to an accumulator queue.
5. When the number of messages reaches a watermark, or after a pre-defined timeout, the accumulator sends an interrupt to the core.
6. The reader starts processing the messages and frees each one after it is done.

ARM to Core Communication

• For protection, user space is not involved with physical memory. All queue and descriptor manipulations are done in kernel space.
• A set of user-space-to-kernel-space APIs hides the kernel-space operation and the hardware from application code (which is part of user space).
• The kernel's virtual queue module (VirtQueue) provides the application with pointers to buffers.

Case 4 – Generic Channel Communication

ARM-to-DSP communications via Linux Kernel VirtQueue

Reader:
    Tibuf *msg = Get(hCh);
    PktLibFree(msg);
    Delete(hCh);

Writer:
    hCh = Find("MyCh5");
    msg = PktLibAlloc(hHeap);
    Put(hCh, msg);

Between writer and reader: Tx PKTDMA, Rx PKTDMA

1. The reader creates a channel ahead of time with a given name.
2. When the writer has information to write, it looks for the channel (find). The kernel is aware of the user-space handle.
3. The writer asks for a buffer. The kernel dedicates a descriptor to the channel and gives the writer a pointer to a buffer that is associated with the descriptor. The writer writes the message into the buffer.
4. The writer puts the buffer. The kernel pushes the descriptor into the right queue. The Navigator does a loopback (copies the descriptor data) and frees the kernel queue. Then the Navigator loads the data into another descriptor and sends it to the appropriate core.
5. When the reader calls get, it gets the message.
6. The reader is responsible for freeing the message after it is done reading.

Case 5 – Low-Latency Channel Communication

ARM-to-DSP communications via Linux Kernel VirtQueue

[Figure: writer and reader connected through channel MyCh6, with Tx PKTDMA, Rx PKTDMA, and the chIRx driver. Calls shown: Get(hCh); or Pend(MySem); PktLibFree(msg); Delete(hCh);]

1. The reader creates a channel ahead of time, with a given name, based on one of the pending queues.
2. The reader waits for the message by pending on a (software) semaphore.
3. When the writer has information to write, it looks for the channel (find). The kernel space is aware of the handle.
4. The writer asks for a buffer. The kernel dedicates a descriptor to the channel and gives the writer a pointer to a buffer that is associated with the descriptor. The writer writes the message into the buffer.
5. The writer puts the buffer. The kernel pushes the descriptor into the right queue. The Navigator does a loopback (copies the descriptor data) and frees the kernel queue. Then the Navigator loads the data into another descriptor, moves it to the right queue, and generates an interrupt. The ISR posts the semaphore to the correct channel.
6. The reader starts processing the message.
7. The virtual channel structure enables use of a single interrupt to post the semaphore to one of many channels.

Case 6 – Reduce Context Switching

ARM-to-DSP communications via Linux Kernel VirtQueue

[Figure: writer and reader connected through channel MyCh7, with Tx PKTDMA, Rx PKTDMA, the accumulator, and the chRx driver. Calls shown: Msg = Get(hCh); PktLibFree(msg); Delete(hCh);]

1. The Reader creates a channel based on one of the accumulator queues. The channel is created ahead of time with a given name.

2. When the Writer has information to write, it looks for the channel (find). The Kernel space is aware of the handle.

3. The Writer asks for a buffer. The Kernel dedicates a descriptor to the channel and gives the Writer a pointer to a buffer that is associated with the descriptor. The Writer writes the message into the buffer.

4. The Writer puts the buffer. The Kernel pushes the descriptor into the right queue. The Navigator does a loopback (copies the descriptor data) and frees the Kernel queue. Then the Navigator loads the data into another descriptor and adds the message to an accumulator queue.

5. When the number of messages reaches a watermark, or after a pre-defined timeout, the accumulator sends an interrupt to the core.

6. The Reader starts processing the message and frees it after it is complete.

Code Example

Reader:
    hCh = Create("MyChannel", ChannelType, struct *ChannelConfig);  // Reader specifies what channel it wants to create

    // For each message
    Get(hCh, &msg);        // Either blocking or non-blocking call
    pktLibFreeMsg(msg);    // Not part of IPC API; the way the reader frees the message can be application specific

    Delete(hCh);

Writer:
    hHeap = pktLibCreateHeap("MyHeap");  // Not part of IPC API; the way the writer allocates the message can be application specific
    hCh = Find("MyChannel");

    // For each message
    msg = pktLibAlloc(hHeap);  // Not part of IPC API; the way the writer allocates the message can be application specific
    Put(hCh, msg);             // Note: if Copy=PacketDMA, msg is freed by the Tx DMA
    …
    msg = pktLibAlloc(hHeap);
    Put(hCh, msg);

pktlib Library

• Purpose: high-level library to allocate and manipulate the packets used by the different types of channels.
• Enhances packet manipulation capabilities.

Heap Allocation

• Heap creation supports shared heaps and private heaps.
• A heap is identified by name. It contains Data Buffer Packets or Zero Buffer Packets.
• Heap size is determined by the application.
• Typical pktlib functions:
  – Pktlib_createHeap
  – Pktlib_findHeapbyName
  – Pktlib_allocPacket
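
To make the heap idea concrete, here is a toy fixed-size packet pool. It is not the pktlib implementation and does not use the Pktlib_* signatures; the types, sizes, and `heap_*` names are hypothetical. It shows the three properties the bullets mention: a named heap, an application-chosen size, and allocation that fails cleanly when the heap is empty.

```c
#include <assert.h>
#include <stddef.h>

#define HEAP_PACKETS 4     /* heap size chosen by the application */
#define PACKET_BYTES 64

typedef struct {
    unsigned char data[PACKET_BYTES];
    int in_use;
} packet_t;

typedef struct {
    const char *name;                 /* a heap is identified by name */
    packet_t packets[HEAP_PACKETS];
} heap_t;

/* Take one free packet from the heap. Returns NULL when none is left. */
static packet_t *heap_alloc(heap_t *h)
{
    assert(h != NULL);
    for (size_t i = 0; i < HEAP_PACKETS; i++) {
        if (!h->packets[i].in_use) {
            h->packets[i].in_use = 1;
            return &h->packets[i];
        }
    }
    return NULL;
}

/* Return a packet to its heap. */
static void heap_free(packet_t *p)
{
    assert(p != NULL && p->in_use);
    p->in_use = 0;
}

/* Simple heap statistic: number of packets still available. */
static size_t heap_available(const heap_t *h)
{
    assert(h != NULL);
    size_t n = 0;
    for (size_t i = 0; i < HEAP_PACKETS; i++)
        if (!h->packets[i].in_use)
            n++;
    return n;
}
```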

Packet Manipulations

• Merge multiple packets into one (linked) packet
• Clone a packet
• Split a packet into multiple packets
• Typical pktlib functions:
  – Pktlib_packetMerge
  – Pktlib_clonePacket
  – Pktlib_splitPacket
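
Merging into one linked packet can be pictured as chaining descriptors. The sketch below models only that linking step with a hypothetical `pkt_t` type; it is not the Pktlib_packetMerge implementation, and cloning and splitting are not covered.

```c
#include <assert.h>
#include <stddef.h>

typedef struct pkt {
    size_t      len;    /* bytes in this packet's own buffer */
    struct pkt *next;   /* next packet in the chain, or NULL */
} pkt_t;

/* Link `second` to the end of the chain that starts at `first`.
 * No payload is copied; only the chain pointers change. */
static pkt_t *pkt_merge(pkt_t *first, pkt_t *second)
{
    assert(first != NULL && second != NULL);
    pkt_t *tail = first;
    while (tail->next != NULL)
        tail = tail->next;
    tail->next = second;
    return first;
}

/* Total payload length of a (possibly linked) packet. */
static size_t pkt_total_len(const pkt_t *p)
{
    size_t total = 0;
    for (; p != NULL; p = p->next)
        total += p->len;
    return total;
}
```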

Pktlib Additional Features

• Clean-up and garbage collection (especially for cloned packets and split packets)
• Heap statistics
• Cache coherency

RESMGR Library

• Purpose: a set of utilities to manage and distribute system resources between multiple users and applications.
• The application asks for a resource. If the resource is available, the application gets it. Otherwise, an error is returned.
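
The request-or-error behaviour can be sketched with a small ownership table. This is a model, not the RESMGR API: the resource count, the return conventions, and the `resmgr_*` names are assumptions.

```c
#include <assert.h>
#include <stddef.h>

#define NUM_RESOURCES 8    /* e.g. a pool of general purpose queues */

typedef struct {
    unsigned char taken[NUM_RESOURCES];
} resmgr_t;

/* Ask for any free resource. Returns its id, or -1 if none is available. */
static int resmgr_request(resmgr_t *rm)
{
    assert(rm != NULL);
    for (int i = 0; i < NUM_RESOURCES; i++) {
        if (!rm->taken[i]) {
            rm->taken[i] = 1;
            return i;
        }
    }
    return -1;
}

/* Give a resource back. Returns 0 on success, -1 for an invalid or
 * already free id. */
static int resmgr_release(resmgr_t *rm, int id)
{
    assert(rm != NULL);
    if (id < 0 || id >= NUM_RESOURCES || !rm->taken[id])
        return -1;
    rm->taken[id] = 0;
    return 0;
}
```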

RESMGR Controls

• General purpose queues
• Accumulator channels
• Hardware semaphores
• Direct interrupt queues
• Memory region requests
