Blackfin Speedway Presentation Core, Memory, and Peripherals

Support Across The Board™


Copyright © Avnet, Inc., Analog Devices, Inc. All rights reserved.

Blackfin as a Convergent Processor

Commonly asked questions:

What makes Blackfin a “convergent” processor?

What architectural features enable convergent processing?

What type of performance can Blackfin achieve from a networking standpoint?


Agenda

• Blackfin “Convergent Processing”

• Blackfin Core Details

– Registers

– ALU, MAC, Shifter

– Sequencer, Pipeline, Event Controller

• Blackfin Memory

– Memory Architecture

– Cache

• Peripherals

– General Peripherals (UART, SPORT, SPI, TWI, WD, RTC)

– Ethernet, CAN

– PPI

– DMA


What architectural features enable convergent processing?

• Integrated instruction set architecture

– Single instruction set for signal processing and control

• Programmable interrupt levels

– Real-time tasks get the highest priority level

• Memory protection with an MMU

– Regions of memory can be protected from access

• Networked peripherals in addition to high-speed connectivity to ADC, DAC and video peripherals

• Unified address space and byte addressable

• Support for User and Supervisor modes

• Robust ALU including both signal processing functions and traditional MCU/MPU functions


What makes Blackfin a Convergent Processor?

• Blackfin has a mature compiler that produces highly optimized code (with an option to produce “dense code” for control applications)

• Blackfin processors come with a full suite of C-based device drivers for peripherals

– Fully documented, common APIs

• Blackfin beats the competition in terms of DSP benchmarks and it is on par with ARM code density benchmarks

• Blackfin is scalable across a broad set of applications

– ADSP-BF531 on the low end

– Dual-core ADSP-BF561 on the high end

• Latest peripheral integration expands connectivity to network-based applications

• Large set of options for OS and kernel support, including uClinux


Blackfin ADSP-BF536/537 Architecture

Overview


Blackfin Architecture Basics

Core Registers

ALU, MAC, Shifter

Data Addressing Modes

Program Sequencer

Event Controller

Peripherals

Instruction Set Overview

Memory Architecture

Cache


Section 1

Register File


Accessing Registers

• Blackfin processors are register-intensive devices

– All computations are performed on data contained in registers

– All peripherals are setup using registers

– Memory is accessed using pointers in address registers

• There are two types of Blackfin processor registers

– Core registers

– Memory-mapped registers (MMRs)


Blackfin Core Registers

• Core registers are accessed directly by name

– Data Registers: R0-R7

– Accumulator Registers: A0, A1

– Pointer Registers: P0-P5, FP, SP, USP

– DAG Registers: I0-I3, M0-M3, B0-B3, L0-L3

– Cycle Counters: CYCLES, CYCLES2

– Program Sequencer: SEQSTAT

– System Configuration Register: SYSCFG

– Loop Registers: LT[1:0], LB[1:0], LC[1:0]

– Interrupt Return Registers: RETI, RETX, RETN, RETE

Example:

R0 = SYSCFG; // Load data register with contents of SYSCFG register


Core Registers

[Figure: core register file. Registers are 32 bits wide (bits 31-0) unless noted.]

Data Registers: R0-R7, each also addressable as two 16-bit halves, Rn.H (bits 31-16) and Rn.L (bits 15-0)

Accumulators: A0 and A1, each 40 bits wide: AnX (bits 39-32), An.H (bits 31-16), An.L (bits 15-0)

Address Registers: P0-P5, FP, SP, USP; I0-I3, L0-L3, B0-B3, M0-M3

Loop Registers: LC0, LC1 (Loop Counter); LT0, LT1 (Loop Top); LB0, LB1 (Loop Bottom)

ASTAT (Arithmetic Status)

RETS (Subroutine Return), RETI (Interrupt Return), RETX (Exception Return), RETN (NMI Return), RETE (Emulation Return)

SYSCFG (System Config), SEQSTAT (Sequencer Status)

Shaded registers only accessible in Supervisor mode


Memory-Mapped Registers (MMRs)

• A majority of registers are memory-mapped and must be accessed indirectly

– Core MMRs are used to configure the core registers

• They are listed in Appendix A of the HRM

• All Core MMRs must be accessed with 32-bit reads or writes

– System MMRs are used to configure all other peripherals

• They are listed in Appendix B of the HRM

• Some System MMRs must be accessed with 32-bit reads or writes and others with 16-bit reads or writes (see the HRM for details)

• MMR addresses are defined in header files

– defBF53x.h for assembly

– cdefBF53x.h for C/C++

• MMRs can only be accessed in Supervisor mode

Assembly Example:

P0.H = HI(SPI_RDBR); // load upper 16 bits of SPI Receive Register address to pointer register
P0.L = LO(SPI_RDBR); // load lower 16 bits of SPI Receive Register address to pointer register
R0 = W[P0] (z); // read 16-bit SPI Receive Register (SPI_RDBR) into data register, zero-extended

C/C++ Example:

short temp; // define variable to store contents
temp = *pSPI_RDBR; // read 16-bit SPI Receive Register contents into data element
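The assembly example loads a 32-bit MMR address into a pointer register as two 16-bit halves. The sketch below mimics that split on a host machine. The helper names addr_hi, addr_lo and addr_join are hypothetical stand-ins for the assembler's HI() and LO() operators, and the address in the usage note is purely illustrative; real addresses come from defBF53x.h.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the assembler's HI() operator: upper 16 bits. */
uint16_t addr_hi(uint32_t addr) { return (uint16_t)(addr >> 16); }

/* Hypothetical stand-in for the assembler's LO() operator: lower 16 bits. */
uint16_t addr_lo(uint32_t addr) { return (uint16_t)(addr & 0xFFFFu); }

/* Rebuild the 32-bit pointer value the way the two half-register loads
 * (P0.H = ...; P0.L = ...;) assemble it. */
uint32_t addr_join(uint16_t hi, uint16_t lo)
{
    uint32_t joined = ((uint32_t)hi << 16) | lo;
    assert(addr_hi(joined) == hi && addr_lo(joined) == lo);
    return joined;
}
```

For an illustrative address of 0xFFC00510, addr_hi gives 0xFFC0 and addr_lo gives 0x0510; addr_join of those halves returns the original value.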


Section 2

Arithmetic Logic Units (ALU)


Arithmetic Logic Unit (ALU)

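[Figure: Data Arithmetic Unit. The data register file (R0-R7, each split into Rn.H and Rn.L halves) feeds two 16-bit multipliers, two 40-bit ALUs with accumulators A0 and A1, four 8-bit video ALUs, and a 40-bit barrel shifter. Two 32-bit load buses (LD0, LD1) and one 32-bit store bus (SD) connect the unit to memory.]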


Arithmetic Logic Unit (ALU)

• Two 40-bit ALUs operate on 16-bit, 32-bit, and 40-bit input data and output 16-bit, 32-bit, and 40-bit results.

• Functions

– Fixed-point addition and subtraction

– Addition and subtraction of immediate values

– Accumulation and subtraction of multiplier results

– Logical AND, OR, NOT, XOR, bitwise XOR (LFSR), Negate

– Functions: ABS, MAX, MIN, Round, division primitives

– Supports conditional instructions

• Four 8-bit video ALUs


40-bit ALU Operations

• 40-bit ALU operations support the following combinations:

– Single 16-Bit Operations

– Dual 16-Bit Operations

– Quad 16-Bit Operations

– Single 32-Bit Operations

– Dual 32-Bit Operations


Section 3

Multiply-Accumulators (MAC)







• Two identical MACs

– Each performs fixed-point multiplication and multiply-accumulate operations on 16-bit fixed-point input data and outputs 32-bit or 40-bit results.

• Functions

– Multiplication

– Multiply-accumulate with addition

– Multiply-accumulate with subtraction

– Dual versions of the above

• Features

– Saturation of accumulator results

– Optional rounding of multiplier results
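As a rough illustration of "saturation of accumulator results", the following sketch models a multiply-accumulate on 16-bit signed integer inputs with a 40-bit saturating accumulator held in an int64_t. It is a host-side model under stated assumptions, not the hardware's exact behavior: it covers signed integer mode only (no fractional left shift, no rounding), and the names sat40, mac16 and msu16 are made up for this example.

```c
#include <assert.h>
#include <stdint.h>

#define ACC40_MAX ((int64_t)0x7FFFFFFFFFLL)    /*  2^39 - 1 */
#define ACC40_MIN (-(int64_t)0x8000000000LL)   /* -2^39     */

/* Clamp a wide intermediate value to the signed 40-bit range. */
int64_t sat40(int64_t v)
{
    if (v > ACC40_MAX) return ACC40_MAX;
    if (v < ACC40_MIN) return ACC40_MIN;
    return v;
}

/* Multiply-accumulate with addition: acc += x * y, saturated to 40 bits. */
int64_t mac16(int64_t acc, int16_t x, int16_t y)
{
    assert(acc >= ACC40_MIN && acc <= ACC40_MAX);
    int32_t product = (int32_t)x * (int32_t)y;  /* magnitude at most 2^30 */
    return sat40(acc + product);
}

/* Multiply-accumulate with subtraction: acc -= x * y, saturated to 40 bits. */
int64_t msu16(int64_t acc, int16_t x, int16_t y)
{
    assert(acc >= ACC40_MIN && acc <= ACC40_MAX);
    int32_t product = (int32_t)x * (int32_t)y;
    return sat40(acc - product);
}
```

The eight bits above bit 31 act as headroom: many products can accumulate before the result needs clamping, and once the limit is reached the value sticks at the maximum or minimum instead of wrapping.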


Section 4

Barrel-Shifter (Shifter)







• Performs bitwise shifting for 16-bit, 32-bit or 40-bit inputs and yields 16-bit, 32-bit, or 40-bit outputs.

• Shift Functions

– Arithmetic Shifts preserve the sign of the original number. The sign bit value back-fills the left-most bit positions vacated by the arithmetic right shift.

– Logical Shifts discard any bits shifted out of the register and back-fill the vacated bits with zeros.



• Additional Functions

– Rotate: Rotates a register value through the CC bit by a specified distance and direction.

– Bit Operations – Set, Clear, Toggle, Test

– Field Extract and Deposit
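The difference between the two right-shift flavors can be sketched in portable C. This is a host-side illustration only; the function names are invented for the example, and the arithmetic version avoids relying on how a given compiler shifts negative signed values.

```c
#include <assert.h>
#include <stdint.h>

/* Logical right shift: vacated upper bits are filled with zeros. */
uint32_t lshift_right32(uint32_t v, unsigned n)
{
    assert(n < 32);
    return v >> n;
}

/* Arithmetic right shift: vacated upper bits are filled with the sign bit. */
int32_t ashift_right32(int32_t v, unsigned n)
{
    assert(n < 32);
    if (v >= 0)
        return (int32_t)((uint32_t)v >> n);
    /* For negative v, shift the bitwise complement (which is non-negative)
     * and complement back; -x - 1 equals ~x in two's complement. */
    uint32_t inverted = ~(uint32_t)v;
    return -(int32_t)(inverted >> n) - 1;
}
```

Shifting -16 right by 2 arithmetically gives -4, while shifting the bit pattern 0x80000000 right by 4 logically gives 0x08000000.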


Section 5

Data Addressing Modes



Address Registers

One set of 32-bit general-purpose Pointer registers P0-P5, SP and FP

One set of 32-bit DSP buffer addressing registers I0-I3, B0-B3, L0-L3, M0-M3

All addresses are byte addresses into a 4 GB address space

SP points to supervisor stack in Supervisor mode and user stack in User mode

USP is accessible in supervisor mode only – Allows access to user stack location while in Supervisor mode


Addressing Methods

• Register Indirect Addressing

– Index Registers (32-bit and 16-bit accesses)

– Pointer Registers P0 – P5 (32-bit, 16-bit, and 8-bit accesses)

– Stack and Frame Pointer Registers (32-bit accesses)

• Types of address pointer modify

– Modify/Post-Modify

• Linear addressing

• Circular buffering / modulo addressing

– Enables automatic maintenance of pointers to stay within bounds of a circular buffer

• Bit-Reversal (Modify only)

– Pre-Modify with update (using Stack Pointer)

– Pre-Modify without update


Linear vs Circular Buffering

• Linear Buffer Access

– Index (I0:3) registers hold the address sent out on the address bus.

– Length (L0:3) register set to 0, thus disabling circular buffering.

• Default for C compiler

• Provisions in compiler to allow circular buffers

– Modify (M0:3) registers contain the value (positive or negative) that is added to the I registers at the end of each memory access.

• Circular Buffer Access

– Base (B0:3) registers contain the circular buffer’s start address.

– Length (L0:3) register set to length of circular buffer.

– Modify (M0:3) value must be less than or equal to the length of the circular buffer.

– The Index wraps around (Length is subtracted from it) when a modification would move it to or beyond Base + Length


Circular Buffer Example

Base Address and Starting Index Address (B0 = 0; I0 = 0;)

Buffer Length is 44 (L0 = 44;): there are 11 data elements and each data element is 4 bytes

Modify Value is 16 (M0 = 16;): 4 elements * 4 bytes/element

Address   Data          Access
0x00      0x00000001    1st Access
0x04      0x00000002    4th Access
0x08      0x00000003
0x0C      0x00000004
0x10      0x00000005    2nd Access
0x14      0x00000006    5th Access
0x18      0x00000007
0x1C      0x00000008
0x20      0x00000009    3rd Access
0x24      0x0000000A
0x28      0x0000000B
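The access order in this example can be reproduced with a small host-side model of post-modify circular addressing. The struct and function names below are hypothetical; the fields stand in for the B0, L0 and I0 registers, and the wrap rule is the one described for circular buffers (subtract or add the length when the index leaves the buffer).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-ins for one DAG register set: B0 (base), L0 (length), I0 (index). */
typedef struct {
    uint32_t base;
    uint32_t length;   /* 0 disables circular buffering (linear mode) */
    uint32_t index;
} circ_dag;

/* Return the address used for this access, then post-modify the index. */
uint32_t circ_post_modify(circ_dag *d, int32_t modify)
{
    uint32_t used = d->index;
    int64_t next = (int64_t)d->index + modify;

    if (d->length != 0) {
        /* The modify magnitude must not exceed the buffer length. */
        assert(modify <= (int64_t)d->length && -(int64_t)modify <= (int64_t)d->length);
        if (next >= (int64_t)d->base + d->length)
            next -= d->length;          /* wrapped past the end */
        else if (next < (int64_t)d->base)
            next += d->length;          /* wrapped before the start */
    }
    d->index = (uint32_t)next;
    return used;
}

/* Record the addresses of the first n accesses into out[]. */
void circ_trace(circ_dag d, int32_t modify, uint32_t *out, size_t n)
{
    for (size_t i = 0; i < n; i++)
        out[i] = circ_post_modify(&d, modify);
}
```

With base 0, length 44, index 0 and a modify value of 16, the first five accesses land at 0x00, 0x10, 0x20, 0x04 and 0x14: the fourth access wraps because 0x20 + 0x10 = 0x30 is past the end of the 44-byte (0x2C) buffer.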


Section 6

Program Sequencer


Program Sequencer Features

• Controls all program flow

• Contains a 10-stage instruction pipeline

• Maintains in-program branching

– Subroutines

– Jumps

– Interrupts and Exceptions

• Maintains loops

– Includes zero-overhead loop registers

– No cost for wrapping from loop bottom to loop top


Blackfin Execution Pipeline

• 10-stage super-pipeline

• Sequencer ensures that the pipeline is fully interlocked and that all the data hazards are hidden from the programmer

• If executing an instruction that requires data to be fetched, the pipeline will stall until that data is available

– See EE-197 application note for a complete list of stalls and multi-cycle instructions: http://www.analog.com/ee-notes


Avoiding Pipeline Stalls

Most common numeric operations have no instruction latency

VisualDSP++ Pipeline Viewer highlights Stall and Kill conditions


Sequencer-Related Registers


Section 7

Event Controller


Events (Interrupts / Exceptions)

• The Event Controller manages 5 types of Events

– Emulation (via external pin)

– Reset (via SW or external pin)

– Non-Maskable Interrupt (NMI) - for events that require immediate processor attention (via SW, external pin, or Watchdog)

– Exception

– Interrupts

• Hardware Error

• Core Timer

• 9 General-Purpose Interrupts for servicing peripherals

– Can be custom prioritized for optimal system performance

• All events can be serviced by Interrupt Service Routines (ISR)


Interrupts vs. Exceptions

INTERRUPTS

• Hardware-generated

– Asynchronous to program flow

– Requested by a peripheral

• Software-generated

– Synchronous to program flow

– Generated by RAISE instruction

• All instructions preceding the interrupt in the pipeline are killed

EXCEPTIONS

• Service Exception

– Return address (RETX) is the address following the excepting instruction

– Never re-executed

– EXCPT instruction is in this category

• Error Condition Exception

– Return address (RETX) is the address of the excepting instruction

– Excepting instruction will be re-executed

The Blackfin is always in Supervisor Mode while executing Event Handler software and can be in User Mode only while executing application tasks.


BF533 System and Core Interrupt Controllers

Core events, listed from highest priority (top) to lowest priority (bottom):

Event Source                 IVG #    Core Event Name
Emulator                     0        EMU
Reset                        1        RST
Non Maskable Interrupt       2        NMI
Exceptions                   3        EVSW
Reserved                     4        -
Hardware Error               5        IVHW
Core Timer                   6        IVTMR
General Purpose 7            7        IVG7
General Purpose 8 to 15      8-15     IVG8 to IVG15

System interrupt sources and the IVG each one maps to (see note 1):

System Interrupt Source           IVG #
PLL Wakeup interrupt              IVG7
DMA error (generic)               IVG7
PPI error interrupt               IVG7
SPORT0 error interrupt            IVG7
SPORT1 error interrupt            IVG7
SPI error interrupt               IVG7
UART error interrupt              IVG7
RTC interrupt                     IVG8
DMA 0 interrupt (PPI)             IVG8
DMA 1 interrupt (SPORT0 RX)       IVG9
DMA 2 interrupt (SPORT0 TX)       IVG9
DMA 3 interrupt (SPORT1 RX)       IVG9
DMA 4 interrupt (SPORT1 TX)       IVG9
DMA 5 interrupt (SPI)             IVG10
DMA 6 interrupt (UART RX)         IVG10
DMA 7 interrupt (UART TX)         IVG10
Timer0 interrupt                  IVG11
PF interrupt A                    IVG12
PF interrupt B                    IVG12
DMA 8/9 interrupt (MemDMA0)       IVG13
DMA 10/11 interrupt (MemDMA1)     IVG13
Watchdog Timer Interrupt          IVG13

1 Note: Default IVG configuration shown.
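The priority ordering in the core event table can be modeled as "lowest event number wins". The sketch below is a simplified host-side illustration, not the controller's actual logic: the function name and the two 16-bit bitmaps (one bit per event number 0 to 15) are assumptions made for the example, and it treats every event as maskable.

```c
#include <assert.h>
#include <stdint.h>

#define NUM_CORE_EVENTS 16

/* Return the number of the highest-priority event that is both pending and
 * enabled (0 = Emulator, highest; 15 = IVG15, lowest), or -1 if none is. */
int highest_priority_event(uint16_t pending, uint16_t enabled)
{
    uint16_t active = pending & enabled;
    for (int ev = 0; ev < NUM_CORE_EVENTS; ev++) {
        if (active & (1u << ev))
            return ev;
    }
    assert(active == 0);
    return -1;
}
```

If IVG7 and IVG11 are both pending and enabled (bitmap 0x0880), the function returns 7; with only IVG11 enabled it returns 11.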


Event Processing Flow


Interrupt Service Routine (ISR)

• ISR address is stored in the Event Vector Table

– Used as the next fetch address when the event occurs

• Program Counter (PC) address is saved to a register

– RETI, RETX, RETN, RETE, based on event

• Always concludes with “Return” Instruction

– RTI, RTX, RTN, RTE (respectively)

– When executed, PC is loaded with address stored in RETI, RETX, RETN, or RETE to continue app code

• Optional nesting of higher-priority interrupts possible

– See appnote EE-192, which covers writing interrupt routines in C (http://www.analog.com/ee-notes)


Section 8

Blackfin Peripherals


Peripherals and Power Management

Common Peripherals (All Blackfins)

• SPI, UART, SPORT, WD, RTC

• PPI

BF534/BF536/BF537 Peripherals

• TWI, CAN

BF536/BF537 Peripheral

• Ethernet

DMA and Handshake DMA

Power Manager


Three Serial Communication Peripherals

• SPI (Serial Peripheral Interface)

– High-Speed SPI port (up to SCLK/4, max 33.25 MHz)

• Master/Slave compatible with control of up to 7 slave-selects

• Single-Duplex DMA (Either TX or RX)

– Typically used to interface with serial EPROMs, CPUs, converters, and displays

• UART (Universal Asynchronous Receiver/Transmitter)

– PC-style UART port (baud rate up to SCLK/16, max 8.3125 MHz)

• Supports half-duplex IrDA SIR (9.6/115.2 Kbps rate)

• Autobaud detection support through the use of the Timers

• Separate TX and RX DMA support

– Typically used for maintenance port or interfacing with slow serial peripherals

• SPORTs (Synchronous Serial Ports)

– High Speed Serial Port (up to SCLK/2, max 66.5 MHz)

• Variable word length support (3 - 32 bits)

• I2S-Compatible

• Separate TX and RX DMA support

• 128 Channels out of 1024-Channel Window for TDM support

• Primary and Secondary Data channels

– Typically used for interfacing with CODECs and TDM data streams
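The maximum rates quoted for the three serial peripherals follow directly from the SCLK divisors. The short sketch below reproduces the arithmetic; the function names are invented for this example, and the 133 MHz system clock in the usage note is an assumption inferred from the quoted maxima (33.25 MHz x 4).

```c
#include <assert.h>
#include <stdint.h>

/* Maximum SPI clock: SCLK / 4. */
uint32_t spi_max_hz(uint32_t sclk_hz)
{
    assert(sclk_hz > 0);
    return sclk_hz / 4u;
}

/* Maximum UART rate: SCLK / 16. */
uint32_t uart_max_rate(uint32_t sclk_hz)
{
    assert(sclk_hz > 0);
    return sclk_hz / 16u;
}

/* Maximum SPORT clock: SCLK / 2. */
uint32_t sport_max_hz(uint32_t sclk_hz)
{
    assert(sclk_hz > 0);
    return sclk_hz / 2u;
}
```

For an assumed SCLK of 133,000,000 Hz the three functions return 33,250,000, 8,312,500 and 66,500,000, matching the 33.25, 8.3125 and 66.5 figures listed on the slide.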


Real-Time Clock Features

• Used to implement real-time watch or “life counter”

– Time of day, alarm, stopwatch count-down, and elapsed time since last system reset

• Uses four counters - Seconds, Minutes, Hours, Days

• Equipped with two alarm features

– Daily and Day-And-Time

• Uses dedicated 32.768 kHz crystal to RTXI / RTXO

– Can be pre-scaled to 1 Hz to count in real-time seconds

• Uses dedicated power supply pins

– Independent of any reset

• Can take processor out of all low-power states


PPI – What is it?

• Parallel Peripheral Interface

– Programmable bus width (from 8 – 16 bits in 1-bit steps)

– Bidirectional (half-duplex) parallel interface

– Synchronous Interface

• Interface is driven by an external clock (“PPI_CLK”)

• Up to 66MHz rate (SCLK/2)

• Asynchronous to SCLK

– Includes three frame syncs to control the interface timing

– Applications

• Driving LCD Interface

• General Purpose Interface to outside world

• High speed data converters

• Video CODECs


TWO-WIRE INTERFACE (TWI)

• Fully compliant to the Philips I2C bus protocol

– See Philips I2C Bus Specification version 2.1

• 7-bit addressing

• 100 Kb/s (normal mode) and 400 Kb/s (fast mode) data rates

• General call address support

• Supports Master and Slave operation

– Separate receive and transmit FIFOs

• SCCB (Serial Camera Control Bus) support

– Only in Master mode

• Slave mode cannot be used because the TWI controller always issues an Acknowledge in slave mode


Controller Area Network (CAN)

• Adheres fully to CAN V2.0B standard

– Supports both standard (11-bit) and extended (29-bit) Identifiers

– Data Rates up to 1 Mbit/second

• 32 Configurable Mailboxes

– 8 dedicated transmitters and 8 dedicated receivers

– 16 configurable (transmit or receive)

• Dedicated Acceptance Mask for each Mailbox

• Data Filtering (first two bytes) can be used for Acceptance Filtering

• CAN wakeup from Hibernation (lowest static power consumption) Mode

• CAN Protocol Stacks

– Automotive: CAN drivers and protocol stacks through Vector CANtech

– Industrial: Leading third parties will provide a full Industrial suite for CANOpen, DeviceNet, etc.


ADSP-BF536/537 Family Ethernet MAC Features

ADSP-BF536/537 Ethernet MAC has advanced features beyond IEEE 802.3:

For improved performance:

• Automatic Checksum Computation for IP Header and Payload on RX Frames

• Programmable RX Data Alignment Mode for 32-bit Alignment

• Independent RX & TX DMA Channels with Delivery of Frame Status to Memory

• System Wakeup on Magic Packet or 4 User-Definable Wakeup Frame Filters

For lower overall system cost:

• No PHY XTAL required – Buffered XTAL output from processor feeds PHY

• Connection to either MII or RMII PHY

ADSP-BF536/537 enhances throughput and dataflow via these features:

• Enhanced DMA channels allow for processor core independence

• Direction Control to exploit SDRAM physics

• Four SDRAM rows can be ‘open’ at any given time

ADSP-BF536/537 overall networking bandwidth:

• Full 100Mbps wire speed on 1400-bit payload with an optimized networking stack

– UDP: ~44% processor core loading

– TCP/IP: ~75% processor core loading


ADSP-BF536/537 DMA Enhancements

• 4 additional DMA channels

– All 12 peripheral DMA channels can be assigned to any of the peripherals

• Provides MAC further control over the assigned DMA channels

– Can reload DMA registers if incorrect checksum is detected

• Two External Handshaking Memory DMA Controllers

– Good for asynchronous FIFOs or off-chip interface controllers between Blackfin memory and hardware buffers


Blackfin – Dynamic Power Management Increases Battery Life

Variable Frequency

Clock dividers (1x to 63x) enable low latency changes in system performance

Variable Voltage

On-Chip Voltage Regulator generates accurate voltage from 2.25 – 3.6V input

Core voltage programmable from 0.8V to 1.2V (50 mV increments)

Maximum 40usec latency for PLL to relock (Frequency or Voltage changes)

System Cost Reduction

[Figure: Power (mW) for Video Processing and Audio Processing workloads, comparing “Frequency Only” scaling with “Voltage & Frequency” scaling; the difference is labeled “Power Savings”. Labeled points: 600 MHz, 1.2V, 264 mW; 500 MHz, 1.2V; 500 MHz, 1.0V; 200 MHz, 1.2V, 156 mW; 200 MHz, 0.8V, 90 mW.]


Section 9

Instruction Set Overview


Instruction Set Description

• Full-featured flexible multifunction instructions

• Employs an algebraic-style syntax

• Optimized to allow access to many of the processor core resources within a single instruction

• Compiled C and C++ source code makes optimal use of instructions

• Format designed for ease of coding and readability

• Tuned to generate dense code (small memory size footprint)


Blackfin Assembly Language Features

• Multi-issue load/store modified-Harvard architecture supports

– Two 16-bit MAC or four 8-bit ALU + two load/store + two pointer updates per cycle.

• Unified 4G byte memory space

– All registers, I/O, and memory are mapped into a unified 4G byte memory space

– Providing a simplified programming model

• Microcontroller features:

– Arbitrary bit and bit-field manipulation, insertion, and extraction

– Integer operations on 8-, 16-, and 32-bit data-types

– Separate user and supervisor stack pointers

• Code density enhancements

– Intermixing of 16- and 32-bit instructions (no mode switching, no code segregation)

– Frequently used instructions are encoded in 16 bits.


Blackfin DSPs Code Density

Instruction Set Tuned for Compact Code

Multi-length Instructions

• 16, 32-bit Opcodes

• Limited Multi-Issue

Compact Call/Return

No Memory Alignment Restrictions for Code

Transparent Alignment HW

Blackfin Supports 16 and 32-bit Memory Systems

No Memory Alignment Restrictions: Maximum Code Density and Minimum System Memory Cost

[Figure: Instruction Formats. 16-bit OP, 32-bit OP, and 64-bit Multi-OP Packet instructions stored back to back in 16-bit wide memory (bits 15-0) and in 32-bit wide memory (bits 31-0).]


Blackfin Code Density Features

Free intermixing of 16/32-bit instructions - no mode switching, no code segregation

Frequently used instructions encoded as 16-bits

3-bit register fields

Conditional moves

Push/Pop multiple registers

Three operand instructions

Single condition bit and evaluation


Data Movement: LD, ST, 8, 16, 32 bits; Unsigned, Sign-extend; Register moves, P-D-DAG; Push, Pop, Push/Pop mult; CC to dreg, etc.

Addressing Modes: Auto incr, Auto decr; Pre-decr store on SP; Indirect; Indexed w/immed offset; Post-incr w/ nonunity stride; Byte addressable

Program Control: BRCC, UJUMP, Call, RETS, Loop Setup

Arithmetic: +, -, *, /, >>>, Negate; 2 and 3 operand instructions

Logical: AND, OR, XOR, NOT; BIT tst, set, tgl, clr; CC ops; <<, >>

Video: SAA, Byteops: Residual calc, Spatial Interpolation, Spatial Filter

Cache Control: Prefetch, Flush

Blackfin Dual Operational Model

A DSP with a RISC instruction set and an MMU, an event controller and a wide range of peripherals

Supervisor/user modes

Memory management

Wide range of peripherals

Event control


Blackfin MicroController Features

Arbitrary bit and bit-field manipulation, insertion and extraction

Integer operations on 8/16/32-bit data-types

Memory protection and separate user and supervisor stack pointers

Scratch SRAM for context switching

Population and leading digit counting

Byte addressing DAGs

Compact Code Density


Section 10

Blackfin Memory


ADSP-BF536/7 at a Glance

[Figure: ADSP-BF536/7 block diagram. Recoverable annotations:]

• Blackfin Processor core: 525 MHz; programmable frequency and voltage control

• L1 Instruction, L1 Data A, L1 Data B (64 bit): large enough to run application code; cache available if operations from SDRAM are desired

• 4 sub-banks allow 2 core accesses at same time as DMA access: 2 core fetches, or 1 fetch and 1 store

• 25 MHz XTAL into PLL VCO (1:64X); 25 MHz to Enet PHY, so no need for second XTAL

• DMA; Ext Bus w/ direction control at 131 MHz (bus widths 16 and 32 shown): makes best use of SDRAM; Max Bandwidth 266MB/sec

• Enet Data to SDRAM: rows are “open” in 4 SDRAM banks, which reduces page activation


Memory Hierarchy on the Blackfin

• As processor speeds increase (300 MHz – 1 GHz), it becomes increasingly difficult to have large memories running at full speed.

• The BF5xx uses a memory hierarchy with a primary goal of achieving memory performance similar to that of the fastest memory (i.e. L1) with an overall cost close to that of the least expensive memory (i.e. L2)

[Figure: memory hierarchy, from the core outward]

CORE (Registers)

L1 Memory: Internal, Smallest capacity, Single cycle access

L2 Memory: External, Larger capacity, Higher latency

L3 Memory: External, Largest capacity, Highest latency


Memory Architecture: The Basics

[Figure: memory architecture, from the core outward; DMA is shown alongside the memory levels]

Level                                  Location   Clock      Access time                       Capacity
Core                                   On-chip    >600MHz
L1 Instruction and L1 Data Memory      On-chip    >600MHz    Single cycle to access            10s of Kbytes
Unified L2                             On-chip    >300MHz    Several cycles to access          100s of Kbytes
Unified L3 (External Memory)           Off-chip   <133MHz    Several system cycles to access   100s of Mbytes


Configurable Memory

• Best system performance can be achieved when executing code or fetching data out of L1 memory

• Two methods can be used to fill L1 memory: Caching and Dynamic Downloading. Blackfin Processor supports both

– General Purpose processors have typically used the caching method, as they often have large programs residing in external memory and determinism is not as important.

– DSPs have typically used dynamic downloading, as they need direct control over which code runs in the fastest memory.

• Blackfin processors allow the programmer to choose one or both methods to optimize system performance.


What is Cache?

• In a hierarchical memory system, cache is the first level of memory reached once the address leaves the core (i.e. L1)

– If the instruction/data word (8, 16, 32, or 64 bits) that corresponds to the address is in the cache, there is a cache hit and the word is forwarded to the core from the cache.

– If the word that corresponds to the address is not in the cache, there is a cache miss. This causes a fetch of a fixed size block (which contains the requested word) from the main memory.

• The Blackfin allows the user to specify which regions (i.e. pages) of main memory are cacheable and which are not through the use of CPLBs (more on this later).

– If a page is cacheable, the block (i.e. cache line containing 32 bytes) is stored in the cache after the requested word is forwarded to the core

– If a page is non-cacheable, the requested word is simply forwarded to the core


Cache Hits and Misses

• A cache hit occurs when the address for an instruction fetch request from the core matches a valid entry in the cache.

• A cache hit is determined by comparing the upper 18 bits, and bits 11 and 10 of the instruction fetch address to the address tags of valid lines currently stored in a cache set.

• Only valid cache lines (i.e. cache lines with their valid bits set) are included in the address tag compare operation.

• When a cache hit occurs, the target 64-bit instruction word is sent to the instruction alignment unit where it is stored in one of two 64-bit instruction buffers.

• When a cache miss occurs, the instruction memory unit generates a cache line-fill access to retrieve the missing cache line from external memory to the core.
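The tag comparison described above can be sketched as an address split followed by a search of the four ways in a set. The tag composition (upper 18 bits plus bits 11 and 10) and the 32-byte line come from the slides. The remaining field positions are assumptions chosen to be consistent with a 16 KB bank of four 4 KB sub-banks and 4 ways: bits 13:12 select the sub-bank and bits 9:5 select one of 32 sets. All type and function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define ICACHE_WAYS 4

typedef struct {
    uint32_t tag;      /* 20 bits: address bits 31:14 followed by bits 11:10 */
    unsigned subbank;  /* assumed: address bits 13:12 (one of four 4 KB sub-banks) */
    unsigned set;      /* assumed: address bits 9:5 (one of 32 sets) */
    unsigned offset;   /* address bits 4:0 (byte within the 32-byte line) */
} icache_fields;

/* Split an instruction fetch address into its cache fields. */
icache_fields icache_split(uint32_t addr)
{
    icache_fields f;
    f.tag     = ((addr >> 14) << 2) | ((addr >> 10) & 0x3u);
    f.subbank = (addr >> 12) & 0x3u;
    f.set     = (addr >> 5) & 0x1Fu;
    f.offset  = addr & 0x1Fu;
    return f;
}

typedef struct {
    uint32_t tag;
    int valid;         /* only valid lines take part in the compare */
} icache_line;

/* Compare the address tag against the ways of one set.
 * Returns the matching way on a hit, or -1 on a miss. */
int icache_lookup(const icache_line ways[ICACHE_WAYS], uint32_t addr)
{
    assert(ways != 0);
    uint32_t tag = icache_split(addr).tag;
    for (int way = 0; way < ICACHE_WAYS; way++) {
        if (ways[way].valid && ways[way].tag == tag)
            return way;
    }
    return -1;
}
```

A line whose tag matches but whose valid bit is clear is skipped, so the lookup reports a miss and a line fill would follow.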


L1 Instruction Memory 16KB Configurable Bank

[Figure: the Instruction bank is built from four 4KB sub-banks, with two ports: DCB (DMA) and EAB (Cache Line Fill)]

16 KB cache

• 4-way set associative with arbitrary locking of ways and lines

• LRU replacement

• No DMA access

16 KB SRAM

• Four 4KB single-ported sub-banks

• Allows simultaneous core and DMA accesses to different banks


L1 Data Memory 16KB Configurable Bank

[Figure: the Data bank is built from four 4KB sub-banks, with core ports Data 0 and Data 1, plus DCB (DMA) and EAB (Cache Line Fill)]

Block is Multi-ported when:

• Accessing different sub-bank

OR

• Accessing one odd and one even access (Addr bit 2 different) within the same sub-bank.

• When Used as Cache

– Each bank is 2-way set-associative

– No DMA access

– Allows simultaneous dual DAG access

• When Used as SRAM

– Allows simultaneous dual DAG and DMA access
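The multi-port rule for the data bank can be written as a small predicate. This is a sketch under stated assumptions: both addresses are taken to lie in the same 16 KB bank, the 4 KB sub-bank is assumed to be selected by address bits 13:12, and the function name is made up for the example.

```c
#include <assert.h>
#include <stdint.h>

/* Return 1 if two simultaneous accesses can proceed in parallel under the
 * rule above, or 0 if they collide in the same sub-bank. */
int dual_access_ok(uint32_t addr_a, uint32_t addr_b)
{
    unsigned sub_a = (addr_a >> 12) & 0x3u;   /* assumed sub-bank select bits */
    unsigned sub_b = (addr_b >> 12) & 0x3u;

    if (sub_a != sub_b)
        return 1;                             /* different sub-banks */

    /* Same sub-bank: allowed only if address bit 2 differs
     * (one odd and one even 32-bit word). */
    int bit2_differs = (int)(((addr_a ^ addr_b) >> 2) & 0x1u);
    assert(bit2_differs == 0 || bit2_differs == 1);
    return bit2_differs;
}
```

Offsets 0x0000 and 0x1000 fall in different sub-banks, and 0x0000 and 0x0004 differ in bit 2, so both pairs are allowed; 0x0000 and 0x0008 share a sub-bank and bit 2, so that pair collides.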

Date post:	04-Jan-2016