Purpose • The intent of this module is to provide you with an overview of the
i.MX31 CPU complex.
Objectives
• Describe the ARM1136 core platform.
• Identify features of the ARM1136JF-S processor.
• Describe the two levels of caches in the CPU complex.
• Identify the purpose of the Smart Speed™ switch.
Content
• 15 pages
• 3 questions
Learning Time
• 25 minutes
Module Introduction
The intent of this module is to provide you with an overview of the CPU complex of the i.MX31 processor. You will learn about the ARM1136JF-S™ processor, the cache write strategy, and the Level 2 (L2) cache system. You will also learn about the Smart Speed™ switch and the Vector Floating-Point (VFP) co-processor. Unless specifically mentioned, all information in this module applies to both the i.MX31 and the i.MX31L.
ARM1136 Core Platform
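[Figure: ARM1136 core platform block diagram. Blocks shown: 1136 core with 16 Kbyte I cache and 16 Kbyte D cache, FPU, ETM, 4 Kbyte ETB, patch, interrupt controller, L2 cache controller with 128 Kbyte L2 cache, Smart Speed™ switch (multi AHB crossbar), primary AHB, alternate bus masters, AHB 1, 2, 3, peripheral I/F 1, peripheral I/F 2, and memory controller.]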
Let’s start by looking at the Freescale ARM®1136 core platform. The CPU complex of the i.MX31 consists of the ARM1136JF-S processor, an L2 cache system, the Smart Speed switch, and an ARM11™ Vector Interrupt Controller (AVIC).
The multilevel cache system consists of a powerful L2 Cache Controller (L2CC) that has been optimized by ARM® to Freescale specifications with 128 Kbytes of unified L2 cache memory and an integrated L2 cache monitor. The L1 cache provides 16 Kbytes for instruction and 16 Kbytes for data.
The Smart Speed switch, otherwise known as the 6 × 5 Multi-Layer AHB Crossbar switch (MAX), allows up to five transactions to occur in parallel, giving performance equivalent to that of a bus running at up to 665 MHz.
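The 665 MHz figure is consistent with five parallel transfers on a bus clocked at 133 MHz. The short sketch below works through that arithmetic. The 133 MHz bus clock and the one-transfer-per-cycle peak are assumptions made for illustration and are not stated in the paragraph above.

```python
# Assumption: each crossbar layer runs at 133 MHz (not stated in the text above).
BUS_CLOCK_MHZ = 133
MAX_PARALLEL_TRANSFERS = 5  # from the text: up to five simultaneous transactions
BUS_WIDTH_BYTES = 4         # 32-bit data bus at each port


def equivalent_bus_mhz(clock_mhz, parallel):
    """Clock a single shared bus would need to match the parallel transfers."""
    return clock_mhz * parallel


def peak_bandwidth_mbytes_per_s(clock_mhz, parallel, width_bytes):
    """Idealized peak, assuming one transfer per cycle on every active layer."""
    return clock_mhz * parallel * width_bytes


print(equivalent_bus_mhz(BUS_CLOCK_MHZ, MAX_PARALLEL_TRANSFERS))
print(peak_bandwidth_mbytes_per_s(BUS_CLOCK_MHZ, MAX_PARALLEL_TRANSFERS, BUS_WIDTH_BYTES))
```

Real throughput depends on arbitration, wait states, and how often masters target different slaves, so these numbers are an upper bound only.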
The VFP11 Floating Point Unit (FPU) is an ARM-enhanced IEEE 754 numeric co-processor that can be used to support and enhance 3D graphics, gaming, high resolution audio, Java™ and other general-purpose applications.
ARM1136 Core Features
• High performance core platform
• ARM1136 core with:
– 8 stage pipeline
– 16 Kbyte instruction and 16 Kbyte data caches
– 64-bit data paths to memory offer increased bandwidth
– Jazelle hardware for Java acceleration
• Vector Floating Point Unit (VFP)
• Trace module with 4 Kbyte buffer for SW debug
• Freescale Smart Speed switch:
– 5 simultaneous 32-bit transfers increase performance
– Programmable priorities optimize system performance
• 128 Kbytes L2 cache for up to 30 percent increased system performance
– Freescale was the lead partner with ARM
• Enhanced hardware-assisted interrupts for faster response
• Flexible power management techniques
• Dynamic voltage frequency scaling modes:
– High speed: 532 MHz @ 1.45V
– Medium speed: 400 or 266 MHz @ 1.1V
– Idle speed: 133 MHz @ 1.1V
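To see why the lower-voltage modes matter, the sketch below compares the operating points listed above using the common first-order approximation that dynamic CMOS power scales with frequency times voltage squared. The approximation and the function name are assumptions for illustration. Leakage and the rest of the system are ignored, so the results are relative estimates, not device specifications.

```python
# Operating points taken from the list above: (mode, MHz, volts).
OPERATING_POINTS = [
    ("high", 532, 1.45),
    ("medium", 400, 1.1),
    ("medium", 266, 1.1),
    ("idle", 133, 1.1),
]


def relative_dynamic_power(freq_mhz, volts, ref_freq_mhz=532, ref_volts=1.45):
    """First-order estimate: dynamic power is proportional to f * V^2.

    Returns the power relative to the reference (high speed) point.
    """
    return (freq_mhz * volts ** 2) / (ref_freq_mhz * ref_volts ** 2)


for mode, mhz, volts in OPERATING_POINTS:
    ratio = relative_dynamic_power(mhz, volts)
    print(f"{mode:6s} {mhz:3d} MHz @ {volts:.2f} V -> {ratio:.2f}x of high-speed dynamic power")
```

Under this model, dropping the voltage from 1.45 V to 1.1 V saves more than the frequency reduction alone, because the voltage term is squared.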
Reference material for previous page
ARM1136JF-S Processor
Let’s look at the core of the ARM11 platform, which is the ARM1136JF-S processor. In this module, it is referred to as simply the “ARM11.”
The ARM11 incorporates an integer unit that implements the ARM V6 architecture. It supports the ARM and Thumb™ instruction sets, Jazelle™ technology to enable direct execution of Java byte codes, and a range of SIMD DSP instructions that operate on 16-bit or 8-bit data values in 32-bit registers.
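The idea of SIMD on packed values can be illustrated with a small model. The sketch below treats a 32-bit register as two independent 16-bit lanes and adds them lane by lane, in the spirit of the ARM V6 dual 16-bit add instructions. The function name `add16_lanes` is hypothetical, and status flags are not modeled.

```python
MASK16 = 0xFFFF


def add16_lanes(a, b):
    """Add two 32-bit values as a pair of independent 16-bit lanes.

    Each lane wraps modulo 2**16; a carry out of the low lane never
    reaches the high lane, unlike a plain 32-bit add.
    """
    lo = ((a & MASK16) + (b & MASK16)) & MASK16
    hi = (((a >> 16) & MASK16) + ((b >> 16) & MASK16)) & MASK16
    return (hi << 16) | lo


a, b = 0x0001FFFF, 0x00010001
print(hex(add16_lanes(a, b)))         # lane-wise add
print(hex((a + b) & 0xFFFFFFFF))      # ordinary 32-bit add, for comparison
```

One such operation processes two audio samples or pixel components at once, which is where the multimedia speedup comes from.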
ARM1136JF-S Features
• Synthesizable
• ARM V6 architecture:
– ARM, Thumb, Jazelle
– Mixed endian support
– Unaligned data support
– Physically addressed caches
– Media extensions
• High performance core:
– 8-stage pipeline
– Branch prediction
– Return stack
• VFP co-processor
• Fast Interrupt mode
• 16 Kbyte I- and D-caches
ARM1136JF-S: Key Benefits
• ARM V6 architecture
– ARM and Thumb instruction sets
– Jazelle technology enabling direct execution of Java byte codes
– A range of SIMD DSP instructions that operate on 16-bit or 8-bit data values in 32-bit registers
• Power and area efficient
• Synthesizable design
• Complete set of supporting system IP
• Backwards compatible with previous ARM processors
• Provides full virtual memory capabilities
• Physical address tagging for caches and Application Space Identifiers (ASIDs)
– Reduces overhead on context switches
– Reduces cache invalidation and refill
– Saves cycles and power
ARM V6 Benefits
Improved:
• CPU efficiency and performance
• Multimedia performance
• Real-time performance
• Data sharing with non-ARM execution units
• Application portability from non-ARM processors
• Unaligned and mixed endian support
The ARM1136 is the first processor implementation of the ARM V6 architecture. Let’s look at how this architecture improves some CPU and multimedia functionalities.
The ARM V6 architecture improves CPU efficiency and performance. It improves multimedia performance through media processing extensions, two times faster MPEG-4 encode/decode, and faster audio DSP than the ARM926. It improves real-time performance through faster exception and interrupt handling, vectored interrupt support, a reduced latency mode, and new stack and mode change instructions that make interrupt entry three times faster. The ARM V6 architecture also improves data sharing with non-ARM execution units, application portability from non-ARM processors, and unaligned and mixed endian support.
System Metrics
[Figure: ARM1136 Processor Block Diagram. Blocks shown: ARM11 core, prefetch unit, LSU, L1 instruction side cache controller with I cache, L1 data side cache controller with D cache, main TLB, VFP, external co-processor interface, debug (JTAG), ETM, and VIC, with instruction fetch, data read, data write, and peripheral ports.]
Let’s continue to examine the ARM11 processor by looking at a functional block diagram. The features include an integer unit, an eight-stage pipeline, branch prediction with a return stack, low interrupt latency, an external co-processor interface, co-processors 14 and 15, instruction and data MMUs (managed using micro TLB structures backed by a unified main TLB), and instruction and data caches (including a non-blocking D cache with Hit-Under-Miss). Note that the caches are virtually indexed and physically addressed, and there is a 64-bit interface to both caches.
Other features include a write buffer that can be bypassed, a high-speed Advanced Microcontroller Bus Architecture (AMBA) L2 interface that supports prioritized multiprocessor implementations, an AMBA bus interface (AHB-lite protocol), a floating-point co-processor, trace support, JTAG-based debug, and a Load Store Unit (LSU).
The ARM11 processor features an interrupt service to quickly determine the interrupt source and branch to the interrupt service routine. The ARM11 solution contains an Interrupt vector port, and a Vector Interrupt Controller (VIC).
Question
The ARM1136 is the first processor implementation of the ARM V6 architecture. What are some of the ARM V6 architecture improvements? Select all that apply and then click Done.
CPU efficiency and performance
Vectored interrupt support
Data sharing with non-ARM execution units
5-stage pipeline
Done
Consider this question concerning the ARM1136 processor.
Correct.
The V6 architecture improvements include CPU efficiency and performance; real-time performance, which includes vectored interrupt support; and data sharing with non-ARM execution units. The ARM11 processor has an eight-stage pipeline, not a five-stage pipeline.
ARM V6 Memory Model
[Figure: ARM V6 memory model. Blocks shown: ARM core (registers R0 to R15, instruction prefetch, load, store, CP15 configuration/control), address translation from virtual address to physical address, Level 1 caches, Level 2 cache, additional processor(s), SRAM and ROM on the ARM platform, and DRAM, SRAM, Flash, and ROM reached through the EMI on the SOC.]
• Level 1 cache memory fully defined in ARM V6
• Hierarchy and memory order support for Level 2 cache
Now let’s look at the V6 memory model to explore cache in greater detail. There are two levels of caches in the CPU complex. Level 1 (L1) consists of separate instruction and data caches, a write buffer, two micro TLBs backed by a main TLB, Application Space Identifiers (ASIDs), and memory system attributes. The Level 1 cache memory subsystem is fully defined in ARM V6, and ARM V6 also has hierarchy and memory order support for the Level 2 cache. The Level 2 cache is unified, and will therefore hold both instruction and data elements.
The cache is virtually indexed and physically addressed. Line length is fixed at eight words. The ARM1136 cache is four-way set-associative. A particular address may be stored in one of four locations within the cache. To check for a cache hit for a non-sequential access, address comparisons must be performed with four different tag values. To prevent this comparison from reducing the maximum core clock frequency, there is a minimum one cycle latency between the comparison matching and the writing of data to that cache line. This requires a small Write buffer to be implemented in the cache, to allow written words to be held until they can be written.
Cache-related Definitions
• Line: Smallest loadable unit of a cache that is always a block of contiguous words in memory.
• Tag: The portion of a memory address that is stored within the cache to identify the particular physical address located there.
• Set: The set of cache lines that can hold data from a particular memory location.
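• Way: One of the locations within a set that can hold a given cache line. The number of lines in each set is the number of “ways” in the cache (four for the ARM1136 L1 caches).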
• Index: The portion of the memory address that determines the set in which the cache line may be stored.
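The definitions above can be made concrete by splitting an address into tag, index, and byte offset for the L1 cache geometry described in this module: 16 Kbytes, four ways, and lines of eight words. The sketch assumes 4-byte words, and `split_address` is a name chosen for illustration.

```python
CACHE_BYTES = 16 * 1024   # 16 Kbyte L1 cache
WAYS = 4                  # four-way set-associative
LINE_BYTES = 8 * 4        # eight words per line, assuming 4-byte words

SETS = CACHE_BYTES // (WAYS * LINE_BYTES)   # number of sets
OFFSET_BITS = LINE_BYTES.bit_length() - 1   # bits selecting a byte in the line
INDEX_BITS = SETS.bit_length() - 1          # bits selecting the set


def split_address(addr):
    """Return (tag, index, offset) for a 32-bit address."""
    offset = addr & (LINE_BYTES - 1)
    index = (addr >> OFFSET_BITS) & (SETS - 1)
    tag = addr >> (OFFSET_BITS + INDEX_BITS)
    return tag, index, offset


tag, index, offset = split_address(0x80001234)
print(SETS, OFFSET_BITS, INDEX_BITS)
print(hex(tag), index, offset)
```

With this geometry the index and offset together cover 12 address bits, and the tag is compared against the four ways of the selected set to detect a hit.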
Now, let’s look at cache write strategies. The write buffer is used to decouple memory writes. Data is placed in the buffer at core speed and is written to memory at bus speed in parallel. A FIFO holds a set of addresses and a set of data words and size information. A sequence of data words in the write buffer requires only the first address. The address of a new access may be compared against write buffer addresses. A separate FIFO is maintained for cache Write Back operations. This avoids complications associated with performing an external write while handling a write-through store operation.
With write-through, if the location is in the cache, the memory update is stored in the cache and in the write buffer, which performs the write so that the main processor does not have to slow down to main memory speed.
With Write Back, if the location is in the cache, only the cache is updated and the “dirty” bit is set to show that the cache line must be written back to main memory before the line is reused.
Please note that if the data location is not contained within the cache, the data will be written directly to memory. The write buffer will be used if the region is bufferable or cacheable.
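The difference between the two strategies can be sketched with a toy model. This is a simplified illustration, not the ARM1136 implementation: the class and method names are hypothetical, each cache line holds a single value, and one write buffer stands in for both the store FIFO and the separate Write Back FIFO.

```python
class ToyCache:
    """Toy single-level cache with a write buffer in front of memory."""

    def __init__(self, policy):
        assert policy in ("WT", "WB")
        self.policy = policy
        self.lines = {}          # address -> [value, dirty flag]
        self.write_buffer = []   # pending (address, value) writes
        self.memory = {}         # external memory contents

    def fill(self, addr):
        """Bring a location into the cache as a clean line."""
        self.lines[addr] = [self.memory.get(addr, 0), False]

    def write(self, addr, value):
        if addr in self.lines:
            self.lines[addr][0] = value
            if self.policy == "WT":
                # Write Through: memory is also updated via the write buffer.
                self.write_buffer.append((addr, value))
            else:
                # Write Back: only the cache changes; mark the line dirty.
                self.lines[addr][1] = True
        else:
            # Miss: the write goes to memory (buffered here).
            self.write_buffer.append((addr, value))

    def evict(self, addr):
        """Remove a line; a dirty line must be written back first."""
        value, dirty = self.lines.pop(addr)
        if dirty:
            self.write_buffer.append((addr, value))

    def drain(self):
        """Let the write buffer complete its writes at bus speed."""
        for addr, value in self.write_buffer:
            self.memory[addr] = value
        self.write_buffer.clear()


wt = ToyCache("WT")
wt.fill(0x100)
wt.write(0x100, 7)
wt.drain()
print("WT memory:", wt.memory)

wb = ToyCache("WB")
wb.fill(0x100)
wb.write(0x100, 7)
wb.drain()
print("WB memory before eviction:", wb.memory)
wb.evict(0x100)
wb.drain()
print("WB memory after eviction:", wb.memory)
```

In the write-back case memory stays stale until the dirty line is evicted, which trades bus traffic for the extra bookkeeping of the dirty bit.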
Cache Write Strategy
[Figure: CPU, cache, write buffer, and external memory (L2 memory system), showing the Write Through (WT) and Write Back (WB) write paths.]
• Write Through: If the location is within the cache, the cache is updated. The write is also sent to memory via the write buffer.
• Write Back: If the location is within the cache, only the cache is updated.
C  B  Access Mode
0  0  Non cacheable, non bufferable
0  1  Non cacheable, bufferable
1  0  WT, Write Through
1  1  WB, Write Back
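The C (cacheable) and B (bufferable) bit encoding from the access mode table can be captured as a small lookup. This is a sketch only: the function names are hypothetical, and the write buffer rule follows the earlier statement that the buffer is used when a region is bufferable or cacheable.

```python
# (C, B) -> access mode, as listed in the access mode table.
ACCESS_MODES = {
    (0, 0): "Non cacheable, non bufferable",
    (0, 1): "Non cacheable, bufferable",
    (1, 0): "WT, Write Through",
    (1, 1): "WB, Write Back",
}


def access_mode(c, b):
    """Look up the access mode for a region's C and B bits."""
    return ACCESS_MODES[(c, b)]


def uses_write_buffer(c, b):
    """The write buffer is used if the region is bufferable or cacheable."""
    return bool(c or b)


for bits in sorted(ACCESS_MODES):
    print(bits, access_mode(*bits), uses_write_buffer(*bits))
```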
ARM L2 Cache
• Improves the performance of computer systems when significant memory traffic is generated by the CPU
• Fastest memory access is via the L1 cache, followed closely by the L210; access is significantly slower to the main memory (L3)
• Is 128 Kbytes on the ARM1136 core platform
• Has a fixed line length of 32 bytes, 8 words
• Supports lockdown format C
• Has eight-way associativity, which can be directly mapped
Now, moving on to ARM L2 cache, this cache improves the performance of computer systems when significant memory traffic is generated by the CPU.
Memory access is fastest to the L1 cache, followed closely by the ARM L210™. Memory access is significantly slower to the main memory (L3).
The L2 cache on the ARM1136 core platform is 128 Kbytes.
The L2 cache has a fixed line length of 32 bytes, or 8 words.
The L2 cache supports lockdown format C with separate way locking mechanisms for data and instructions.
The L2 cache has eight-way associativity, which can be directly mapped, depending on the use of lockdown registers.
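The L2 figures above imply a specific geometry, and they show how locking ways can reduce the cache to direct mapped behavior. The sketch below works this out. The one-bit-per-way lock mask is an assumption made for illustration and is not the register layout of the ARM L210.

```python
L2_BYTES = 128 * 1024   # 128 Kbyte L2 cache
L2_WAYS = 8             # eight-way associative
L2_LINE_BYTES = 32      # fixed line length of 32 bytes (8 words)


def l2_sets():
    """Number of sets implied by the size, associativity, and line length."""
    return L2_BYTES // (L2_WAYS * L2_LINE_BYTES)


def allocatable_ways(lock_mask):
    """Ways still open for allocation, given a hypothetical lock bit per way."""
    return [way for way in range(L2_WAYS) if not (lock_mask >> way) & 1]


print(l2_sets())
print(allocatable_ways(0b00000000))   # nothing locked: all eight ways
print(allocatable_ways(0b11111110))   # seven ways locked: one way left
```

When only one way remains open, every new line for a given index lands in the same place, so allocation behaves as if the cache were direct mapped.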
ARM L2 Cache
• Data RAM is byte-writeable.
• L2 cache has support for:
– Write Through, read allocate
– Write Back, read allocate
– Write Back, read and write allocate
• Write allocate override option allows for allocation on write misses in the ARM L210.
• L2 cache performs critical word first refilling, with the option of refilling starting with word 0.
• A pseudo-random victim selection policy can be made deterministic with use of lockdown registers.
• L2 cache has increased performance by 25 to 75 percent, extended battery life, and reduced memory cost.
Continuing with the features of the L2 cache, data RAM is byte-writeable.
The L2 cache has support for the following cache modes: Write Through, read allocate; Write Back, read allocate; and Write Back, read and write allocate.
The write allocate override option allows for always having allocation on write misses in the ARM L210.
The L2 cache performs critical word first refilling, with the option of refilling starting with word 0.
The L2 cache has a pseudo-random victim selection policy, which can be made deterministic with the use of lockdown registers.
The ARM L210 L2CC and the accompanying 128 Kbytes of memory, combined with the ARM1136JF-S processor, can increase performance by 25 to 75 percent and extend battery life while reducing memory cost. By bringing more data on-chip and closer to the CPU, the ARM L210 L2CC helps remove the performance-limiting bandwidth constraints associated with off-chip memory.
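The critical word first refill option mentioned above can be pictured as a reordering of the eight words in a line. The sketch below assumes the refill wraps around the line after the requested word, which is typical of wrapping bursts but is not stated in this module. The function name is hypothetical.

```python
WORDS_PER_LINE = 8   # 32-byte line, 8 words


def refill_order(critical_word, critical_word_first=True):
    """Order in which the words of a line are fetched on a miss.

    With critical word first, the requested word arrives first and the
    rest follow in wrapping order (an assumption). Otherwise the refill
    starts at word 0.
    """
    if not critical_word_first:
        return list(range(WORDS_PER_LINE))
    return [(critical_word + i) % WORDS_PER_LINE for i in range(WORDS_PER_LINE)]


print(refill_order(5))
print(refill_order(5, critical_word_first=False))
```

Delivering the requested word first lets the processor resume before the rest of the line has arrived.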
Now let’s move on to the VFP co-processor. The VFP co-processor is an ARM-enhanced IEEE 754 numeric co-processor that supports and enhances 3D graphics, gaming, high resolution audio, Java and other general-purpose applications.
The VFP co-processor supports high-performance, short-vector operations in registers that can be addressed as short vectors. The VFP co-processor also features a long pipeline for floating-point MAC operations, with the stages decode, issue, execute (E1 through E8), and Write Back. Also featured is a separate divide and square root pipeline that supports load, store, and arithmetic operations in parallel with a divide or square root operation. This reduces the latency impact of these operations.
The VFP includes a separate load/store pipeline that enables load and store operations to be done in parallel with data processing operations.
For VFP instruction throughput, most single precision and double precision data processing operations have single-cycle execution. Loads are bandwidth balanced to sustain FMAC operations. Two single precision values or one double precision value can be transferred each cycle.
Many calculation functions are supported in hardware, including multiplication, absolute value, and square root.
VFP Co-processor
High-performance, short-vector operations
• Registers can be addressed as short vectors
Separate load/store pipeline
• Load/store operations done in parallel with data processing operations
Separate divide/square root pipeline
• Supports load/store and arithmetic operations in parallel with divide/square root operation
• Reduces latency impact of these operations
Long pipeline for floating point MAC operation
• Decode, Issue, Execute (E1), E2, E3, E4, E5, E6, E7, E8, Write Back
Calculation functions supported in hardware
• Multiply, add, multiply-add, subtract, multiply-subtract, negate, negate multiply, negate multiply-add, negate multiply-subtract, absolute value, compare, convert, divide and square root, conversions
Single cycle execution
• Loads are bandwidth balanced to sustain FMAC operations
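A long pipeline and single-cycle execution are compatible because throughput and latency are different measures. The sketch below uses the stage list above and an idealized model in which every operation is independent and one operation is issued per cycle with no stalls. Those conditions are assumptions, so the results are a best case rather than measured VFP timing.

```python
# Stage names taken from the pipeline list above.
FMAC_STAGES = ["Decode", "Issue"] + [f"E{i}" for i in range(1, 9)] + ["Write Back"]


def cycles_for_independent_ops(n_ops, n_stages=len(FMAC_STAGES)):
    """Idealized cycle count for n_ops back-to-back independent operations.

    The first result appears after n_stages cycles; each later result
    follows one cycle behind the previous one.
    """
    if n_ops == 0:
        return 0
    return n_stages + n_ops - 1


print(len(FMAC_STAGES))
print(cycles_for_independent_ops(1))
print(cycles_for_independent_ops(8))
```

This is why short-vector operations suit the VFP: a stream of independent element operations keeps the pipeline full and hides most of its depth.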
Smart Speed Switch
[Figure: Smart Speed switch block diagram. The ARM1136 core complex (ARM1136JF, L2 cache, AVIC) and the other bus masters connect through master ports 0 to 5 and slave ports 0 to 4. Blocks shown: eDMA, USB OTG/Hosts, IPU, MPEG4 Enc, GPU, EMI, ROMC/32K ROM, RAMC/16K RAM, AIPI #1, AIPI #2, SSI, SIM, SD/MMC x2, IIM, UART, UART x4, CSPI, Mem Stick x2, One Wire, AudioMux, Keypad, ATA, RTIC, ECT, I2C x3, SCC, IOMUXC, GPIO x3, PWM, EPIT x2, FIRI, GPT, Watchdog, RNGA, RTC, CCM/CGM/PLL, and SJC. Bus widths of 32 and 64 bits are marked.]
The purpose of the Smart Speed switch is to concurrently support up to five simultaneous connections between master devices 0 to 5 and slave devices 0 to 4. It supports 32-bit address bus width and 32-bit data bus width at all master and slave ports. The ARM11 platform implements a six master by five slave configuration. The Smart Speed switch supports two arbitration schemes that are independently programmable for each slave device: the simple fixed-priority algorithm and simple round-robin fairness algorithm.
The Smart Speed switch allows for concurrent transactions to occur from any master device to any slave device. It is possible for five master devices and all slave devices to be in use at the same time due to independent requests. The Smart Speed switch can gain control of the slave devices and prevent any masters from making any accesses to the slave devices. This is useful if the user wishes to turn off all the clocks and ensure that no bus activity will be interrupted. The Smart Speed switch can put each slave port in low power park mode so the slave will not dissipate any power when not being accessed by a master port.
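The two arbitration schemes can be sketched for a single slave port as follows. This is a conceptual model only, not the register-level behavior of the switch: the function names are hypothetical, and masters are identified by their port numbers 0 to 5.

```python
N_MASTERS = 6   # six master ports, numbered 0 to 5


def fixed_priority(requests, priority_order):
    """Grant the requesting master that appears first in priority_order."""
    for master in priority_order:
        if master in requests:
            return master
    return None


def round_robin(requests, last_granted, n_masters=N_MASTERS):
    """Grant the next requesting master after the one granted last."""
    for step in range(1, n_masters + 1):
        master = (last_granted + step) % n_masters
        if master in requests:
            return master
    return None


requests = {1, 3, 4}
print(fixed_priority(requests, [0, 1, 2, 3, 4, 5]))
print(round_robin(requests, last_granted=3))
print(round_robin(requests, last_granted=4))
```

Fixed priority always favors the same master when requests collide, while round robin rotates the grant so that no requesting master is starved. Because each slave port arbitrates independently, different slaves can use different schemes.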
Question
Which cache write strategy is illustrated in this graphic? Select the response that applies and click Done.
a. Write Back
b. Rewrite
c. Write Through
d. Read/Write
[Figure: CPU, cache, write buffer, and external memory (L2 memory system), showing the write path to be identified.]
Done
Let’s see if you can remember the cache write strategies.
Correct.
Cache write strategies consist of Write Through and Write Back. The cache write strategy shown here is Write Through. For Write Through, if the location is within the cache, the cache is updated and the write is also sent to memory via the write buffer. For Write Back, if the location is within the cache, only the cache is updated.
Question
Which of the following statements about the CPU complex of the i.MX31 are correct? Select all that apply and then click Done.
The Smart Speed switch concurrently supports up to 5 simultaneous connections between master devices and slave devices.
The L2CC and the accompanying 128 Kbytes of memory, combined with the ARM1136JF-S processor, do not increase performance.
The VFP co-processor supports and enhances 3D graphics, gaming, high resolution audio, Java and other general-purpose applications.
The L1 cache improves the performance of computer systems when significant memory traffic is generated by the CPU.
Done
Please select all the statements that accurately describe aspects of the i.MX31 CPU complex.
Correct.
The purpose of the Smart Speed switch is to concurrently support up to 5 simultaneous connections between master devices and slave devices. The ARM L210 L2CC and the accompanying 128 Kbytes of memory, combined with the ARM1136JF-S processor, can increase performance by 25 to 75 percent and extend battery life. The VFP co-processor supports and enhances 3D graphics, gaming, high resolution audio, Java and other general-purpose applications. The L2 cache improves the performance of computer systems when significant memory traffic is generated by the CPU.
Module Summary
• ARM1136 core platform
• ARM1136JF-S processor
• ARM V6 architecture
• Caches in the CPU complex
• Cache write strategies
• ARM Level 2 cache
• VFP Co-processor
• Smart Speed switch
In this module, you learned about the various components of the i.MX31 CPU complex. First, you learned about the ARM1136 core platform, ARM1136JF-S processor, and the benefits of ARM V6 architecture, which include increased CPU efficiency and performance and multimedia performance. Next, you learned about the two levels of caches in the CPU complex: L1 and L2. Specifically, you learned about cache write strategies and the ARM Level 2 cache. Finally, you learned about the VFP co-processor, which supports and enhances 3D graphics, gaming, high resolution audio, Java and other general-purpose applications, and the Smart Speed switch, which can concurrently support up to five simultaneous connections between master devices and slave devices.