© 2010 Renesas Electronics America Inc. All rights reserved.
131L: Optimizing RX Performance
John Breitenbach
President, Atlantex Corp.
14 October 2010
Version: 1.4
John Breitenbach
President, Atlantex Corp.
• Contract embedded systems design (est. 1998)
• Renesas Alliance Partner
• Author of RX QDG, porting guides, app notes, demo code
Embedded “cred”:
• 25+ years of embedded systems development
• Dick Cheney’s UPS
• Remote control cows
Geek “cred”:
• Patent #7,054,045 for Holographic HMI
• First computers:
– Atari 800
– Timex Sinclair 1000
– 16K!
Renesas Technology and Solution Portfolio
• Microcontrollers & Microprocessors: #1 market share worldwide*
• Analog and Power Devices: #1 market share in low-voltage MOSFET**
• ASIC, ASSP & Memory: advanced and proven technologies
Solutions for Innovation
* MCU: 31% revenue basis from Gartner "Semiconductor Applications Worldwide Annual Market Share: Database", 25 March 2010
** Power MOSFET: 17.1% on unit basis from Marketing Eye 2009.
Microcontroller and Microprocessor Line-up
Superscalar, MMU, Multimedia: up to 1200 DMIPS; 45, 65 & 90 nm process; video and audio processing on Linux; server, industrial & automotive
Up to 500 DMIPS; 150 & 90 nm process; 600 uA/MHz, 1.5 uA standby; medical, automotive & industrial
Legacy Cores: next-generation migration to RX
High Performance CPU, FPU, DSC
Embedded Security
Up to 10 DMIPS; 130 nm process; 350 uA/MHz, 1 uA standby; capacitive touch
Up to 25 DMIPS; 150 nm process; 190 uA/MHz, 0.3 uA standby; application-specific integration
Up to 25 DMIPS; 180, 90 nm process; 1 mA/MHz, 100 uA standby; crypto engine, hardware security
Up to 165 DMIPS; 90 nm process; 500 uA/MHz, 2.5 uA standby; Ethernet, CAN, USB, motor control, TFT display
High Performance CPU, Low Power
Ultra Low Power
General Purpose
RX
Ethernet, CAN, USB, UART, SPI, IIC
Innovation
The RX Solution
The Renesas Extreme RX architecture gives you best-in-class
performance, with a rich set of intelligent peripherals that enable
you to create innovative, interactive, connected devices.
Agenda
Presentation: RX High-Performance Architecture
• Core
• Instruction set
• Peripherals
Lab: Measure & Maximize RX Performance
• Basic benchmarking
• Improve a real application
Q & A
Key Takeaways
By the end of this session you will be able to:
Perform a basic benchmark of the RX
Profile critical sections with the RX on-chip debug
Maximize your code’s performance with smart peripherals
RX Architecture: Enabling High-Performance
RX Architecture … CPU Core and Pipeline

RX600 CISC CPU, 100 MHz core:
• 16 x 32-bit general purpose registers
• 9 x 32-bit control registers
• 32-bit Floating Point Unit
• 32 x 32 MAC to 48-bit or 80-bit result
• 32 x 32 DIV or MULT, 32-bit or 64-bit result
• Memory Protection Unit
• Interrupt control
• On-chip debug

Enhanced Harvard architecture:
• Instructions: 64-bit bus, typically from flash memory
• Operands (data): 32-bit bus, typically SRAM
• Write buffer: for slower memory; buffers writes only
• Pre-fetch queue (PFQ): holds 4 to 32 instructions, for slower memory

5-stage pipeline:
• F = fetch instruction
• D = decode instruction
• E = execute instruction
• M = read or write memory
• W = write back to register

[Pipeline timing diagram: on each clock tick a new instruction enters the pipeline while earlier ones advance through F, D, E, M, W.]

Achieves One Clock-Per-Instruction (CPI)
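The one-clock-per-instruction claim follows from pipeline overlap. The sketch below models an ideal 5-stage pipeline with no stalls and single-cycle stages (both assumptions; real code sees multi-cycle instructions and memory waits). The function names are illustrative only and are not part of any Renesas tool or library.

```c
#include <assert.h>
#include <stdint.h>

/* Number of pipeline stages named on the slide: F, D, E, M, W. */
enum { PIPELINE_STAGES = 5 };

/*
 * Cycles needed to retire `instructions` back-to-back instructions in an
 * ideal pipeline (assumption: no stalls, every stage takes one cycle).
 * The first instruction needs PIPELINE_STAGES cycles; every later one
 * retires one cycle after its predecessor.
 */
uint32_t ideal_pipeline_cycles(uint32_t instructions)
{
    assert(instructions > 0);
    return (uint32_t)PIPELINE_STAGES + (instructions - 1u);
}

/* Average cycles per instruction, scaled by 1000 to stay in integer math. */
uint32_t ideal_cpi_milli(uint32_t instructions)
{
    return (ideal_pipeline_cycles(instructions) * 1000u) / instructions;
}
```

For a single instruction the model gives 5 cycles; for 1000 instructions it gives 1004 cycles, so the average approaches one cycle per instruction as the run gets longer.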
RX Architecture … Memory Interface
• 100 MHz Flash and SRAM means zero wait-state code and data access
• PFQ minimizes stalls from slower memory, such as external memory
• CPU is bus master of Internal Main Bus 1; Internal Main Bus 2 connects to peripherals…

Block diagram (RX600 MCU):
• RX600 CPU, 100 MHz: pipeline, PFQ, buffer; 64-bit instruction bus, 32-bit data bus; bus master of Internal Main Bus 1
• Flash memory: 100 MHz access, 64 bits
• SRAM: 100 MHz access, 64 bits
• Internal Main Bus 1: 32 bits
• External Bus Controller (BSC): 32 bits, to the external bus pins for the CPU
• Bus bridge to peripherals
• Bus matrix: allows the CPU to concurrently fetch instructions or access data from any of 3 sources: flash memory, SRAM, Internal Main Bus 1
RX Architecture … System Interface
Multiple peripheral busses to spread bandwidth loading

Block diagram (RX600 MCU):
• RX600 CPU, 100 MHz: pipeline, PFQ, buffer; 64-bit instruction bus, 32-bit data bus; bus master of Internal Main Bus 1
• Flash memory and SRAM: 100 MHz access, 64 bits each
• Bus matrix: allows the CPU to concurrently fetch instructions or access data from any of 3 sources: flash memory, SRAM, Internal Main Bus 1
• Internal Main Bus 1: 32 bits
• Internal Main Bus 2: 32 bits; bus masters: DTC, DMAC, Ethernet DMAC
• Bus bridges to the peripheral groups (CNTL blocks on the diagram):
– Communication (USB, CAN, SCI, SPI, I2C)
– Timers (MTU, TPU, TMR, CMT)
– Analog (DAC, ADC, PGA), GPIO
– System control (DMA, E2P, ICU, LVD, RTC, WDG, CLKS), Ethernet MAC
• External Bus Controller (BSC): 32 bits; external bus pins for the CPU; EXDMA (external bus master); shown connected to two external devices
RX Floating Point Unit
Question: Who Said It?
“Microprocessor manufacturers unfortunately seem to feel that floating-point math is not very important in embedded systems.
This has not been my experience.”
Jean Labrosse, President, Micrium; author of the μC/OS operating system
RX Floating Point Unit
• IEEE 754 single-precision, 32-bit data format
• Add, subtract, multiply, divide and integer conversion directly from CPU registers
• IEEE 754 exceptions

Floating Point Instructions:
• FADD - Floating-point ADD
• FCMP - Floating-point COMPare
• FDIV - Floating-point DIVide
• FMUL - Floating-point MULtiply
• FSUB - Floating-point SUBtract
• FTOI - Float TO Integer
• ITOF - Integer TO Floating-point
• ROUND - ROUND floating-point to integer

[Diagram: the Floating Point Unit operates on the general purpose registers R0 (SP) through R15 and on the memory map (value shown: 410410000).]

Example:
MOV.L R3,R4
FMUL #4104100000H,R4

8-tap FIR: 0.949 µs
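To make the FIR figure concrete, here is a minimal sketch of the kind of loop an 8-tap FIR measurement would exercise: one floating-point multiply and one add per tap. The function name `fir8` and the newest-first sample layout are assumptions for illustration; the slide does not show the code that was timed. If the figure was taken at the 100 MHz core clock quoted in this deck (also an assumption), 0.949 µs corresponds to roughly 95 CPU cycles.

```c
#include <assert.h>
#include <stddef.h>

#define FIR_TAPS 8

/*
 * Direct-form FIR: y[n] = sum over k of coeff[k] * x[n-k].
 * `history` holds the most recent FIR_TAPS samples, newest first
 * (a layout chosen for this sketch, not taken from the slide).
 */
float fir8(const float coeff[FIR_TAPS], const float history[FIR_TAPS])
{
    assert(coeff != NULL && history != NULL);
    float acc = 0.0f;
    for (size_t k = 0; k < FIR_TAPS; ++k) {
        acc += coeff[k] * history[k];   /* one multiply and one add per tap */
    }
    return acc;
}
```

With all eight coefficients set to 0.125 the filter is a moving average, so a constant input of 4.0 produces an output of 4.0.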
Under the Hood: FPU Code Generation
Sample floating point operation: temperature conversion

float Degrees_C, Degrees_F ;

Degrees_C = 22.1 ;
Degrees_F = (Degrees_C * 9.0/5.0) + 32.0;

• Variables: single precision floats
• Floating point constants
• Floating point operations
Under the Hood: FPU Code Generation
Code emitted by compiler

float Degrees_C, Degrees_F ;

Degrees_C = 22.1 ;
    MOV.L #41B0CCCDH,R3

Degrees_F = (Degrees_C * 9.0/5.0) + 32.0;
    MOV.L R3,R4
    FMUL #41100000H,R4
    FDIV #40A00000H,R4
    FADD #42000000H,R4
Constants stored in IEEE 754 format
RX floating point instructions…
…operate directly on registers & memory
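The immediates in the listing are the IEEE 754 single-precision encodings of the source constants. The sketch below is a host-side check, not RX code: `float_bits` and `celsius_to_fahrenheit` are names invented for this example, and it assumes the host `float` is a 32-bit IEEE 754 type.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Return the raw IEEE 754 single-precision bit pattern of `value`. */
uint32_t float_bits(float value)
{
    uint32_t bits;
    /* Assumption: float is a 32-bit IEEE 754 type on this host. */
    assert(sizeof bits == sizeof value);
    memcpy(&bits, &value, sizeof bits);   /* well-defined way to reinterpret bits */
    return bits;
}

/* Same arithmetic as the slide, kept in single precision with f suffixes. */
float celsius_to_fahrenheit(float degrees_c)
{
    return (degrees_c * 9.0f / 5.0f) + 32.0f;
}
```

Passing 22.1f, 9.0f, 5.0f and 32.0f to `float_bits` yields 0x41B0CCCD, 0x41100000, 0x40A00000 and 0x42000000, matching the operands of MOV.L, FMUL, FDIV and FADD above.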
Under the Hood: FPU Code Generation
Comparison: 1 FPU instruction = 100+ SW instructions

With the FPU:

Degrees_C = 22.1 ;
    MOV.L #41B0CCCDH,R3

Degrees_F = (Degrees_C * 9.0/5.0) + 32.0;
    MOV.L R3,R4
    FMUL #41100000H,R4
    FDIV #40A00000H,R4
    FADD #42000000H,R4

Software floating-point divide routine:

_COM_DIVf
    MOV.L R1,R15
    XOR R2,R15
    SHLL #1,R1
    MOV.L R1,R3
    SHLR #24,R3
    SHLL #8,R1
    SHLL #1,R2
    MOV.L R2,R4
    SHLR #24,R4
    SHLL #8,R2
    CMP #0FFH,R3
    BEQ.W exception1
    CMP #0FFH,R4
    BEQ.W exception2
    CMP #0H,R3
    BEQ.W exception3
exception_return3
    CMP #0H,R4
    BEQ.B exception4
exception_return4
    SUB R4,R3
    ADD #7FH,R3,R3
    OR #1H,R1
    ROTR #1,R1
    RORC R2
    SHLR #1,R1
    MOV.L #0H,R5
    MOV.L #0H,R4
    MOV.L #1AH,R14
    BRA.S div_loop_entry
div_loop
    SHLL #1,R5
    BTST #0,R4
    BNE.S div_loop_entry
    BSET #0,R5
div_loop_entry
    SHLL #1,R1
    ROLC R4
    BTST #1,R4
    BNE.S div_1
    SUB R2,R1
    BC.B div_2
    XOR #01H,R4
    BRA.S div_2
div_1
    ADD R2,R1
    BNC.B div_2
    XOR #01H,R4
div_2
    SUB #1H,R14
    BNE.B div_loop
div_loop_exit
    AND #1H,R4
    BEQ.S make_result
    ADD R2,R1
make_result
    MOV.L R5,R2
    SHLL #1,R2
    XOR #01H,R4
    OR R4,R2
    SHLL #6,R2
    CMP #0H,R1
    BEQ.S end_calc_sticky
    OR #20H,R2
end_calc_sticky
    MOV.L R2,R4
    BTST #31,R4
    BNE.S end_normalize
    SHLL #1,R2
    SUB #1H,R3
end_normalize
    CMP #0FFH,R3
    BLT.B 0FFFF885EH
    BRA.W return_inf
    CMP #-17H,R3
    BGE.B 0FFFF8866H
    BRA.W return_zero
    CMP #0H,R3
    BGT.B end_denormal
denormalize_loop
    SHLR #1,R2
    BNC.B next_loop
    BSET #0,R2
next_loop
    CMP #0H,R3
    BGE.B round_denormal
    ADD #1H,R3
    BRA.B denormalize_loop
round_denormal
    BTST #7,R2
    BEQ.B end_round_d
    MOV.L R2,R4
    AND #017FH,R4
    BEQ.S end_round_d
    ADD #0100H,R2,R2
    BPZ.B end_round_d
    ADD #1H,R3
end_round_d
    BRA.B end_round
end_denormal
    BTST #7,R2
    BEQ.B end_round
    MOV.L R2,R4
    AND #017FH,R4
    BEQ.B end_round
    ADD #0100H,R2,R2
    BNC.B end_round
round_carry
    RORC R2
    ADD #1H,R3
end_round
    SHLL #1,R2
    SHLR #8,R2
    SHLL #24,R3
    OR R2,R3
    MOV.L R3,R1
    SHLL #1,R15
    RORC R1
    RTS
RX Instruction Set
Question:
What programming language do you use?
• If it doesn’t run Python, I won’t use it
• C/C++
• Real men program in assembler
• No, real men use a hex editor & opcodes
• I once programmed a database using only 1’s & 0’s
• You had 0’s !?!?
Another Question…
How big is your code?
• <10K
• 10K - 64K
• 64K - 128K
• Under one megabyte
• Under 4 megabytes
• I write code for Microsoft… and own stock in Seagate!
Instruction Set
Target: improve code density, support for high-level languages
• Analyze code from real-world customer applications
• Adopt variable byte-length instructions
• Assign most-used instructions to short instruction codes
• Add addressing modes
• Benchmark, refine, benchmark, refine…
Result: 30% code size reduction

[Chart: relative code size (Cortex-M3 based MCU = 1.0) of RX600 vs. a Cortex-M3 based MCU for data communication, motor control, data conversion, real-time control and system control workloads. RX600 bars are labeled 28%, 19%, 17%, 25% and 25% less.]
Note: Internal benchmark test, your results may vary
RX Smart Peripherals
Why Smart Peripherals?
“I don’t care what it is,
when it has an LCD screen, it makes it better.”
Kevin Rose, Diggnation
Peripherals
Targets: offload the CPU, ease migration, reduce power
• Cherry-pick from extensive portfolio
• Add intelligent DMAC, Data Transfer Controller, new timers, Ethernet, USB, CAN
Result: only 5% CPU loading for 60 Hz refresh of static image
DMAC vs Data Transfer Control (DTC)
Similarities:
• Registers (SAR, DAR, transfer & block count)
• Bytes, words, long words
• Auto increment/decrement of SAR/DAR
• Normal/repeat/block modes
• Interrupt generation
Differences:
• DMA is faster: 1 transfer/cycle
• DMA has dedicated registers for each channel
• DMA channels are limited
• DTC has many virtual channels
• DTC channels can be chained
• DTC is much more flexible
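The shared register model (source address, destination address, count, address stepping) and DTC-style chaining can be illustrated with a small host-side simulation. This is a sketch only: `transfer_desc` and `run_chain` are hypothetical names, the struct is not the RX register or DTC vector layout, and the transfers here are bytes copied by software rather than by hardware.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical software model of one transfer descriptor (not the RX layout). */
typedef struct transfer_desc {
    const uint8_t *sar;          /* source address */
    uint8_t *dar;                /* destination address */
    size_t count;                /* byte transfers remaining */
    int sar_step;                /* +1, 0 or -1 applied to SAR after each transfer */
    int dar_step;                /* +1, 0 or -1 applied to DAR after each transfer */
    struct transfer_desc *next;  /* next descriptor in the chain, or NULL */
} transfer_desc;

/* Run a descriptor chain to completion and return the total bytes moved. */
size_t run_chain(transfer_desc *d)
{
    size_t moved = 0;
    while (d != NULL) {
        assert(d->sar != NULL && d->dar != NULL);
        while (d->count > 0) {
            *d->dar = *d->sar;       /* one transfer */
            d->sar += d->sar_step;   /* auto increment/decrement, or fixed if 0 */
            d->dar += d->dar_step;
            d->count--;
            moved++;
        }
        d = d->next;                 /* chaining: continue with the next descriptor */
    }
    return moved;
}
```

A step of 0 models a fixed address such as a peripheral data register, while a step of +1 models a buffer in RAM. Chaining two descriptors lets one trigger first copy a block and then log a register value without CPU involvement between the two.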
Lab Technique: Measuring Performance
Measuring System Performance
Performance counters:
• Hardware-supported high-resolution timer
• Counts execution cycles & number of passes for two sections of user code
• No effect on your code
• Two 32-bit timers or one 64-bit timer
• Triggered by complex events
• Selectively record all execution cycles, interrupts, exceptions
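Since the counters report total execution cycles and the number of passes, two small conversions turn a reading into something comparable across runs. The helper names below are made up for this sketch and are not part of the debugger; the clock frequency is whatever the target actually runs at (this deck quotes 100 MHz for the RX600 core).

```c
#include <assert.h>
#include <stdint.h>

/* Average cycles per pass through a measured section of code. */
uint32_t cycles_per_pass(uint64_t total_cycles, uint32_t passes)
{
    assert(passes > 0);
    return (uint32_t)(total_cycles / passes);
}

/* Convert a cycle count to nanoseconds for a given core clock in MHz. */
uint64_t cycles_to_ns(uint64_t cycles, uint32_t clock_mhz)
{
    assert(clock_mhz > 0);
    return (cycles * 1000u) / clock_mhz;   /* 1 cycle = 1000 / MHz nanoseconds */
}
```

For example, a reading of 95,000 cycles over 1,000 passes is 95 cycles per pass, which is 950 ns at 100 MHz.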
Performance Analysis Setup
RX Performance Labs: Goals
RX core benchmarking
• Dhrystone
Optimize a real application
• Benchmark application: polled mode
• Use timers & interrupts
• Use DMAC to read & buffer ADC readings
“90% CPU loading doubles the schedule, 95% triples it.”
Alan M. Davis, “201 Principles of Software Development”
Start your lab!
Start the Lab
Keep your die turned to the section of the lab you are on. (Instructions are provided in the lab handout.)
Please refer to the Lab Handout and let’s get started!
Checking Progress
We are using the die to keep track of where everyone is in the lab. Make sure to update it as you change sections.
When done with the lab, your die will have the 6 pointing up as shown here.
RX Performance Labs: Review
RX core benchmarking
• Dhrystone: 1.65 DMIPS/MHz
• Performance scales linearly to 100 MHz
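The two numbers combine by simple multiplication: 1.65 DMIPS/MHz at 100 MHz gives the 165 DMIPS quoted in the line-up. The sketch below also shows the usual conversion from a raw Dhrystone result, which divides Dhrystones per second by 1757 (the score of the VAX 11/780 reference machine). Function names are illustrative, not from the lab code.

```c
#include <assert.h>

/* Dhrystones per second of the VAX 11/780 reference machine (defines 1 DMIPS). */
#define VAX_DHRYSTONES_PER_SEC 1757.0

/* Convert a raw Dhrystone result to DMIPS. */
double dmips_from_dhrystones(double dhrystones_per_sec)
{
    assert(dhrystones_per_sec >= 0.0);
    return dhrystones_per_sec / VAX_DHRYSTONES_PER_SEC;
}

/* Scale a per-MHz rating to a given clock, assuming linear scaling. */
double dmips_at_clock(double dmips_per_mhz, double clock_mhz)
{
    assert(clock_mhz >= 0.0);
    return dmips_per_mhz * clock_mhz;
}
```

Linear scaling holds here because, per the memory interface slide, flash and SRAM run with zero wait states up to 100 MHz.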
RX Performance Labs: Sample Application
• Collect samples from ADC
• Scale reading to J thermocouple range, convert °C
Version 1: Polled
• No optimization, do everything as fast as possible
Version 2: Interrupts
• Post-process while ADC is sampling
• One interrupt/sample
Version 3: DMAC
• DMAC buffers 5,000 readings
• One interrupt/buffer
RX Performance Labs: Review
Reduce CPU overhead with smart peripherals
• Benchmark: 100% of CPU @ 330 kHz
• Timer & interrupts: 97% of CPU @ 500 kHz
• Plus DMAC: 71% of CPU @ 500 kHz
Take advantage of smart peripherals!
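One way to compare the three versions is CPU cycles spent per sample: load fraction times core clock divided by sample rate. The sketch below assumes the lab ran at the 100 MHz core clock quoted elsewhere in this deck; the slide itself does not state the clock, and `cycles_per_sample` is a name invented for this example.

```c
#include <assert.h>
#include <stdint.h>

/*
 * CPU cycles spent per sample = (load / 100) * core clock / sample rate.
 * Integer math, truncated toward zero.
 */
uint32_t cycles_per_sample(uint32_t cpu_load_percent,
                           uint32_t core_hz,
                           uint32_t sample_hz)
{
    assert(sample_hz > 0 && cpu_load_percent <= 100);
    uint64_t busy_cycles_x100 = (uint64_t)cpu_load_percent * core_hz;
    return (uint32_t)(busy_cycles_x100 / (100u * (uint64_t)sample_hz));
}
```

Under that 100 MHz assumption the polled version costs about 303 cycles per sample, the interrupt version about 194, and the DMAC version about 142.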
Questions?
Innovation
Feedback Form
Please fill out the feedback form! If you do not have one, please raise your hand
Thank You!