ENHANCEMENTS TO RISC ARCHITECTURE FOR
PORTABLE EMBEDDED SYSTEMS
A THESIS
Submitted by
B. GOVINDARAJALU
Under the guidance of
Dr. K.M. MEHATA
in partial fulfilment for the award of the degree of
DOCTOR OF PHILOSOPHY
in
COMPUTER SCIENCE AND ENGINEERING
B.S.ABDUR RAHMAN UNIVERSITY
(B.S. ABDUR RAHMAN INSTITUTE OF SCIENCE & TECHNOLOGY) (Estd. u/s 3 of the UGC Act. 1956)
www.bsauniv.ac.in
OCTOBER 2014
B.S.ABDUR RAHMAN UNIVERSITY (B.S. ABDUR RAHMAN INSTITUTE OF SCIENCE & TECHNOLOGY)
(Estd. u/s 3 of the UGC Act. 1956) www.bsauniv.ac.in
BONAFIDE CERTIFICATE
Certified that this thesis ENHANCEMENTS TO RISC
ARCHITECTURE FOR PORTABLE EMBEDDED SYSTEMS is the bonafide
work of B. GOVINDARAJALU (RRN: 1186221), who carried out the thesis
work under my supervision. Certified further that, to the best of my
knowledge, the work reported herein does not form part of any other thesis or
dissertation on the basis of which a degree or award was conferred on an
earlier occasion on this or any other candidate.
SIGNATURE
Dr. K.M. MEHATA
RESEARCH SUPERVISOR
Professor & Dean
Department of CSE
B.S. Abdur Rahman University
Vandalur, Chennai – 600 048

SIGNATURE
Dr. SHARMILA SANKAR
HEAD OF THE DEPARTMENT
Professor & Head
Department of CSE
B.S. Abdur Rahman University
Vandalur, Chennai – 600 048
ACKNOWLEDGEMENT
This thesis would not have been possible without the help and
support of many people. First, I would like to thank my research supervisor
Dr. K.M. Mehata, Professor and Dean, School of Computer, Information and
Mathematical Sciences of B.S. Abdur Rahman University, for all his inspiring
ideas and support. It is he who motivated me to join the PhD programme
when I approached him four years ago with a lot of doubts in my mind.
I thank Dr. Sharmila Sankar, Professor and Head, Dr. Angelina
Geetha, Professor, and other staff members of the department of Computer
Science and Engineering, B.S. Abdur Rahman University, for their support
and encouragement. I thank the Doctoral Committee Members,
Dr. V. Sankaranarayanan, Professor of Eminence, B.S. Abdur Rahman
University and Dr. Ranjani Parthasarathi, Professor, Information and
Communication Engineering, Anna University, for their review comments. I
express my gratitude to the Chancellor, Vice Chancellor, Pro-Vice
Chancellor and Registrar for giving me an opportunity to do research at
B.S. Abdur Rahman University.
I thank the managements of four engineering colleges - Rajalakshmi
Engineering College, Sri Ramanujar Engineering College, Dhanalakshmi
College of Engineering, and Sri Venkateswara College of Engineering -
where I have served during my research work, for supporting me. I thank
the following professionals whom I consulted for my requirements: Raju
Sambandam, Ilanthirayan Singaram, Ramkumar, and Shyamala
Dharmar. I thank my ex-colleagues Kohila, Prof. Ramakrishnan and
Prof. Sivakumar and ex-students Haripriya, Vinodhini, Nandhini and
Abinaya who have helped me at various stages of my research work.
The persons who not only consistently supported me but also suffered
the most during my research work are my family members - wife Bhuvaneswari,
son Krishnakumar, daughter-in-law Manjula, daughter Padma, and son-in-law
G.K. Ananth. Finally, I want to mention my grandchildren - Vihaan,
Sahahsra and Anvitha - who provided both distraction and relaxation.
B. GOVINDARAJALU
ABSTRACT
The proposed research work focuses on developing a flexible
Instruction Set Architecture (ISA) by modifications to the Reduced Instruction
Set Computing (RISC) architecture to minimise code memory in portable
embedded systems. When the RISC architecture was introduced in the
1980s, the program memory was external to the processor. As the
present-day embedded system is available as a single System-on-a-Chip
(SoC), there is a need for an ISA that contributes overall benefits to the
SoC in terms of chip space, power consumption and cost. Though there are many code
reduction methods, there are very few ISA level techniques that aim at
modern embedded SoCs in which the code memory occupies a large part of
the silicon area due to the Fixed Instruction Encoding (FIE) feature of RISC.
This thesis proposes replacing FIE with Hybrid Instruction Encoding (HIE) for
MIPS32-like RISC processors to support multiple instruction sizes, and
hybrid lengths for the offset and immediate fields so as to reduce wastage of
memory. The proposed solution eliminates additional code compression
efforts on the part of the system developers. Suitable modifications to the
MIPS32 ISA have been developed and evaluated using the MiBench and
MediaBench suites. A code analysis cum conversion suite has been
developed as part of the research work. Further, a set of compound and
composite instructions has been introduced, enhancing the code size
reduction. In addition to HIE, the research work also explores supporting
the Register Memory Architecture with new instructions. The results show
code memory reductions ranging from 22% to 44%, which is significant in
battery-operated portable applications such as wearable devices and
implantable medical devices, for which processor performance is not critical.
The final part of the thesis focuses on the adoption of Heads-and-Tails
format to take advantage of the high code density of the hybrid-length
instructions while enabling deeply pipelined or superscalar processors.
TABLE OF CONTENTS
CHAPTER NO. TITLE PAGE NO
ACKNOWLEDGEMENT iii
ABSTRACT iv
LIST OF TABLES xii
LIST OF FIGURES xiv
LIST OF SYMBOLS AND ABBREVIATIONS xix
1. PORTABLE EMBEDDED SYSTEMS AND
CODE SIZE 1
1.1 EMBEDDED SYSTEMS 2
1.1.1 Characteristics of Embedded Systems 5
1.1.2 Architecture of Embedded Systems 8
1.1.3 Embedded Software 9
1.2 SoC ARCHITECTURE 13
1.3 BOPES DESIGN PHILOSOPHY AND
PROCESSOR TECHNOLOGY 15
1.3.1 IC Technology 18
1.4 PROCESSOR ARCHITECTURES AND
INSTRUCTION ENCODING 19
1.4.1 CISC Vs RISC 19
1.4.2 Load - Store Architecture (LSA) 23
1.4.3 Fixed, Variable and Hybrid length encoding 25
1.5 MOTIVATION 27
1.6 RESEARCH OBJECTIVES 31
1.7 CONTRIBUTIONS 32
1.8 THESIS OVERVIEW 33
2 BACKGROUND AND RELATED WORK 35
2.1 DESIGN FOR LOW POWER CONSUMPTION 35
2.2 INSTRUCTION SET ARCHITECTURE (ISA) 38
2.2.1 Instruction Types and Operations 39
2.2.2 Operation Codes 44
2.2.3 Addressing modes 45
2.2.4 Data types 49
2.2.5 ISA Models 49
2.3 PROCESSOR PERFORMANCE AND
ADVANCED ARCHITECTURES 50
2.3.1 Instruction Pipelining 53
2.3.2 RISC Instructions and Pipelining 54
2.3.3 Superscalar processor 60
2.3.4 Very Long Instruction Word (VLIW)
Processor 60
2.3.5 Cache Memory 63
2.3.6 Virtual Memory 65
2.3.7 Multicore CPU 66
2.4 EMBEDDED PROCESSORS 68
2.5 EMBEDDED SYSTEM ARCHITECTURES 71
2.5.1 Digital Signal Processor 73
2.5.2 Media Extensions 74
2.5.3 Embedded Multiprocessors 75
2.6 MIPS32 Vs OTHER RISC PROCESSORS 77
2.6.1 CISC and RISC Convergence 78
2.7 MIPS32 INSTRUCTIONS AND
CODE WASTAGE 80
2.8 CODE SIZE REDUCTION IN EMBEDDED
SYSTEMS 82
2.8.1 Code Compression 83
2.8.2 Dictionary-based Compression 85
2.8.3 Compiler Techniques 88
2.8.4 Ad hoc ISA Modification 89
2.9 ISA LEVEL CODE SIZE REDUCTION 90
2.10 CONCLUSIONS 91
3 BEHAVIOUR OF EMBEDDED CODES FOR RISC 93
3.1 MIBENCH BENCHMARKS 94
3.2 MEDIABENCH BENCHMARKS 98
3.3 MIMEDIA BENCHMARK SUITE 102
3.4 TYPICAL BEHAVIOUR OF EMBEDDED
APPLICATIONS 104
3.5 CONCLUSIONS 128
4 HYBRID INSTRUCTION ENCODING FOR
RISC CORES 129
4.1 MIPS ISA AND CODE WASTAGE 130
4.1.1 MIPS Instruction Set 130
4.1.2 MIPS Instruction Format 131
4.1.3 Wastage in MIPS32 Code 139
4.2 HIE1 METHODOLOGY FOR MIPS32 144
4.2.1 HIE1 RISC Instructions 145
4.2.2 Mapping MIPS32 ISA to HIE1 147
4.3 HIE1 EXPERIMENTAL RESULTS 150
4.3.1 Drawback of register size reduction 154
4.4 DESIGN OF HIE2 155
4.4.1 Impact of Reduction of immediate
and offset lengths to 15 bits 156
4.4.2 HIE2 Design for MIPS32 156
4.5 DISCUSSION ON HIE2 RESULTS 160
4.5.1 Reduction in Memory Accesses in HIE 162
4.5.2 Reduction in Redundant zeros in HIE 166
4.6 PROCESSOR MODIFICATIONS TO
SUPPORT HIE 170
4.7 CONCLUSIONS 172
5 REGISTER MEMORY ARCHITECTURE FOR
RISC CORES 174
5.1 MOTIVATION FOR RMA 174
5.2 METHODOLOGY FOR RMA ALU
INSTRUCTIONS 176
5.2.1 Formats for ADDrm Instruction for MIPS 177
5.2.2 Proposed RMA Opcodes 179
5.2.3 Estimates on Code Size Reduction 184
5.3 RESULTS AND DISCUSSION 185
5.4 PIPELINE MODIFICATIONS FOR RMA 190
5.5 CONCLUSIONS 194
6 HYBRID PROCESSOR FOR PORTABLE
EMBEDDED SYSTEMS 195
6.1 SoC DESIGN AND EMBEDDED SYSTEMS 196
6.1.1 Smart watch 196
6.1.2 Scanner 197
6.1.3 Smartphones 199
6.2 ENHANCEMENTS TO HIE AND RMA CODES 200
6.3 FUTURE ENHANCEMENT TO HIE-RMA 202
6.4 HIE AND ILP 202
6.4.1 Hybrid-Length Instructions and
Instruction fetch 203
6.4.2 Instruction Fetch and Cache Access 204
6.5 HIE-MIPS Vs microMIPS/Thumb2 208
6.6 DISCUSSION AND CONCLUSION 210
6.6.1 Summary of Contributions 212
6.6.2 Limitations of Described Research Work 213
6.6.3 Areas for Future Work 213
6.6.3.1 MicroMIPS and Thumb2
versions with HIE-RMA 214
6.6.3.2 Reconfigurable HIE-RMA version 214
6.6.3.3 Dynamic simulation 214
6.6.3.4 FPGA Processor design 214
6.6.3.5 Compiler tool chain 214
6.6.3.6 HIE-RMA-HAT Processor 214
REFERENCES 215
LIST OF PUBLICATIONS 221
APPENDIX 1 222
(MIDACC ARCHITECTURE)
A1.1 INTRODUCTION 222
A1.2 MIDACC INTERNALS 222
A1.2.1 MIDA Internals 223
A1.2.1.1 Instruction class distribution 223
A1.2.1.2 Instruction distribution 223
A1.2.1.3 MIPS Code redundant 0’s
Distribution 224
A1.2.1.4 Branch instruction distribution 224
A1.2.1.5 WASTIO Calculation 225
A1.2.1.6 Population of FTFI 225
A1.2.1.7 Registers usage behaviour 226
A1.2.1.8 Shift length usage 227
A1.2.1.9 Immediate field usage pattern 228
A1.2.1.10 Offset field usage pattern 229
A1.2.2 MICC Internals 229
A1.2.2.1 HIE1 code conversion 229
A1.2.2.2 HIE2 code conversion 230
A1.2.2.3 RMA Code Conversion 231
A1.2.2.4 RMA+HIE1 code conversion 234
A1.2.2.5 RMA+HIE2 code conversion 234
A1.3 MIDACC EXTENDER 234
APPENDIX 2 235
(MIDACC USER GUIDE)
A2.1 INTRODUCTION 235
A2.2 INSTALLING MIDACC 235
A2.3 INPUT FORMAT REQUIRED BY MIDACC 239
A2.4 USING MIDACC 240
A2.4.1 MIDA Tab 240
A2.4.2 MICC Tab 242
A2.4.2.1 HIE1 code conversion 243
A2.4.2.2 HIE2 code conversion 243
A2.4.2.3 RMA Code conversion 244
A2.4.2.4 RMA+HIE1 code conversion 245
A2.4.2.5 RMA+HIE2 code conversion 245
A2.5 SAMPLE OUTPUT OBTAINED USING MIDACC 246
A2.5.1 Code Analysis Report 246
A2.5.2 HIE1 Code Conversion Report 256
A2.5.3 HIE2 Code Conversion Report 257
A2.5.4 RMA Code Conversion Report 259
A2.5.5 RMA+HIE1 Code Conversion Report 260
A2.5.6 RMA+HIE2 Code Conversion Report 261
A2.6 CROSS COMPILATION PROCEDURE 261
A2.6.1 Using Sourcery Codebench for
Cross Compilation 262
A2.6.1.1 Building the C program 262
A2.6.1.2 Obtaining the assembly code 262
APPENDIX 3 264
(MIPS32 INSTRUCTION IDENTIFICATION TABLE)
APPENDIX 4 269
(HIE1-MIPS INSTRUCTION MAP)
APPENDIX 5 273
(HIE2-MIPS INSTRUCTION MAP)
A5.1 HIE2-MIPS INSTRUCTION MAP 273
A5.2 HYBRID LENGTH FIELDS ENCODING 276
TECHNICAL BIOGRAPHY 278
LIST OF TABLES
TABLE NO. TITLE PAGE NO.
1.1 Examples of Embedded Systems 3
1.2 Design Metrics for Embedded Systems 6
1.3 Types of Software Components in Embedded Systems 11
1.4 Typical SoCs and Applications 15
1.5 Extent of Data Transfer Instructions in CISC and RISC 24
2.1 Sample Instructions and processor actions 40
2.2 Addressing modes and mechanisms 45
2.3 Instruction cycle steps and actions for ADD instruction 51
2.4 Sample micro-operations 52
2.5 Typical instruction cycle phases in RISC processors 58
2.6 Typical Embedded Architectures and Processors 69
2.7 Processor types in Complex Embedded Systems 70
2.8 Typical Wastage of Bits in MIPS32 Instructions 81
3.1 MiBench Benchmarks 95
3.2 MediaBench Benchmarks 99
3.3 Embedded Applications for MiMedia Suite 103
3.4 Four types of offset / immediate byte patterns 117
3.5 Trends in Embedded Applications: WASTIO components 119
3.6 Code bloat factors for Embedded object codes for MIPS32 127
4.1 MIPS Instruction Formats 131
4.2 MIPS Instruction Fields 132
4.3 MIPS32 Integer instructions and actions 134
4.4 MIPS32 Instructions, opcodes and redundant zeros 140
4.5 Sample Encoding of Offset/Immediate Field in HIE-MIPS 147
4.6 MIPS32 ISA to HIE1 RISC ISA Mapping 148
4.7 Mapping MIPS32 ISA to HIE2 ISA 160
4.8 Comparison of Code reduction schemes HIE1 and HIE2 161
4.9 Average Instruction Size in HIE 164
4.10 Memory Cycles for Instruction Fetch in HIE 165
4.11 Comparison of RZs of Embedded Applications 167
4.12 Typical Code Size Reduction of Embedded
Applications in HIE 168
4.13 Relationship between RZ, WASTIO AND HIE PCR 169
5.1 Data transfer Vs Arithmetic Instructions in
Embedded Applications 175
5.2 MIPS ADD Instructions 177
5.3 Proposed RMA ADD Instructions for MIPS 178
5.4 Types of ALU instruction for RMA load sequence 181
5.5 Types of ALU instruction for RMA store sequence 181
5.6 RMA Instructions Corresponding to MIPS32
Instructions for Load Sequence 183
5.7 RMA Instructions Corresponding to MIPS32
Instructions for Store Sequence 183
6.1 Impact of Integration of Code reduction schemes
HIE2 and RMA 201
A3.1 MIPS32 Instruction Identification Table 264
A4.1 HIE1-MIPS instruction map 269
A4.2 IID Field Encoding 272
A5.1 HIE2-MIPS INSTRUCTION MAP 273
A5.2 hl Encoding for G1 Type Instructions 276
A5.3 hl Encoding for G2 Type Instructions 277
A5.4 hl Encoding for G3 Type Instructions 277
LIST OF FIGURES
FIGURE NO. TITLE PAGE NO.
1.1 Block diagram of a pacemaker 4
1.2 Embedded Systems Model 8
1.3 Software Components in Embedded Systems 10
1.4 Organisation of an Embedded SoC 14
1.5 Block diagram of digital camera 18
1.6 The independence of processor and IC technologies 19
1.7 CISC scenario 20
1.8 RISC scenario 22
1.9 ISA Lexical Level 23
1.10 Variable Instruction Encoding Format 25
1.11 Fixed Instruction Encoding Format 26
1.12 Hybrid Instruction Encoding Format 26
1.13 Memory Trends in SoC 29
1.14 Extent of Embedded memory in the die area in SoCs 30
1.15 Multiple embedded memory IPs in multicore SoC 30
2.1 IBM S370 Instruction Formats 47
2.2 INTEL Pentium Pro Instruction Formats 48
2.3 MIPS32 Instruction Formats 48
2.4 A six stage instruction pipeline 53
2.5 (a) Five stage pipeline 55
2.5 (b) Timing Diagram 56
2.5 (c) RISC Pipeline as a series of datapaths 56
2.6 Superscalar Processor Organisation 61
2.7 VLIW Processor Organisation 62
2.8 Use of Cache memory 63
2.9 Virtual memory concept 65
2.10 Virtual memory mechanism 66
2.11 A Quad-core CPU 67
2.12 SPARC64 VII Processor 67
2.13 IBM Codepack Code Compression for Power PC 86
2.14 Dictionary based compression 87
2.15 Decompression procedure for the dictionary
based compression 87
2.16 Memory map of variable instruction stream 92
3.1 Utilized and unutilised instructions in Embedded codes 107
3.2 Distribution of utilized and unutilized instructions in
Embedded domains 108
3.3 Frequency of integer instructions in Embedded codes 109
3.4 Frequency of instructions usage in Embedded domains 110
3.5 Population of FTFI in Embedded codes 111
3.6 Distribution of FTFI in Embedded domains 112
3.7 Usage of full 16 bit immediate by Embedded codes 113
3.8 Trends in usage of 16 bit immediate in Embedded domains 114
3.9 Extent of usage of full 16 bit offset by embedded codes 115
3.10 Trends in usage of 16 bit offset field in Embedded domains 116
3.11 WASTIO Percentages in Embedded applications 118
3.12 Extent of WASTIO in Embedded domains 119
3.13 WASTIO distribution in Embedded domains 120
3.14 Usage of more than 16 registers by Embedded applications 121
3.15 Usage of more than 16 registers in Embedded domains 122
3.16 Usage of more than 16 bit shifts in Embedded applications 123
3.17 Frequency of more than 16 bit shifts in Embedded domains 124
3.18 Extent of branch instructions in Embedded applications 125
3.19 Usage of branch instructions in Embedded domains 126
4.1 MIPS R2000 instruction map 133
4.2 MIPS R2000 registers 138
4.3 Format of and instruction in MIPS32 ISA 142
4.4 addiu instruction with immediate field containing zero value 142
4.5 addiu instruction with only most significant byte of
immediate as zero 142
4.6 addiu instruction with only least significant byte of
immediate as zero 143
4.7 addiu instruction with both bytes of immediate field as
non-zero value 143
4.8 HIE1 RISC Instruction Formats 145
4.9 R Type instruction in HIE1 146
4.10 Effect of HIE1 on Automotive and Industrial
Control Benchmarks 150
4.11 Effect of HIE1 on Network Benchmarks 151
4.12 Effect of HIE1 on Video and Audio Benchmarks 151
4.13 Effect of HIE1 on Image Benchmarks 152
4.14 Effect of HIE1 on Speech Benchmarks 152
4.15 Effect of HIE1 on Security Benchmarks 153
4.16 Effect of HIE1 on Text Benchmarks 153
4.17 Effect of HIE1 on Embedded Segments 154
4.18 HIE2 instruction formats 159
4.19 Code Reduction Comparison between HIE1 and HIE2 162
5.1 RMA Instruction Format – RM Type 179
5.2 RMA Instruction Format – IM Type 179
5.3 (a) Format of LW Instruction 180
5.3 (b) Format of SW Instruction 180
5.4 R-Type ADD instruction in MIPS 182
5.5 I-Type ADD instruction in MIPS 182
5.6 Comparison of object codes of LSA and RMA 185
5.7 Code size Reduction due to RMA for Automotive Domain 186
5.8 Code size Reduction due to RMA for Network Domain 186
5.9 Code size Reduction due to RMA for Video and
Audio domains 187
5.10 Code size Reduction due to RMA for Image Domain 187
5.11 Code size Reduction due to RMA for Speech Domain 188
5.12 Code size Reduction due to RMA for Security Domain 188
5.13 Code size Reduction due to RMA for Text Domain 189
5.14 Comparison of Code size Reduction due to RMA
for Embedded Domains 189
5.15 Proposed 6-Stage RMA Pipeline 191
5.16 Execution of LSA Instructions in 6-Stage RMA Pipeline 192
5.17 Execution of LSA Instruction in 5-stage RMA pipeline 193
5.18 Execution of RMA ADDrm Instruction in 5-Stage
RMA pipeline 193
6.1 Block diagram of smart watch 197
6.2 Block diagram of a scanner 198
6.3 Block diagram for the Snapdragon S4 SoC
using Krait CPUs 200
6.4 Two stage instruction Decoding 206
6.5 Predecoding and Marking Instruction Lengths 206
6.6 Heads and Tails Format 207
6.7 HIE2-MIPS Instruction Types in HAT Scheme 209
6.8 Variable-length decoding in a HAT Scheme 210
A1.1 Functional block diagram of MIDACC 222
A1.2 Algorithm for WASTIO calculation 225
A1.3 Algorithm for Population of FTFI 226
A1.4 Algorithm for Shift Length usage computation 228
A1.5 HIE1 code conversion algorithm 230
A1.6 HIE2 code conversion algorithm 231
A1.7 Overview of RMA code conversion for load sequence 232
A1.8 Overview of RMA code conversion for store sequence 233
A1.9 RMA+HIE1 code conversion 234
A1.10 RMA+HIE2 code conversion 234
A2.1 Snapshot of MIDACC Suite installation folder 235
A2.2 Snapshot of MIDACC Suite welcome screen 236
A2.3 Snapshot of MIDACC installation screen 236
A2.4 Snapshot of MIDACC installation process 237
A2.5 Snapshot of MIDACC installation status 237
A2.6 Snapshot of MIDACC installation completion 238
A2.7 Snapshot of MIDACC icon in desktop and start menu 238
A2.8 Snapshot of MIDACC Suite tool 239
A2.9 Snapshot of assembly code of SUSAN 239
A2.10 Snapshot of input format accepted by MIDACC 240
A2.11 Snapshot of MIDA Tab 240
A2.12 Snapshot of code analysis process using MIDA Tab 241
A2.13 Snapshot of MICC Tab 242
A2.14 Snapshot of HIE1 Code conversion process 243
A2.15 Snapshot of HIE2 Code conversion process 244
A2.16 Snapshot of RMA Code conversion process 244
A2.17 Snapshot of RMA+HIE1 Code conversion process 245
A2.18 Snapshot of RMA+HIE2 Code conversion process 246
LIST OF SYMBOLS AND ABBREVIATIONS
3D - Three Dimensional
AC - Address Calculation
ADPCM - Adaptive Differential Pulse Code Modulation
ALU - Arithmetic Logic Unit
ASIC - Application Specific Integrated Circuit
ASIP - Application Specific Instruction set Processor
BDTI - Berkeley Design Technology Inc
BOPES - Battery Operated Portable Embedded Systems
CAN - Controller Area Network
CBF - Code Bloat Factor
CCD - Charge-Coupled Device
CCRP - Compressed Code RISC Processor
CISC - Complex Instruction Set Computing
CLB - Cache Line address Lookaside Buffer
CM - Cache Memory
CMOS - Complementary Metal Oxide Semiconductor
CODEC - Coder/Decoder
COM - Serial communication interface
CONMANIP - Constant Manipulation
CPI - Clock cycles Per Instruction
CPU - Central Processing Unit
CRC - Cyclic Redundancy Check
DM - Data Move
DMA - Direct Memory Access
DSP - Digital Signal Processing / Processor
EEMBC - Embedded Microprocessor Benchmark Consortium
EPIC - Efficient Pyramid Image Coder
ESC - Escape
EX - Execute
FAT - File Allocation Table
FFT - Fast Fourier Transform
FIE - Fixed Instruction Encoding
fn - Function
FPGA - Field Programmable Gate Array
FTFI - Frequently used Top Four Instructions
FTP - File Transfer Protocol
GB - Giga Byte
GCC - Gnu Compiler Collection
GPP - General Purpose Processor
GPR - General Purpose Register
GPS - Global Positioning System
GSM - Global System for Mobile Communications
HAT - Heads And Tails
HDL - Hardware Description Language
HIE - Hybrid Instruction Encoding
HIE1 - Hybrid Instruction Encoding version1
HIE2 - Hybrid Instruction Encoding version2
HIE-MIPS - MIPS with HIE ISA
HIE-RMA - HIE with RMA
HIE-RMA-MIPS - MIPS with both HIE and RMA ISA
HLL - High Level Language
HTML - Hyper Text Markup Language
HTTP - Hyper Text Transfer Protocol
HTTPS - Hyper Text Transfer Protocol Secure
I/O - Input / Output
IC - Integrated Circuit
ID - Instruction Decode
IEEE - The Institute of Electrical and Electronics Engineers
IF - Instruction Fetch
ILP - Instruction Level Parallelism
IM - Immediate and Memory
IOT - Internet Of Things
IP - Intellectual Property
IR - Instruction Register
IrDA - Infrared Data Association
ISA - Instruction Set Architecture
JPEG - Joint Photographic Experts Group
KB - Kilo Byte
LAT - Line Access Table
LSA - Load Store Architecture
LSB - Least Significant Bit / Byte
LSI - Large Scale Integration / Load and Store Instruction
MAR - Memory Address Register
MBR - Memory Buffer Register
MEM - Memory Access
MICC - MIPS Instruction Code Converter
MIDA - MIPS Instruction Distribution Analyser
MIDACC - MIPS Instruction Distribution Analyser cum Code
Converter
ML - Machine Language
MMS - Multimedia Messaging Service
MMU - Memory Management Unit
MP3 - MPEG-1 or MPEG-2 Audio Layer III
MPEG - Moving Pictures Experts Group
MSB - Most Significant Bit / Byte
NOP - No Operation
NP - Network Processor
NRE - NonRecurring Engineering cost
OCR - Optical Character Recognition
OPX - Opcode Extension
OS - Operating System
PC - Program Counter
PCM - Pulse Code Modulation
PCR - Percentage Code Reduction
PDA - Personal Data Assistant
PGP - Pretty Good Privacy
PIM - Personal Information Manager
PLA - Programmable Logic Array
PLD - Programmable Logic Device
PMD - Personal Mobile Device
RGB - Red-Green-Blue
RISC - Reduced Instruction Set Computing
RM - Register and Memory
RMA - Register Memory Architecture
ROM - Read Only Memory
RTOS - Real-time Operating System
RZ - Redundant Zero
SD - Secure Digital
SDT - Software Dynamic Translator
SIMD - Single Instruction Multiple Data
SMS - Short Messaging Service
SoC - System-On-a-Chip
SPEC - System Performance Evaluation Corporation
SPP - Single Purpose Processor
TCP/IP - Transmission Control Protocol / Internet Protocol
TDMA/FDMA - Time-and Frequency-Division Multiple Access
TIFF - Tag Image File Format
TLB - Translation Lookaside Buffer
TRZ - Total Redundant Zeros
TV - Television
UART - Universal Asynchronous Receiver Transmitter
USB - Universal Serial Bus
VLIW - Very Long Instruction Word
VLSI - Very Large Scale Integration
WAP - Wireless Application Protocol
WASTIO - Wastage in Immediate and Offset Fields
WB - Write Back
1. PORTABLE EMBEDDED SYSTEMS AND CODE SIZE
The application of computers has been growing rapidly, spreading to
every field of life. The desire for better performance, reliability and cost
reduction has been fulfilled by newer design concepts and techniques in
both hardware and software. Embedded processing is the new generation
of computing that is revolutionising the way people live and act.
A wide range of smart and low-cost devices such
as digital watches, cell phones, digital cameras, and portable video games
has penetrated everyone's life. As per the forecast by the Linley Group [1],
the embedded processor market in 2015 will exceed $4.0 billion. The
emergence of the Internet of Things (IoT) and the demand for smart
devices in every aspect of life are driving a complete overhaul of traditional
wisdom in the embedded industry.
Most embedded devices are battery-powered and designed around a
System-on-a-Chip (SoC). In Battery Operated Portable Embedded
Systems (BOPES), reductions in size, cost and power consumption are the
primary requirements, unlike in servers and desktops, where
performance is the primary requirement. One of the factors contributing to
increased product cost, size and power consumption is the large code
size of the embedded application software. Core-based design with
predefined and pre-verified modules is the state-of-the-art
design strategy in modern SoCs.
Due to the increasing complexity of embedded systems, the size of
embedded programs keeps growing, and hence the code memory occupies
the largest share of the total die area, more than the area of the
microprocessor core and the other on-chip modules. For example, in a
high-end hard disk drive [2], the processor occupies a silicon area of
6 mm², whereas the code memory takes 20-40 mm². As a result, apart
from increased chip space and cost, the power consumption also
increases. Hence minimising code size is an essential requirement in
BOPES, especially in biomedical embedded systems such as pacemakers
and prosthetic devices. In keeping with modern technology trends,
BOPES deserve an architectural-level solution to minimise code size. The
main goal of this work is to provide an efficient Instruction Set Architecture
(ISA) for RISC processor cores so as to produce minimum object code in
BOPES designed around SoCs.
1.1 EMBEDDED SYSTEMS
An embedded system is a computing system that is embedded
within a larger electronic device, performing one or more fixed functions.
Embedded systems are pre-programmed by the developer with built-in
application program(s). The wide spectrum of embedded systems includes
a variety of applications, as illustrated in Table 1.1. Embedded computers
have the widest spread of processing power and cost. Low-end
embedded processors cost less than a dime, medium-scale embedded
processors cost under $5, and high-end processors cost around $100.
Although embedded devices cost less than personal computers,
their sales volumes are huge. In the year 2010 alone, 19 billion embedded
processors were sold, compared to 1.8 billion Personal Mobile Devices
(PMD), 350 million desktop PCs, and 20 million servers [3].
Like the variation in cost, there is wide variation in the requirements of
different embedded applications. For certain embedded applications such
as network switches, avionics systems, video phones etc., high processor
performance is a critical requirement. In certain embedded systems such
as toys, scanners, washing machines, microwave ovens etc., size and cost
are the critical aspects instead of performance. In certain other embedded
applications such as cell phones, tablet computers etc., power consumption
is important, as the major requirement is to minimise the frequent
recharging of the battery. In prosthetic devices, both size and power
consumption are critical factors.
Table 1.1: Examples of Embedded Systems
Nature of Application Selected examples
Automotive Transmission control, cruise control, fuel injection,
antilock brakes, active suspension, navigation
Consumer electronics
Cell phones, digital cameras, camcorders,
calculators, personal digital assistants, smart
briefcase, smart watch, toys, games
Home appliances
Washing machines, microwave ovens, answering
machines, thermostats, home security systems,
lighting systems, TV set-top boxes, battery
chargers, smart phones, remote controls, coffee
maker, cooker, smart refrigerator, clothes dryer,
MP3 player, smart speakers, trash compactor,
thermostat, Personal Data Assistants (PDAs)
Office automation Fax machines, photocopiers, printers, scanners,
monitors, multifunction device
Business equipment
Alarm systems, card readers, cash registers,
product scanners, automated teller machines,
automatic toll systems, electronic instruments,
point of sales terminals
Biomedical and
healthcare
Patient monitoring system, pacemaker, blood
pressure monitor, electronic stethoscopes,
medical imaging, smart bed, electric wheelchair,
ambulance, hearing aid, prosthetic devices
Defence Wearable computer, signal tracking systems,
missiles
Industrial control Robotics, Factory control
Entertainment Music systems
Communications Routers, modems, network switches, network
bridges, hubs, gateways, satellites
Computer peripherals Hard disk drives, network adapters, printers
Special
Avionic systems, life support systems,
teleconferencing systems, satellite phones,
robots, traffic light controller, police vehicle, fire
control, video conferencing, elevators
An artificial cardiac pacemaker, a typical example of the application
of BOPES in health care, is a critical system used to treat patients
with various heart conditions in which the natural pacemaker is affected [4].
It is an electronic device placed under the skin near the heart that
delivers simulated paces to the heart using electric impulses. Figure 1.1
shows the block diagram of a pacemaker, which contains a processor
functioning as the controller.
Figure 1.1: Block diagram of a pacemaker (blocks: Power Source,
Sensing Unit, Control Unit, Pacing Unit, Leads, Electrodes)
The pacemaker is a hermetically sealed device containing a power
source, usually a lithium battery; a sensing amplifier, which processes the
electrical manifestation of naturally occurring heart beats as sensed by the
heart electrodes; the processor, acting as the control logic for the
pacemaker; and the output circuitry, which delivers the pacing impulse to the
electrodes. Much advancement has been made possible by
microprocessor-controlled pacemakers. Instead of producing a static,
predetermined heart rate, a dynamic pacemaker compensates for both
actual respiratory loading and potentially anticipated respiratory loading.
Dual-chamber pacemakers control both the ventricles and the atria,
timing the contractions of the atria to precede those of the ventricles,
thereby improving the pumping efficiency of the heart; this can be useful in
congestive heart failure. Rate-responsive pacing allows the device to sense
the physical activity of the patient and respond appropriately by increasing
or decreasing the base pacing rate via rate-response algorithms. The
implanted pacemaker is a battery-operated real-time embedded system
which must be small in size, light in weight, and must operate at low
power to increase battery life. Pacemakers are programmed with tens of
thousands of lines of code. It is obvious that size and battery life are
more important parameters in a pacemaker than speed.
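The rate-responsive behaviour described above can be illustrated with a minimal sketch. This is illustrative pseudologic only, not pacemaker firmware; the names target_rate and needs_pace and the linear rate-response curve are assumptions made for the example:

```python
def target_rate(activity, base_rate=60, max_rate=120):
    """Map a normalised activity level (0.0-1.0) to a pacing rate in
    beats per minute, assuming a simple linear rate-response curve."""
    activity = max(0.0, min(1.0, activity))   # clamp the sensor reading
    return base_rate + activity * (max_rate - base_rate)

def needs_pace(ms_since_last_beat, rate_bpm):
    """Demand pacing: deliver an impulse only when no intrinsic beat
    has been sensed within the current pacing interval."""
    interval_ms = 60_000 / rate_bpm           # pacing interval in ms
    return ms_since_last_beat >= interval_ms

# At rest the device paces at the base rate; at full activity it
# ramps up to the maximum rate.
assert target_rate(0.0) == 60
assert target_rate(1.0) == 120
assert needs_pace(1100, 60)        # >1 s without a beat at 60 bpm: pace
assert not needs_pace(400, 60)     # intrinsic beat seen recently: inhibit
```

A real device would use a clinician-programmed response curve and refractory periods, but the sketch captures why such control logic is small, event-driven code for which battery life, not throughput, is the binding constraint.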
1.1.1 Characteristics of Embedded Systems
An embedded system is an applied computer system that has
several characteristics distinguishing it from other types of computing
systems. The main difference is that the embedded system is not used for
general purpose computing but designed for one or more dedicated
applications. On the other hand, a general purpose system is designed to
perform a variety of tasks as per the user's choice. For example, a digital
camera is an embedded system always used as a camera. In contrast, a
desktop computer is used for running a variety of application programs like
spreadsheets, word processors, games etc. In most embedded systems,
the users are not even aware of the presence of a microprocessor inside
the system.
Embedded systems have tight constraints on design metrics. There
are several design metrics for an embedded system as listed in Table 1.2.
Table 1.2: Design Metrics for Embedded Systems

NRE cost: Nonrecurring engineering cost, the initial cost of designing and
testing the system. Remarks: one-time nonrecurring cost; multiple units can
be manufactured without any additional design cost.

Unit cost: Manufacturing cost of each copy of the system. Remarks:
excludes NRE cost.

Size: Physical space required. Remarks: measured in bytes / gates /
transistors.

Performance: Instruction execution time of the system. Remarks: smaller
execution time means higher performance.

Power: Amount of power consumed by the system. Remarks: determines
the lifetime of a battery, or the cooling requirements; decides the frequency
of recharging the battery.

Flexibility: Ability to change the functionality of the system. Remarks:
should not incur heavy NRE cost.

Time-to-prototype: Time needed to build a working version of the system.
Remarks: a prototype helps verify the system's usefulness and correctness.

Time-to-market: Time required to develop a system before releasing it to
the customers. Remarks: includes design time, manufacturing time, and
testing time.

Maintainability: Ability to modify the system after its initial release.
Remarks: the original designers need not be available.

Confidence: Correct implementation of the system's functionality. Remarks:
addition of test circuitry may be required.

Safety: Probability that the system will not cause harm. Remarks: built-in
safety measures may be required.
There are exceptions to each of these constraints; cars, avionics
systems, and medical imaging devices are some examples of embedded
systems in which one or more of these are not satisfied. Often metrics
compete with one another; improving one may affect another. For example,
if an implementation's size is reduced, performance of the implementation
may suffer. Hence optimization of these metrics is a challenge for an
embedded system designer.
Certain embedded systems are required to provide real-time
response. Examples of such systems are pacemakers, flight control
systems of aircraft, and sensor systems in nuclear reactors and power
plants. These embedded systems must continuously sense and monitor
changes in the system's environment, compute certain results, and
respond in real time within a specified time limit. In other words, a portion
of the application program has an absolute maximum execution time, and a
certain set of tasks must be completed within the fixed amount of time.
There are two categories of real-time systems: hard and soft. In a hard
real-time system, missing a deadline may cause damage, in which case the
system is considered to have failed. For example, a car's cruise controller
must react to speed and brake sensors and compute the acceleration or
deceleration amount within a limited time. A failure to meet the deadline
means loss of control of the car. Avionics, automotive safety and control
systems, and weapons systems are typical examples of hard real-time
systems. In soft real-time systems, timely response with small delays is
acceptable. Examples of soft real-time systems are the scheduling display
system on the railway platforms, washing machines, live audio-video
systems and toys. In these systems, occasional violation of constraints
results in degraded quality, but the systems can continue to operate.
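The distinction between hard and soft deadlines can be made concrete
with a small sketch. This is illustrative Python, not pacemaker or
cruise-control code; the function name and deadline values are invented for
the example:

```python
import time

def run_with_deadline(task, deadline_s, hard=True):
    """Run a task and check its response time against a deadline.

    In a hard real-time system a missed deadline is a failure; in a
    soft real-time system it only degrades quality.
    """
    start = time.monotonic()
    result = task()
    elapsed = time.monotonic() - start
    if elapsed > deadline_s:
        if hard:
            raise RuntimeError("hard deadline missed: system has failed")
        # Soft real-time: note the degraded quality and keep operating.
    return result

# A fast computation comfortably meets a one-second deadline.
assert run_with_deadline(lambda: 2 + 2, deadline_s=1.0) == 4
```

A real system would of course enforce deadlines in the scheduler rather
than around each call; the sketch only captures the failure semantics.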
1.1.2 Architecture of Embedded Systems
Architecture and usage pattern of embedded systems differ from
those of general purpose desktops and servers [5]. The process of
embedded system design and development described by Tammy
Noergaard [6] consists of four phases namely creating the architecture,
implementing the architecture, testing the system, and maintaining the
system. At the highest level, a commonly used primary architectural tool is
the Embedded System's Model, illustrated in Figure. 1.2. The hardware
layer is present in all embedded systems but the system software layer and
application software layer may exist either as independent layers or as a
combined layer depending on the complexity of the embedded system.
Figure. 1.2: Embedded Systems Model
In terms of workload, there are basically three different styles in
embedded systems: controlling, switching and routing, and media
processing [7]. Examples of controlling workloads are found in various
appliance, automotive, and industrial environments. These applications
have light computations and a tight coupling with a set of peripherals and
sensors. A strong real-time component is invariably present in such
control-dominated applications. The switching and routing category
involves control applications that handle streams of data in networking
applications. These workloads have to move large amounts of data at short
intervals in real time; as a result, buffering capabilities are required. Due to
the need for concurrent processing of multiple independent streams,
efficient multithreading support is essential. The media processing category
involves
multimedia applications such as video, audio and images, which are
complex and diverse and require a high level of computational
performance. Apart from real-time restrictions, heavy computational
workloads and large memory bandwidth and capacity requirements are
typical constraints in these embedded systems, and hence dedicated
special hardware support is provided along with the embedded CPU.
1.1.3 Embedded Software
The wide spectrum of embedded computing applications is broadly
divided into four different types: image processing and consumer market,
communications market, automotive market, and special area markets
such as medical, military, industrial control and avionics [7]. In some of
these applications, real time performance is critical but in others, size, cost
and power consumption are critical rather than performance. In most
embedded systems, the primary goal is achieving the required performance
at a minimum price rather than attaining higher performance at a higher
price [3]. The entire software of an embedded system is placed in ROM or
flash memory since the embedded system is not user programmable.
Figure. 1.3 shows the typical software components needed to control an
embedded device.
A Real-Time Operating System (RTOS) is a computing environment
that reacts to input within a specific time period. A real-time deadline can
be so small that system reaction appears instantaneous. Some RTOS
implementations are very complete and very robust, while other
implementations are very simple and suited for only one particular purpose.
An RTOS may be either event-driven or time-sharing. An event-driven
RTOS is a system that changes its state only in response to an incoming
event. A time-sharing RTOS is a system that changes its state as a
function of time.
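The two kernel styles can be contrasted with a minimal sketch. This is
illustrative Python; the handler and task names are hypothetical, and a
real RTOS would dispatch from interrupts or a hardware timer rather than
from Python lists:

```python
from collections import deque

def event_driven_loop(events, handlers):
    """Event-driven RTOS sketch: the system changes state only in
    response to an incoming event."""
    pending, log = deque(events), []
    while pending:
        ev = pending.popleft()
        log.append(handlers[ev](ev))      # state change driven by the event
    return log

def time_sharing_loop(tasks, n_slices):
    """Time-sharing RTOS sketch: state advances as a function of time,
    each ready task receiving one slice per round."""
    log = []
    for tick in range(n_slices):
        for name, step in tasks:          # round-robin over the task set
            log.append((tick, name, step()))
    return log

handlers = {"button": lambda ev: f"handled {ev}"}
assert event_driven_loop(["button"], handlers) == ["handled button"]
assert time_sharing_loop([("t1", lambda: "ran")], n_slices=2) == [
    (0, "t1", "ran"), (1, "t1", "ran")]
```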
Figure. 1.3: Software Components in Embedded Systems
A kernel is the central core of an operating system; it takes care
of the core OS jobs: booting, task scheduling, and maintaining the standard
function libraries. In an embedded system, there is rarely enough memory
to maintain a large function library, and hence only essential functions must
be included.
kernel will boot the system and initialize the ports and the global data items.
Then, it will start the scheduler and instantiate any hardware timers that
need to be started. Finally, the kernel gets dumped out of memory (except
for the library functions, if any), and the scheduler will start running the child
tasks. The kernel of a real-time operating system provides an "abstraction
layer" that hides from application software the hardware details of the
processor (or set of processors) upon which the application software will
execute.
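The boot sequence just described can be mirrored in a toy sketch. All
names here (ports, timers, tasks) are hypothetical stand-ins, not a real
kernel API; the point is only the order of the steps:

```python
def boot_kernel(child_tasks):
    """Sketch of the kernel boot sequence described above:
    initialise ports and global data items, instantiate hardware timers,
    then hand control to the scheduler, which runs the child tasks."""
    ports = {"uart0": "ready"}          # initialise the ports
    global_data = {"tick_ms": 10}       # initialise global data items
    timers = ["hw_timer0"]              # instantiate hardware timers
    # The kernel's own job is done; the scheduler now runs the child tasks.
    results = [task() for task in child_tasks]
    return ports, global_data, timers, results

_, _, _, out = boot_kernel([lambda: "task-A done", lambda: "task-B done"])
assert out == ["task-A done", "task-B done"]
```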
In addition to the core operating system, many embedded systems
have additional upper-layer software components. These components
consist of networking protocol stacks like CAN, TCP/IP, FTP, HTTP, and
HTTPS, and also include storage capabilities like FAT and flash memory
management systems. If the embedded device has audio and video
capabilities, then appropriate drivers and codecs will be present in the
system.
Most embedded systems are architecturally simpler and do not use
advanced memory management concepts such as virtual memory, and do
not have a hard disk. Certain portable embedded systems may not be able
to bear the cost of an RTOS, or may have very simple scheduling
requirements that can be managed by a simple monitor, eliminating the
need for an operating system. There are certain products such as
automobiles that have multiple embedded systems. Present-day
automobiles have hundreds of processors. Table 1.3 illustrates various
software components required in some typical embedded systems [8].
Table 1.3: Types of Software Components in Embedded Systems
Embedded System | Software Components / functions
Smart card 1. Boot-up, initialisation and OS programs
2. Smart card secure file system
3. Connection establishment and termination
4. Communication with host
5. Cryptography algorithm
6. Host authentication
7. Card authentication
8. Saving additional parameters sent by the host
Digital camera 1. CCD signal processing for offset correction
2. JPEG coding
3. JPEG decoding
4. Pixel processing before display
5. Memory and file systems
6. Light, flash and display device drivers
7. COM, USB port and Bluetooth device drivers for port
operations for printer and communication control
Table 1.3 (Continued)
Embedded System | Software Components / functions
Mobile phone 1. Memory and file systems
2. Keypad, LCD, serial, USB, 3G or 2G port device drivers for
port operations for keypad, printer and computer
communication control
3. SMS (Short Messaging Service) message creation and
communicator, contact and PIM (personal information
manager), task-to-do manager and email
4. Mobile imager for uploading pictures and Multimedia
Messaging Service (MMS)
5. Mobile browser for access to the web
6. Downloader for Java games, ringtones, games, wall
papers
7. Simple camera
8. Bluetooth synchronization, IrDA and WAP connections
support
Mobile
computer
1. OS
2. Touch screen GUIs, memory and file systems
3. Memory stick
4. Outlook, Internet explorer, Word, Excel, PowerPoint, and
handwritten text processor
5. Applications or enterprise software
Automobile 1. Engine control
2. Speed control and brake
3. Safety systems
4. Seat and pedal controls
5. Car environment controls
6. Route and traffic monitors
7. Automobile status monitoring
8. System interfaces for commands, voice activation, and
interfacing
9. Infotainment systems
Table 1.3 (Continued)
Embedded System | Software Components / functions
Hard disk drive 1. Motor control
2. Data decoding
3. Disk scheduling
4. On-disk management tasks
5. Off-disk management tasks
Pacemaker Basic functions: sensing, pacing, and lead impedance
measurement
1.2 SoC ARCHITECTURE
Thanks to advancements in VLSI design, most of today's portable
embedded systems are designed as a System-on-a-Chip (SoC), in which
the logic of multiple IC chips is implemented on a single die, thereby
housing the entire embedded system on one chip. A block diagram
illustrating the organization of a typical SoC is given in Figure. 1.4.
Different components
organization of a typical SoC is given in Figure. 1.4. Different components
of the SoC may be of different technologies. For example, a SoC may
consist of one or more embedded microprocessors/microcontrollers, digital
signal processors (DSP), application specific circuits and memory. SoCs
are complex integrated circuits and permit integration of blocks from
several vendors. These components are sold as Intellectual Property (IP)
in three forms: hard cores, firm cores, and soft cores. A hard core is a
physical description of the IP design provided in a
variety of physical file formats. It is best for plug-and-play use since it is
already fully tested, but it is less portable and flexible than the other two
types of cores. The firm core carries structural description of a component
typically provided in a Hardware Description Language (HDL) and is
configurable to various applications.
Figure. 1.4: Organisation of an Embedded SoC
The most flexible of the three different cores, the soft core is a
synthesizable behavioral description of a component and exists either as a
netlist (a list of the logic gates and associated interconnections making up
an integrated circuit) or as HDL code. The facility of design reuse of SoC
components using IP cores helps designers to reduce development time
since a new SoC design need not start from scratch. Table 1.4 lists some
typical commercial SoCs and their applications. Multicore SoCs and
Programmable SoCs are common in today's BOPES. In a typical SoC, the
memory occupies over 60% of the chip area [8].
Table 1.4: Typical SoCs and Applications
SoC Name Manufacturer Typical Application
Tegra3 Nvidia Tablet
Snapdragon Qualcomm Tablet, Smart phone
AZZ10 Intel Mobile device
Edison Intel Tiny computer
Exynos 5 Samsung Tablet
OMAP4430 Texas Instruments Google Glass
Zynq 7000 Xilinx Automotive, aerospace & defence,
broadcast, consumer, industrial,
medical, communications
VC2100 Agere Disk controller
ST L7250 ST Microelectronics Disk controller
IXP 1200 Intel Network processor
EP9312 Cirrus Logic Audio processor
OMAP 1510 Texas Instruments Mobile multimedia
ASMgrid Accent Home Area Networking
1.3 BOPES DESIGN PHILOSOPHY AND PROCESSOR TECHNOLOGY
The desired functionality of an embedded application can be
implemented on any of the three different processor types: Programmable
General-Purpose Processor (GPP), Single Purpose Processor (SPP), and
Application-Specific Instruction set Processor (ASIP). In practice, a
combination of such processors is used in designing an embedded system
in order to optimize a system's design metrics.
The general-purpose processor is designed for a given Instruction
Set Architecture (ISA) with a microarchitecture that is not known to the
application software designers. The embedded system designer merely
programs the processor for the desired functionality by developing suitable
programs and storing them in the program memory of the processor. This
approach offers several design metric benefits. Since the embedded
system designer needs to do mere program development and not digital
design, time-to-market and NRE costs are low. Also, there is a high degree
of flexibility as the designer can change the functionality by merely
replacing the program. Being a general approach, it may give slow
performance for certain embedded applications. Since the processor is a
standard commercial product manufactured in large quantities, even small
quantities are available at low cost to the embedded system developer,
and hence the unit cost of the embedded system works out to be low.
However, if the embedded system is to be sold in large quantities, then
choosing an application-specific processor will result in a cheaper product.
Owing to the fixed processor hardware, size and power may be large for
certain embedded applications. A general-purpose processor may have any of
different architectures such as scalar processor, vector processor, array
processor, superscalar processor, VLIW processor, and multicore
processor [9]. A given processor may offer instruction level parallelism; the
superscalar and VLIW are two different approaches, hardware and
software, to instruction level parallelism. The multicore processor houses
more than one processor core in a single chip thereby providing processor
level parallelism.
A single purpose processor is a digital circuit capable of executing
only one type of program. Some common examples of single-purpose
processors are UART, DMA controller, JPEG codec etc. The embedded
system designer can either pick a suitable pre-designed single-purpose
processor from the market or create a custom digital circuit for the single-
purpose processor. The advantages of using single-purpose processor in
embedded system design are higher performance, smaller size, smaller
power and low unit cost (for large quantities). The disadvantages are higher
design time, higher NRE costs, low flexibility and higher unit cost (for small
quantities). For some applications, performance may be lower compared to
the designs using general-purpose processors. The single-purpose
processor is also known by several popular names: coprocessor,
accelerator, peripheral etc.
An ASIP is a programmable processor for a specific type of
application such as digital-signal processing, or telecommunications. The
architecture of such a processor is optimized for giving better performance
for the target application type. Inclusion of special functional units and
exclusion of infrequently used functional units are two common strategies
used in designing an ASIP. The advantages of ASIP approach to
embedded system design are good performance, small size and small
power. A large NRE cost is the disadvantage of ASIP approach. Digital
Signal Processors (DSPs), Network Processors (NPs), and
microcontrollers are typical examples of ASIPs.
A digital camera is a typical example of a BOPE that can be
implemented using a mixture of GPP, ASIP and SPP. As illustrated in
Figure. 1.5, a digital camera is a camera that encodes images and videos
digitally and stores them for later reproduction. It performs a limited set of
functions such as capturing pictures, compressing images, storing frames,
decompressing and displaying frames, and uploading frames to another
device through a suitable I/O interface. Frank Vahid and Tony Givargis [5]
have discussed the pros and cons of four different design approaches to
the digital camera and compared three design metrics of interest namely
performance, power consumption and chip area. The first approach uses a
single GPP but could not meet the performance requirement. The other
three approaches give feasible solutions and the choice depends on the
target market segment.
Figure. 1.5: Block diagram of digital camera
1.3.1 IC Technology
Implementation of a processor on an integrated circuit (IC) can be
done with any of the three IC technologies: Full-custom/VLSI, Semicustom
ASIC (Gate array and standard cell) and Programmable Logic Device
(PLD). Any type of processor can be mapped [5] to any type of IC
technology, as illustrated in Figure. 1.6. The three IC technologies differ by
how customized the IC is for a particular design. The VLSI design has a
very high NRE cost and long turnaround time, typically several months; but
yields excellent performance with small size and power. It is suitable for
high volume or extremely performance-critical applications. The ASICs
provide good performance and size, with much less NRE cost than
full-custom ICs, with turnaround time on the order of weeks. PLDs are of
two types: Programmable Logic Array (PLA) and Programmable Array
Logic (PAL); the Field Programmable Gate Array (FPGA) is a newer type of
PLD. PLDs offer very low NRE cost and almost instant IC availability.
Bigger size than ASIC, higher unit cost, higher power consumption and
lower performance are the drawbacks of PLDs, but they are well suited for
rapid prototyping.
Figure. 1.6: The independence of processor and IC technologies (any of
the three processor types, general-purpose processor, ASIP or
single-purpose processor, can be mapped to any of the three IC
technologies, full custom, semi-custom or PLD; flexibility, NRE cost,
time-to-market and low-volume cost favour one end of the technology
range, while power efficiency, performance, size and cost favour the other)
1.4 PROCESSOR ARCHITECTURES AND INSTRUCTION ENCODING
Based on the type of internal storage inside the processor, the
instruction set architectures are classified into three types: stack
architecture, accumulator architecture and general-purpose register
architecture [3]. The stack architecture and the accumulator architecture
have become obsolete; Register-Memory Architecture (RMA) and
Load-Store Architecture (LSA) are the two popular versions of
general-purpose register architecture used in microprocessors.
1.4.1 CISC Vs RISC
Two different Instruction Set Architecture (ISA) styles [9] are
followed in present day computer systems: Complex Instruction Set
Computing (CISC) and Reduced Instruction Set Computing (RISC). In
practice, CISC processors use RMA whereas RISC processors use LSA.
Figure. 1.7: CISC scenario (the compiler translates the high-level language
source program into a machine-language object program with reference to
the instruction set; the CPU is complex, the instructions are powerful, the
instruction set is large, and the program in main memory is small)
Figure. 1.7 gives the overall view of a CISC system. A CISC has
powerful instructions and a large instruction set. In the early days of
mainframes, owing to the use of magnetic core memory, the cost of
memory was high. Since CISC architecture results in compact object code,
CISC processors were well accepted. IBM System/360, UNIVAC 1100,
HP 2100 and VAX-11 are some popular CISC systems. Developments in
IC technology gave more scope for implementing new concepts and
techniques that required more circuits in the CPU. After the invention of
semiconductor memories, the performance of memory improved and its
cost fell drastically, but memory is still relatively slower than the CPU. The
invention of the microprocessor resulted in low-cost systems.
Advancements in VLSI technology enabled inclusion of more circuits
inside the microprocessor for performing new functions such as instruction
pre-fetch and pre-decode, multitasking and virtual memory support. Since
CISC processors supported powerful and complex instructions, the control
unit was microprogrammed in order to simplify the design process, but the
microprogram memory increased the instruction execution time.
In the past, the general trend in computer architecture and
organization has been toward increasing processor complexity: more
instructions, more addressing modes, more specialized registers, and so
on [10]. The RISC concept represents a fundamental break from the CISC
philosophy. As part of the attempts to develop a faster processor, RISC
architecture (Figure. 1.8) was promoted eliminating complex instructions
and complex addressing modes. The major characteristics of initial RISC
processors [10] are simple instructions, small instruction set, uniform
instruction length, limited addressing modes, simple instruction formats,
load-store architecture, hardwired instruction decoder, large register count
and instruction pipelining.
The main advantages of RISC architecture are easy implementation
of the instruction pipeline, simplification of the instruction decoder circuitry,
and higher performance. Moreover, due to its simplicity, the time to develop
a RISC processor is shorter than that for a CISC processor.
The RISC vs CISC controversy has died down due to gradual
convergence of the technologies: RISC systems have become more
complex and CISC systems have introduced certain RISC-like features.
However, there is a need to take a fresh look at two features of the RISC
processor cores used in SoCs for BOPES, namely uniform instruction
length and load-store architecture, from the perspective of the increased
code size of RISC architecture, which impacts cost, size and power
consumption.
Figure. 1.8: RISC scenario (the compiler translates the high-level language
source program into a machine-language object program with reference to
the instruction set; the CPU is simple, the instructions are simple, and the
program in main memory is large)
As shown in Figure. 1.9, since the lexical level of a CISC is higher, a
CISC requires execution of fewer instructions (smaller bit traffic) than a
RISC does [11]. The CISC architecture moves the ISA upward, thereby
reducing the semantic gap that must be spanned by the compiler and
increasing the semantic gap spanned by the hardware. On the other hand,
RISC architecture increases the software semantic gap and decreases the
hardware semantic gap. The experiments by Bhandarkar and Clark [12]
established that the RISC processor has to execute twice the number of
instructions compared to a CISC processor for the same application
program.
Figure. 1.9: ISA Lexical Level (the ISA lies between software translation
from the high-level language and hardware translation down to gates; the
CISC ISA sits at a higher lexical level than the RISC ISA)
The 'code size bloating' problem of RISC processors is depicted in [2],
which shows the object code size of an MPEG2 encoder compiled on
multiple processors of different architectures. The Intel x86, a typical CISC
processor with register-memory architecture, needs 50.6 kB of code, while
the RISC processors Thumb and SHARC need 68.2 kB and 106.2 kB
respectively.
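Working out the bloat factors from these published figures is a simple
calculation over the reported sizes:

```python
# Object-code sizes (kB) of the MPEG2 encoder reported in [2].
sizes_kb = {"x86": 50.6, "Thumb": 68.2, "SHARC": 106.2}

# Code-size bloat relative to the CISC (x86) baseline.
bloat = {cpu: round(kb / sizes_kb["x86"], 2) for cpu, kb in sizes_kb.items()}
assert bloat == {"x86": 1.0, "Thumb": 1.35, "SHARC": 2.1}
```

That is, the Thumb code is about 1.35 times and the SHARC code about
2.1 times the size of the x86 code for the same application.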
1.4.2 Load - Store Architecture (LSA)
Most modern processors are based on load-store architecture. The
original objective [13] of choosing LSA for RISC was to simplify the
hardware and increase performance so as to meet the performance needs
of workstations and servers. Generally, RISC processors have three types
of instructions: ALU instructions, load and store instructions, and branch
and jump instructions. For ALU instructions, the operands are in
registers and the results are stored in registers. In load and store
instructions, one operand is in a register and the other operand is in
memory. The address of the memory operand is generally specified as
the sum of two parts: the base register contents and an offset in the
immediate field. In LSA, only load and store instructions can access
memory operands, and the arithmetic/logical instructions operate only
on register operands. Since arithmetic and logical operations on memory
operands are not permitted in LSA, the compiler has to place a load
instruction before an add instruction to move the data from memory to a
register. Similarly, the result of an add instruction is stored by the processor
in a register only, and hence a store instruction has to be placed by the
compiler after the add instruction to move the result to main memory.
This restriction results in many additional data transfer (load and store)
instructions. A comparison [3] of the distribution of arithmetic/logic
instructions and data transfer instructions for two benchmark programs on
the VAX and MIPS is shown in Table 1.5. The 50% to 100% increase in data
transfer instructions for the MIPS, compared to the VAX, is due to the use
of several load and store instructions in MIPS. The fixed instruction size of
RISC is another cause of code size increase.
Table 1.5: Extent of Data Transfer Instructions in CISC and RISC

Program   Processor   Arithmetic/logic instructions   Data transfer instructions
Gcc       VAX         40%                             19%
Gcc       MIPS        35%                             27%
Spice     VAX         23%                             15%
Spice     MIPS        29%                             35%
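The extra data-transfer instructions forced by LSA can be illustrated with a
toy "compiler" for the single statement dst = a + b. This is illustrative
Python with invented mnemonics and register names, not the instruction
set of any real processor:

```python
def compile_add_lsa(dst, a, b):
    """Instruction sequence a compiler must emit for dst = a + b on a
    load-store architecture: only LOAD/STORE may touch memory, so two
    loads and one store surround the ADD."""
    return [f"LOAD r1, {a}", f"LOAD r2, {b}",
            "ADD r3, r1, r2", f"STORE r3, {dst}"]

def compile_add_rma(dst, a, b):
    """On a register-memory architecture the ALU instruction can take a
    memory operand directly, so fewer data-transfer instructions appear."""
    return [f"LOAD r1, {a}", f"ADD r1, {b}", f"STORE r1, {dst}"]

lsa = compile_add_lsa("x", "y", "z")
rma = compile_add_rma("x", "y", "z")
# Three of the four LSA instructions are data transfers, versus two of three.
assert sum(i.startswith(("LOAD", "STORE")) for i in lsa) == 3
assert sum(i.startswith(("LOAD", "STORE")) for i in rma) == 2
```

Scaled over a whole program, this is the effect visible in Table 1.5: the
load-store MIPS spends a larger share of its instruction count on data
transfers than the register-memory VAX.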
1.4.3 Fixed, Variable and Hybrid Length Encoding
There are three choices for encoding the instruction set [3]: variable
length, fixed length and hybrid. The variable length encoding (Figure. 1.10)
allows all addressing modes for all operations and supports any number of
operands. It results in smallest object code since there are no unused
fields. In this type of instruction format, the instruction length varies on the
basis of the opcode and the address specifiers. The characteristics of the
variable instruction format are
(1) Difficult control design to compute next address
(2) Complex operations
(3) Slow due to several memory accesses
operation | address specifier | address field 1 | ... | address field n

Figure. 1.10: Variable Instruction Encoding Format
Processors that used variable instruction encoding include Intel
80x86 and VAX. The VAX offered excellent code density due to powerful
addressing modes, powerful instructions and efficient instruction encoding.
To reduce code size, the VAX permitted three different lengths of
addresses for displacement addressing - 8-bit, 16-bit and 32-bit addresses.
The fixed length encoding (Figure. 1.11) permits only single size for all
instructions. It always has the same number of operands and the
addressing mode is specified as part of the opcode. It results in the largest
code size. The characteristics of the fixed instruction format are
(1) Simple to decode
(2) Wastes code space because of fixed-length fields and simple
operations
(3) Helps easy implementation of pipelining
operation | address field 1 | address field 2 | address field 3

Figure. 1.11: Fixed Instruction Encoding Format
All RISC processors use fixed length encoding and some of these
are the ARM, MIPS, PowerPC, SuperH, Alpha and SPARC. A processor
architect more interested in code size than performance will choose
variable-length encoding, and an architect more interested in performance
than code size will choose fixed-length encoding. The hybrid length
encoding (Figure. 1.12) allows multiple formats specified by the opcode. In
other words, in hybrid length encoding, a processor supports multiple fixed
instruction lengths.
operation | address specifier | address field 1
operation | address specifier 1 | address specifier 2 | address field
operation | address specifier | address field 1 | address field 2

Figure. 1.12: Hybrid Instruction Encoding Format
IBM 360/370 and TI TMS320C54x are some of the processors using
the hybrid approach. Though hybrid-length encoding reduces the code size
compared to fixed-length encoding, instruction decoding becomes more
involved and there is some loss of performance. The Fixed Instruction
Encoding (FIE) of RISC processors enables simpler instruction decoding
and easy pipeline design [3]. But FIE increases the code size, as some
fields are either unused or underutilized in several instructions. Desktop
and server systems are not seriously affected by the large code memory
size since both the code memory and the data memory are external to
the processor chip in these systems. On the other hand, in most
embedded systems, the present trend is the use of SoCs wherein the code
memory is integrated with the processor and the other system hardware on
a single chip. This limits the space available for application memory in the
SoC architecture, and hence the need for compact code.
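The code-size effect of the two encodings can be sketched with a toy
calculation. The per-field byte costs below are assumptions chosen for
illustration, not the encoding of any real ISA:

```python
def size_fixed(instrs, width=4):
    """Fixed-length encoding: every instruction occupies `width` bytes,
    whether or not all of its fields are used."""
    return len(instrs) * width

def size_variable(instrs):
    """Variable-length encoding sketch: one opcode byte plus one byte
    per operand field, so short instructions take less space."""
    return sum(1 + n_operands for n_operands in instrs)

# A toy program given as operand counts per instruction.
program = [2, 1, 0, 3, 2]
assert size_fixed(program) == 20       # 5 instructions * 4 bytes
assert size_variable(program) == 13    # 5 opcode bytes + 8 operand bytes
```

A hybrid encoding would fall between the two: a small set of fixed widths,
with each instruction rounded up to the nearest supported width.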
1.5 MOTIVATION
The applications of computers and architecture of computer systems
have undergone rapid growth over the years. The demand for increased
performance has been met by several architectural innovations. Several
new areas of applications have given rise to new requirements other than
high performance. Embedded systems are one such area that has grown
rapidly. Though some types of embedded systems require high
performance, the majority of embedded systems are sensitive to size and
power consumption.
For the past three decades, there has been a steady increase in
computer performance, achieved by increasing the clock frequency or by
introducing overlap and parallelism. The clock frequency has reached its
peak, and designers have given up further efforts in this direction. In 2004,
Intel
cancelled a line of 4+ gigahertz processors due to difficulty controlling the
heat generated by such fast clock rates. Efforts towards instruction level
parallelism also changed direction from superscalar architecture to VLIW
architecture due to saturation in performance. Ultimately, the era of
advances in instruction level parallelism has come to an end, and instead,
processor level parallelism with multicore architecture has become the
current trend both for performance and parallelism. The density of
transistors on a single chip roughly doubles every two years, obeying
Moore's Law, a prediction made by Gordon Moore nearly 49 years ago
[14]. Additionally, logic is becoming less expensive in terms of area and
power consumption while communication is increasingly costly.
The RISC architecture was promoted in the days when
microprocessors had become complex, there were limitations on including
additional on-chip hardware, and embedded systems had a limited market
[12, 15, 16, 17]. Those were the days when the entire program memory
was external to the processor; hence the processor architecture was
viewed in isolation. Either processor performance or processor power
consumption alone was the main objective, rather than overall system
parameters. Today, as the entire BOPES is available as single SoC, it is
desired to have a processor architecture that contributes to the overall
benefits to the SoC in terms of chip space, power consumption and cost,
rather than optimising these parameters for the processor core in isolation.
The performance provided must not come at the expense of unreasonable
power consumption or chip space. Considering the rapidly expanding
BOPES market, the savings in cost, space and power consumption can
justify the investments in development of a new tool chain for the processor
architecture. Although several strategies and techniques for reducing cost,
space and power consumption have been successfully used in practice, the
proposed solution for code size reduction by modification to RISC
architecture can give all the three gains in a single stroke. This approach
eliminates any additional effort required on the part of the embedded
system developers. The resultant savings in code size and proportional
reduction in code memory space are highly relevant in SoC based
applications such as wearable devices, implantable medical devices,
surveillance devices etc. In modern embedded systems, area and power
consumed by the memory subsystems is 10 times that of the datapath [18,
19].
The memory subsystem forms a large part (typically up to 70%) of
the silicon area of the present-day SoC and is expected to go up to 94% in
2014, as shown in Figure. 1.13 [20]. The allocation of physical real estate
(die area) of typical large ASIC and SoC designs tends to fall into three
general groups: die area dedicated to new custom logic, die area dedicated
to reusable logic (3rd-party IP or legacy internal IPs), and die area used for
embedded memory.
As Figure. 1.14 shows, while companies continue to develop their
own key custom blocks that help to differentiate their chips in market (like
wireless DSP+RF for 802.11n, Bluetooth, and other emerging wireless
standards), and third-party IPs (such as USB cores, Ethernet cores, and
CPU/Micro-controller cores) occupy a fairly consistent percentage of die
area, the percentage of area used for embedded memory is increasing
dramatically. According to data from Semico Research, in 2013, the
majority of SoC ASIC designs allocate over 50% of their die area to various
embedded memories.
Figure. 1.13: Memory Trends in SoC
Figure. 1.14: Extent of Embedded memory in the die area in SoCs
Figure. 1.15: Multiple embedded memory IPs in multicore SoC
In addition, there is a wide variety in the purpose and ideal
characteristics of the many embedded memories in a large SoC, as seen in
Figure. 1.15. Consequently, it is very important for processor architects to
evolve a new ISA that suits the present trends in Embedded SoCs so as to
minimize code memory size.
1.6 RESEARCH OBJECTIVES
The overall aim of the research undertaken is to develop a set of
architectural changes to the RISC architecture for reducing code size in
SoC based Battery Operated Portable Embedded Systems (BOPES). The
goal of this thesis is to justify the need to replace the 'uniform instruction
size' feature by 'hybrid instruction size' in the embedded RISC cores used in
BOPES so as to minimize the code size for embedded programs. This
thesis proposes replacement of Fixed Instruction Encoding (FIE) with Hybrid Instruction Encoding (HIE)
with two modifications to RISC architecture: multiple instruction sizes, and
hybrid lengths for the offset and immediate fields. The provision for multiple
instruction sizes eliminates unused fields in most instructions thereby
reducing code size. Similarly, allowing hybrid lengths for the offset and
immediate fields minimizes wastage of bits in these fields.
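The bit wastage that motivates hybrid field lengths can be estimated with a few lines of code. The sketch below (in Python, with a hypothetical list of immediate values rather than data from the benchmark suite) counts how many bits of a fixed 16-bit immediate field carry no information:

```python
def min_bits(value):
    """Smallest two's-complement width that can hold a signed value."""
    width = 1
    if value >= 0:
        while value >= (1 << (width - 1)):
            width += 1
    else:
        while value < -(1 << (width - 1)):
            width += 1
    return width

def wasted_bits(immediates, field_width=16):
    """Total immediate-field bits left unused across a list of constants."""
    return sum(field_width - min_bits(v) for v in immediates)

# Hypothetical immediate values from a small code fragment.
sample = [0, 1, 4, -8, 100, 255]
print(wasted_bits(sample))  # 68 of the 96 field bits go unused
```

Applied over whole object codes, counts of this kind indicate what fraction of instructions could use a shorter immediate or offset field.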
Estimates of the code size savings and area reduction in SoCs
based on the proposed processor have been made. A code analysis cum
conversion suite has been developed and various tools of this custom built
suite are used in different phases of the research work. The present
research work has established that architectural modification leads to
reduction in the code size of over 44% for certain portable embedded
applications. Such a gain is highly significant in certain healthcare products
such as pacemakers and bio-medical multiprocessor SoC for neuropathic
applications.
1.7 CONTRIBUTIONS
With this research objective in mind, several investigations have
been carried out and the main contributions of this work are summarized
below.
Behaviour Analysis of Embedded Applications
A code analysis tool for profiling object codes of MIPS32 (a typical
RISC processor) is developed. Analysis of RISC object codes for 23
embedded applications is performed using the code analyser tool to
determine the strategy for minimising the code size.
Design of Hybrid Instruction Encoding for RISC Processors
Two versions of Hybrid Instruction Encoding (HIE) are designed for
supporting multiple instruction sizes and hybrid lengths for the offset and
the immediate fields. For each of the 66 integer instructions of MIPS32, an
equivalent HIE instruction has been designed.
Design of Register Memory Architecture (RMA) for RISC Processor
This part of the research work involves design of 12 RMA ALU
instructions, each of which replaces a sequence of two RISC instructions,
for MIPS processor. The traditional RISC pipeline sequence is rearranged
to suit both LSA and RMA instructions.
Design of Embedded RISC Processor
The fourth part of the research work deals with designing a hybrid
processor incorporating both HIE and RMA. The embedded object codes of
MIPS32 are recoded to the HIE-RMA processor using the custom built
code converter so that the code size reduction is measured. Further,
additional code reduction is explored with use of compound instructions
and composite instructions.
Developing Static Simulator for HIE / RMA
To estimate the efficiency of the HIE and RMA for RISC processor, a
code converter tool, MIPS Instruction Distribution Analyser cum Code
Converter (MIDACC), has been developed. The MIDACC converts the
object codes from MIPS ISA to HIE/RMA-ISA.
1.8 THESIS OVERVIEW
The rest of the thesis is organized as follows. The following chapter
provides the background material for the thesis. It begins by presenting an
overview of different types of embedded processors. It describes the
various attributes of Instruction Set Architecture (ISA) and discusses the
processor performance and high performance architectural features. Finally,
an overview of techniques for designing for low power consumption is
presented.
In Chapter 3, an analysis of the object codes of 23 embedded
applications from MiBench and MediaBench benchmarks is provided. The
behavioral pattern of the MIPS object codes of the 23 embedded applications,
obtained using MIDA, the custom built code analyzer tool for MIPS32 object
codes, is discussed. Apart from measuring the static instruction frequencies, this
tool calculates the total amount of under-utilization of the offset and immediate
fields in the object codes.
Hybrid Instruction Encoding for RISC processors is addressed in
Chapter 4. Here, two different HIE designs for MIPS processor are
proposed and the embedded domains suitable for each of these are
identified.
Chapter 5 presents the register-memory architecture for the MIPS
processor. The hardware redesign required to support the RMA at the
microarchitectural level and the impact of RMA on processor performance
are addressed.
In Chapter 6, the design of an embedded core using both HIE and
RMA concepts is explored. Integration of HIE with RMA along with use of
compound instructions and composite instructions in HIE2 code is done to
evaluate the overall code reduction. This chapter outlines relevant
microarchitecture requirements and the effectiveness of such a processor in
the present scenario of multicore SoCs for battery-powered handheld
embedded systems. Finally, this chapter summarizes the research work and
discusses limitations and possibilities for further research.
2. BACKGROUND AND RELATED WORK
This chapter provides the necessary background information that is
useful to understand the main contributions of the thesis. The following
section presents a brief discussion on techniques for designing for low
power consumption. Section 2.2 describes the various attributes of
Instruction Set Architecture (ISA). Section 2.3 discusses processor
performance and high performance architectural features. In section 2.4, an
overview of different types of embedded processors is presented.
Section 2.5 deals with the architectural aspects of the embedded systems.
Section 2.6 presents an overview of emergence of different RISC
processors including MIPS. Section 2.7 explains how MIPS32 instructions
waste bits. In section 2.8, various techniques followed for embedded code
size reduction are reviewed. Finally, the need for a new and dedicated ISA
for Embedded SoCs is elaborated in section 2.9.
2.1 DESIGN FOR LOW POWER CONSUMPTION
Power dissipation and energy efficiency are primary design
constraints for both simple and complex processors. As a result of the
growing market for battery-powered portable embedded systems, the drive
for minimum power consumption has become equally important as the
drive for increased performance. Power consumption in processors
consists of a static component, called leakage power, and a dynamic
component, called switching power. The total power consumption of CMOS
circuit comprises three components [21]:
1. Switching power: This is the power dissipated by charging and
discharging the gate output capacitance, CL, and represents
the useful work performed by the gate. The energy per output
transition is given by the following equation where Vdd is power
supply voltage:
Et = ½ CL Vdd² ≈ 1 picojoule        (2.1)
2. Short-circuit power: When the gate inputs are at an intermediate
level, both the p- and n-type networks can conduct. This results
in a transitory conducting path from Vdd to Vss. In a careful design
that avoids slow signal transitions, the short-circuit power is
usually a small fraction of the switching power.
3. Leakage current: The transistor networks do conduct a very
small current when they are in their 'off' state. Though it is
generally negligible in an active circuit, it can drain a supply
battery over a long period of time.
In a well designed active circuit, the switching power dominates, with
the short-circuit power forming 10% to 20% of the total power, and the
leakage current being significant only when the circuit is inactive.
Therefore, the total power dissipation, Pc, of a CMOS circuit, neglecting the
short-circuit and leakage components, is given by summing the dissipation
of every gate g in the circuit C:
PC = ½ · f · Vdd² · Σ(g ∈ C) Ag · CLg        (2.2)

where f is the clock frequency, Ag is the gate activity factor (reflecting the fact
that not all gates switch every cycle) and CLg is the load capacitance of gate g.
The typical gate load capacitance is a function of the process
technology and therefore not under the control of the designer. The
remaining parameters in the equation suggest following approaches to low-
power design:
1. Minimize the power supply voltage, Vdd.
2. Minimize the circuit activity, A. Techniques such as clock gating
fall under this heading.
3. Minimize the number of gates. Simpler circuits use less power
than complex ones, all other things being equal.
4. Minimize the clock frequency, f. Although a lower clock rate
reduces the power consumption, it also reduces performance,
having a neutral effect on power-efficiency. If, however, a
reduced clock frequency allows operation at a reduced Vdd, this
will be highly beneficial to the power-efficiency.
5. Exploit parallelism. Duplicating a circuit allows the two circuits
to sustain the same performance at half the clock frequency of
the original circuit, which allows the required performance to be
delivered with a lower power supply voltage.
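Equation (2.2) can be evaluated directly. The following Python sketch (with invented activity factors and load capacitances, purely for illustration) computes the switching power of a tiny circuit and shows the quadratic benefit of lowering Vdd:

```python
def switching_power(f_hz, vdd, gates):
    """Equation (2.2): 0.5 * f * Vdd^2 * sum of Ag * CLg over all gates."""
    return 0.5 * f_hz * vdd ** 2 * sum(a * c for a, c in gates)

# Hypothetical circuit: (activity factor, load capacitance in farads) per gate.
gates = [(0.1, 10e-15), (0.2, 10e-15), (0.05, 20e-15)]
p_1v0 = switching_power(1e9, 1.0, gates)  # 1 GHz clock, 1.0 V supply
p_0v8 = switching_power(1e9, 0.8, gates)  # same circuit at 0.8 V
print(p_0v8 / p_1v0)  # about 0.64: power scales with the square of Vdd
```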
Although static leakage power has historically been small compared
to dynamic switching power, the situation is changing as the feature sizes
decrease. The feature size of a chip process technology refers to the
smallest size of transistors, wires, or gaps between them that can be
created on the chip die with that process technology. As these sizes
decrease, the capacitance of the system of transistors, CLg, is lowered. This
reduced capacitance decreases the switching time of these transistors (or
gate delay), resulting in faster logic performance accommodating faster
processor clock frequencies. The gate activity factor approximates the
average switching activity of the circuit for each clock edge. The supply
voltage, Vdd, is lowered to reduce interference with the ever-closer
neighbouring components and to meet thermal requirements. Lowering Vdd
greatly reduces dynamic power consumption since the dynamic power is
proportional to the square of this supply voltage. However, lowering the
supply voltage in turn often requires a lowering of the threshold voltage, the
voltage level at which transistors switch, to maintain fast clock rates.
Lowering the threshold voltage and moving the threshold closer to ground
causes a disproportionate increase in the static leakage current and thus
an increase in static power consumption [22].
For a fixed task, decreasing the clock rate reduces the power, but
not the energy. The energy to execute a workload is equal to the average
power multiplied by the execution time for the workload. For BOPES
devices, battery life is more important than actual power consumption.
Hence energy is the proper metric.
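This distinction is easy to illustrate. In the hypothetical comparison below, halving the clock halves the average power but doubles the run time, so the battery drain for the task is unchanged:

```python
def energy_joules(avg_power_w, exec_time_s):
    """Energy for a workload: average power multiplied by execution time."""
    return avg_power_w * exec_time_s

# A fixed task at full clock (2 W for 10 s) versus half clock (1 W for 20 s).
e_fast = energy_joules(2.0, 10.0)
e_slow = energy_joules(1.0, 20.0)
print(e_fast, e_slow)  # both 20.0 J: the battery sees the same drain
```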
2.2 INSTRUCTION SET ARCHITECTURE (ISA)
The features that are built into an architecture's instruction set are
commonly referred to as the Instruction Set Architecture or ISA. The ISA
defines such features as the operations that can be used by the
programmers to create programs under that architecture, the operands
(data) that can be accepted and processed by the architecture, the storage, the
addressing modes used to gain access to and process operands, and
handling of interrupts. These features are important because an ISA
implementation is a determining factor in defining important characteristics
of an embedded design, such as performance, design time, available
functionality, and cost. In the embedded domain, it used to be true that
minimizing gates was the most important consideration of an ISA design
[7]. This is what led to many of the idiosyncrasies of early DSP designs.
Advances in VLSI technologies have changed this, and most of the
embedded world can now afford enough complexity to allow much more
regular and orthogonal instruction sets.
2.2.1 Instruction Types and Operations
The following information is provided either directly or indirectly by
an instruction [9]:
1. Operation code (opcode): Nature of operation done by the
instruction
2. Data: Type of data - binary, decimal, character etc.
3. Operand location: Memory, register etc.
4. Operand addressing: Method of specifying the operand location
(address)
5. Instruction length: Size - one byte, two bytes etc.
6. Number of address fields: zero address, single address, two
address etc.
Two computers of different architectures do not have the same
instruction set. Almost every architecture provides certain unique
instructions that ease the burden of compiler/programmer or the hardware
design. Based on the operations performed by the instructions, it is
common to classify the instructions into following types:
1. Data transfer instructions: These move data from one
register/memory location to another.
2. Arithmetic instructions: These perform arithmetical operations.
3. Logical instructions: These perform Boolean logical operations.
4. Control transfer instructions: These modify program execution
sequence.
5. Input/output (I/O) instructions: These transfer information
between external peripherals and system nucleus
(CPU/memory)
6. String manipulation instructions: These manipulate strings of
byte, word, double word etc.
7. Translate instructions: These convert the data from one
format to another.
8. Processor control instructions: These control the processor
operation.
Table 2.1 lists sample instructions for each of the above eight types
and corresponding actions done by the processor for these instructions.
Table 2.1: Sample Instructions and processor actions
Instruction
Type Specific Instruction examples and processor actions
Data transfer Instruction Action by processor
MOVE Transfer data from source location to
destination location
LOAD Transfer data from a memory location to a
CPU register
STORE Transfer data from a CPU register to a
memory location
PUSH Transfer data from the source to stack (top)
POP Transfer data from stack (top) to the
destination
XCHG Exchange; swap the contents of the source
and destination
CLEAR Reset the destination with all 0's
SET Set the destination with all 1's
Table 2.1 (Continued)
Instruction
Type Specific Instruction examples and processor actions
Arithmetic Instruction Action by processor
ADD Add; calculate sum of two operands
ADC Add with carry; calculate the sum of
operands and the 'carry' bit
SUB Subtract; calculate the difference of two
numbers
SUBB Subtract with borrow; calculate the
difference with 'borrow'
MUL Multiply; calculate the product of two
operands
DIV Divide; calculate the quotient and
remainder of two numbers
NEG Negate; change sign of operand
INC Increment; add 1 to operand
DEC Decrement; subtract 1 from operand
SHIFTA Shift arithmetic; shift the operand
(left or right) with sign extension
Logical Instruction Action by processor
NOT Complement the operand
OR Perform bit-wise logical OR of operands
AND Perform bit-wise logical AND of operands
XOR Perform bit-wise 'exclusive OR' of operands
SHIFT Shift the operand (left or right) filling the
empty bit positions as 0's
ROT Rotate; shift the operand (left or right) with
wrap-around
TEST Test for specified condition and set or reset
relevant flags
Table 2.1 (Continued)
Instruction
Type Specific Instruction examples and processor actions
Control
transfer
Instruction Action by processor
JUMP Branch; enter the specified address into
Program Counter (PC)
JUMPIF Branch on condition; enter the specified
address into PC only if the specified
condition is satisfied; conditional transfer
JUMPSUB CALL; save current 'program control
status' (into stack) and then enter the
specified address into PC
RET RETURN; unsave (restore) 'program
control status' (from stack) into PC and other
relevant registers and flags
INT Interrupt; create a software interrupt; save
'program control status' (into stack) and
enter the address corresponding to the
specified vector code into PC
IRET Interrupt return; restore (unsave) 'program
control status' (from stack) into PC and other
relevant registers and flags
LOOP Iteration; decrement the implied register by 1
and test for non-zero; if satisfied, enter the
specified address into PC
Table 2.1 (Continued)
Instruction
Type Specific Instruction examples and processor actions
Input-output Instruction Action by processor
IN Input; read data from the specified input port /
device into specified or implied register
OUT Output; write data from specified or implied
register into an output port/device
TEST I/O Read the status from I/O subsystem and set
condition flags (codes)
START
I/O
Inform the I/O processor (or the data channel)
to start the I/O program consisting of
commands for the I/O operations
HALT I/O Inform the I/O processor (or the data
channel) to abort the I/O program
consisting of commands for the I/O
operations under progress
String
manipulation
Instruction Action by processor
MOVS Move byte or word of string
LODS Load byte or word of string
CMPS Compare byte or word of strings
STOS Store byte or word of string
SCAS Scan byte or word of string
Translate Instruction Action by processor
XLAT Translate; convert the given code into
another by table lookup
PACK Convert the unpacked decimal number into
packed decimal
UNPACK Convert the packed decimal number into
unpacked decimal
Table 2.1 (Continued)
Instruction
Type Specific Instruction examples and processor actions
Processor
control
Instruction Action by processor
HLT Halt; stop instruction cycle (processing)
STI (EI) Set/enable interrupt; sets interrupt enable
flag to '1', so as to allow maskable interrupts
CLI (DI) Clear/disable interrupt; resets interrupt
enable flag to '0' so as to ignore maskable
interrupts
WAIT Freeze instruction cycle till a specified
condition, such as an input signal becoming
active, is satisfied
NOOP No operation; no action
ESC Escape; the next instruction after the ESC is
to be skipped since it is meant for the
coprocessor
LOCK Reserve the bus, and hence the memory,
till the next instruction, following the LOCK
instruction, is executed/completed
CMC Complement 'carry' flag
CLC Clear 'carry' flag
STC Set 'carry' flag
2.2.2 Operation codes
There are a number of ways to allocate opcodes to instructions
[11]. The design issue is to reduce the number of bits in the instruction
(small bit budget) while providing a large number of opcodes for a rich
instruction set. Following three design techniques have been used to meet
these requirements:
1. A fixed-length opcode allocated to variable length instructions as in
IBM S370 (Figure. 2.1)
2. A variable-length opcode provided by opcode expansion, allocated
in variable-length instructions as in Intel x86 (Figure. 2.2)
3. A variable-length opcode provided by opcode expansion, allocated
in a fixed-length instruction as in MIPS32 (Figure. 2.3).
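The third technique, used by MIPS32, can be sketched as a two-level decode: a reserved primary opcode value expands into a secondary function field within the same fixed-length word. The Python fragment below illustrates the idea (field positions follow the MIPS32 R-type layout; the sample words encode ADDU $1,$2,$3 and an ADDI):

```python
def decode_opcode(word):
    """Two-level opcode expansion in a fixed 32-bit instruction word:
    a primary opcode of 0 (an 'escape' value) selects an R-type
    instruction whose real operation is in the low 6-bit funct field."""
    primary = (word >> 26) & 0x3F       # bits 31..26
    if primary == 0:
        return ("R-type", word & 0x3F)  # opcode expanded into funct
    return ("I/J-type", primary)

print(decode_opcode(0x00430821))  # ('R-type', 33): ADDU via funct 0x21
print(decode_opcode(0x20010005))  # ('I/J-type', 8): ADDI via primary opcode
```

The escape value costs one primary opcode but buys 64 additional R-type operations, all within a fixed 32-bit word.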
2.2.3 Addressing modes
Addressing mode is the method by which the location of an
operand is specified within an instruction. Table 2.2 defines popular
addressing modes. A given ISA may not support all the addressing modes.
Table 2.2: Addressing modes and mechanisms
Addressing
mode Mechanism Remarks/examples
Implied
addressing
Operand address is not specified
explicitly
RET and IRET
Immediate
addressing
Operand is given in the
instruction
Fast operand fetch
but operand size is
limited as it increases
instruction length
Direct
addressing
(Absolute
addressing)
Operand is in a memory location;
its address is given in the
instruction
One memory access
required to get the
operand
Indirect
addressing
Operand is in a memory location;
its address is also in memory;
address of the location
containing the operand address
is given in the instruction
Two memory
accesses are
required to get the
operand
Table 2.2 (Continued)
Addressing
mode Mechanism Remarks/examples
Register
direct
addressing
Operand is in a register; the
register address/number is given
in the instruction
Faster operand fetch
compared to direct
addressing
Register
indirect
addressing
Operand is in memory; its
address is in a register;
address/number of the register is
given in the instruction
Faster operand fetch
than indirect
addressing
Base
register
addressing
Operand is in memory; its
address is specified in two parts;
the instruction gives an offset
number and also specifies the
base register; the offset (integer
number) has to be added to the
base register contents
Useful in relocation
of programs
PC-relative
addressing
Similar to base register
addressing, but the register
always being the PC
Mostly used by
branch instructions
Index
addressing
The operand is in memory; the
instruction gives an address, and
the index register contains an
offset number; the address and
the offset number are added to
get the operand address
Convenient for
indexing arrays
Figure. 2.1: IBM S370 Instruction Formats
Figure. 2.2: INTEL Pentium Pro Instruction Formats
Figure. 2.3: MIPS32 Instruction Formats
2.2.4 Data types
Application programs may use various types of data depending on
the problem. A machine language program can operate either on numeric
data or on non-numeric data. The numeric data can be either binary or
decimal number. The non-numeric data can be any of the following types:
characters, addresses, and logical data. All non-binary data is represented
inside a computer in the binary coded form. The binary data can be
represented either as a fixed-point or a floating-point number. In fixed-point
number representation, the position of a binary number is rigidly fixed in
one place. In floating-point number representation, the binary point's
position can be anywhere. The fixed-point numbers are known as integers
whereas the floating-point numbers are known as real numbers. Arithmetic
operations on fixed-point numbers are simple and they require minimum
hardware circuits. The floating-point arithmetic is complex and requires
extensive hardware circuits. Compared to fixed-point numbers, the floating-
point numbers have two advantages:
1. The maximum or minimum value that can be represented in
floating-point number representation is higher. Hence it is
useful in dealing with very small or very large numbers.
2. The floating-point number representation leads to better
accuracy in arithmetic operations.
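The range advantage can be seen by comparing two 32-bit representations, sketched below in Python (IEEE-754 single precision is assumed for the floating-point side):

```python
import struct

# Largest value in a 32-bit signed fixed-point (integer) word.
int_max = 2 ** 31 - 1                                     # about 2.1e9

# Largest finite IEEE-754 single-precision value, built from its bit pattern.
float_max = struct.unpack(">f", b"\x7f\x7f\xff\xff")[0]   # about 3.4e38

print(float_max / int_max)  # the float covers a vastly larger range
```

The wider range comes at the cost of representing only a fixed number of significant digits, which is why floating-point hardware is the more complex of the two.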
2.2.5 ISA Models
There are several different ISA models that architectures are based
upon, each with its own specifications for the various features. The most
commonly implemented ISA models are application-specific, general
purpose and instruction level parallel. Application-Specific ISA Models
define processors that are intended for specific embedded applications,
such as processors made only for TVs. General-purpose ISA models are
typically implemented in processors targeted to be used in a wide variety of
systems, rather than only in specific types of embedded systems. CISC
model and RISC model are the common types of general-purpose ISA
architectures implemented in embedded processors. Many current
processor designs fall under the CISC or RISC category primarily because
of their heritage. RISC processors have become more complex, while CISC
processors have become more efficient to compete with their RISC
counterparts, thus blurring the line between the definition of a RISC versus
a CISC architecture. Technically, these processors have both RISC and
CISC attributes, regardless of their definitions. Instruction-level parallelism
ISA architectures are similar to general-purpose ISAs, except that they
execute multiple instructions in parallel, as the name implies. Examples of
instruction-level parallelism ISAs [9] include SIMD model, Superscalar
model, and VLIW model.
2.3 PROCESSOR PERFORMANCE AND ADVANCED ARCHITECTURES
The performance of a processor is measured by the amount of time
taken by the processor to execute a program. The processor performs an
instruction cycle for each instruction. Table 2.3 illustrates the actions taken
at various steps of the instruction cycle for ADD instruction. Elementary
operations performed by the processor during instruction cycle execution
are known as micro-operations. A given micro-operation takes place when
the corresponding control signal is issued by the processor. Table 2.4
illustrates some sample micro-operations performed by the processor. The
time taken for executing different instructions is not the same. Hence the
type of instructions executed in a program and the number of instructions
executed by the processor, while running the program, decide the time
taken by the processor to execute a program.
Table 2.3: Instruction cycle steps and actions for ADD instruction
Sl.
No. Step
Action
responsibility Remarks
Parameter
affecting
performance
1 Instruction
fetch
Control unit;
external action
Fetches next
instruction from
main memory
memory
access time
2 Instruction
decode
Control unit;
internal action
Analyses opcode
pattern in the
instruction and
identifies the exact
operation specified
decode time
3 Operand
fetch
Control unit:
external
(memory) or
internal action
depending on
the location of
operands
Determines the
operand addresses
and then fetches
the operands, one
by one, from main
memory or CPU
registers and
supply them to ALU
(1) operand
address
calculation
time
(2) Register/
memory
access time
4 Execute
(ADD)
ALU; internal
action
Specified arithmetic
operation is done
Addition time
5 Result
store
Control unit;
external or
internal action
Stores the result in
memory or
registers
Register/
memory
access time
Table 2.4: Sample micro-operations
Sl.
no.
Control
signal
Micro-operation Remarks
1 MAR← PC Contents of PC are copied
(transferred) to Memory Address
Register (MAR)
The first micro-
operation in
instruction fetch
2 PC← PC + 4 Contents of PC are incremented
by 4
The PC always
points to next
instruction
address
3 IR ←MBR Contents of Memory Buffer
Register (MBR) are copied to
Instruction Register (IR)
The last micro-
operation in
instruction fetch
4 MBR ←R2 Contents of R2 register are
copied to MBR
The first micro-
operation in result
store
The following equation is commonly used for expressing a
computer's performance ability:
time/program = (time/cycle) × (cycles/instruction) × (instructions/program)        (2.3)
In other words, the execution time is given by the following equation:
Tp = Nie × CPI / F        (2.4)
where Nie is the number of instructions executed (and not the number of
instructions present in the program), CPI is the average number of clock
cycles needed for an instruction, and F is the clock frequency. The CISC
approach attempts to minimize the number of instructions per program,
sacrificing the number of cycles per instruction. RISC does the opposite,
reducing the cycles per instruction at the cost of the number of instructions
per program.
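Equation (2.4) makes the tradeoff concrete. The Python comparison below uses invented instruction counts and CPI values solely to illustrate how the two terms pull in opposite directions:

```python
def exec_time(n_exec, cpi, f_hz):
    """Equation (2.4): Tp = Nie * CPI / F."""
    return n_exec * cpi / f_hz

# Hypothetical workload on two 100 MHz designs:
t_risc = exec_time(1_200_000, 1.2, 100e6)  # more instructions, lower CPI
t_cisc = exec_time(800_000, 2.5, 100e6)    # fewer instructions, higher CPI
print(t_risc, t_cisc)  # t_risc < t_cisc for these assumed numbers
```

Which side wins depends entirely on how much the instruction count grows relative to how much the CPI falls.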
For any specific computer, there are two simple measurements that
give us an idea about its performance:
1. Response time or execution time: This is the time taken by the
computer to execute a given program – from the start to the
end of completion of the program. The response time for a
program is different for different computers.
2. Throughput: This is the work done (total number of programs
executed) by the computer during a given period of time.
2.3.1 Instruction Pipelining
In a simple processor (scalar, non-pipelined), the steps of an
instruction cycle are performed sequentially, one after the other, and
execution of successive instructions is also done sequentially, one after
the other. Instruction pipelining (Figure. 2.4) is a technique in which the
execution of successive instructions is overlapped. The goal is to
increase the total number of instructions executed in a given period of time.
In a pipelined processor, different sections of the processor perform
different steps of the instruction cycle for different instructions at a given
time. Each step is called a pipe stage. All the pipe stages together form a
pipe.
Figure. 2.4: A six stage instruction pipeline
In a six stage instruction pipeline, six instructions can be active
simultaneously. If it is assumed that all instructions are independent of
other instructions, then for each clock cycle, one instruction can be
completed due to overlap of instruction cycles of consecutive instructions.
In practice, three types of hazards - data, structural, and control - reduce
the pipeline efficiency [9].
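Under the independence assumption above, the cycle counts are simple to model. The Python sketch below reproduces the textbook formula (n_stages + n_instr − 1 cycles for an ideal pipe):

```python
def cycles_nonpipelined(n_instr, n_stages):
    """Each instruction occupies the whole datapath for n_stages cycles."""
    return n_instr * n_stages

def cycles_pipelined(n_instr, n_stages, stalls=0):
    """The pipe fills in n_stages cycles, then ideally retires one
    instruction per cycle; hazard stalls add extra cycles."""
    return n_stages + (n_instr - 1) + stalls

# Six independent instructions on a six-stage pipeline.
print(cycles_nonpipelined(6, 6))  # 36 cycles without overlap
print(cycles_pipelined(6, 6))     # 11 cycles with overlap
```

Each hazard-induced stall erodes the ideal one-instruction-per-cycle rate, which is why the three hazard types matter so much in practice.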
Dependencies between instructions are a property of programs. If
two instructions are dependent, they should not be executed
simultaneously. They may be partially overlapped. Two instructions may be
either directly data dependent or indirectly data dependent through another
instruction due to chain of dependencies. In case of dependence, there are
two possible solutions:
1. Preserving the dependence but preventing a hazard
2. Removing the dependence by transforming the object code.
Techniques used for detecting and preventing hazards should
preserve program order so that the overall behaviour and results of the
program are not affected.
2.3.2 RISC Instructions and Pipelining
Though pipelining can be implemented in both CISC and RISC types
of processors to enhance performance, it is simpler to design a pipelined
RISC processor. The following properties of RISC architecture help in
simplifying the pipeline design:
1. All instructions are of equal size, say 4 bytes.
2. Instruction formats are not many; just 1 to 3.
3. Arithmetic and other operations on data always have operands
(data) in registers (not in memory).
4. Only load and store instructions can access memory.
Generally, RISC processors have three types of instructions: ALU
instructions, load and store instructions, and branch and jump
instructions. In ALU instructions, the operands are available in registers,
and on completion the results are stored back in registers. In load and store
instructions, one operand is in a register and the other is in memory.
The address of the memory operand is generally specified as the sum of
two parts: the base register contents and the offset given by the
immediate field of the instruction. In branches and jumps, the branch
condition is usually specified in one of two ways:
1. Comparison of two items in registers
2. Condition bits or condition codes
Unconditional jumps are present in almost all RISC processors.
The traditional RISC pipeline has five stages, as shown in Figure. 2.5 (a).
Figure. 2.5 (b) shows the timing diagram while executing six instructions
over 10 clock cycles. Figure. 2.5 (c) shows the RISC pipeline as a series of
datapaths shifted in time.
Figure. 2.5 (a): Five stage pipeline
Figure. 2.5 (b): Timing Diagram
CC - Code Cache (instruction memory); R - Registers; ALU - Arithmetic Logic Unit; DC - Data Cache (data memory)
[Diagram: six instructions in program execution sequence, each flowing through the CC, R, ALU, DC and R stages, overlapped across clock cycles 1 to 10.]
Figure. 2.5 (c): RISC Pipeline as a series of datapaths
Tradeoffs in microarchitecture have changed somewhat since the
RISC five-stage pipeline [7]. In the early RISC days, transistor count
limitations led designers to reuse the ALU for address
computations. Today, transistors are almost free but wires are
expensive. Each additional pipeline stage has a marginal benefit, in that
spreading the work over smaller steps may allow a lower cycle time,
and a marginal cost in added design complexity and global
overheads. Table 2.5 defines the clock cycles, the corresponding phases of
the instruction cycle and the micro operations. The actual number of clock
cycles required for different instructions is as follows:
Unconditional branch instruction: 2 (cycles 1 and 2)
Store instruction: 4 (cycles 1 to 4)
Any other instruction: 5 (cycles 1 to 5)
There are many alternate design options offering varying
performance levels. The designer chooses the best option taking into
account the hardware cost and required performance level.
There are two major problems in a practical pipeline:
1. Resource Conflict: Two different operations at two
sections/stages may need the same hardware resource in the
same clock cycle, due to overlapping of instructions. To resolve
this, multiple resources of the same type can be provided in the
hardware. This will increase the cost and hence should be
done judiciously.
2. Interference between adjacent stages: Two instructions in
different stages of the pipeline should not interfere with each
other. To resolve this, pipeline registers are used between
successive stages of the pipeline. The pipeline registers are
named after the stages they link, such as IF/ID,
ID/EX, EX/MEM and MEM/WB. The result of any specific stage
is stored in the pipeline register at the end of a clock cycle.
During the next clock cycle, the contents of the pipeline register
serve as input to the next stage. In some cases, the result
generated by one stage may not be used as input to the next
stage. It may propagate through more than one stage. For
example, for a STORE instruction, the result is produced in the
ID stage but it is stored in memory only in the MEM stage.
Table 2.5: Typical instruction cycle phases in RISC processors

1. Clock cycle 1 - Instruction Fetch (IF)
   Major micro operations: (a) send PC contents to memory; (b) fetch the current instruction from memory; (c) increment PC by 4 to indicate the next instruction address.
   Hardware sections involved: cache memory.

2. Clock cycle 2 - Instruction Decode (ID), plus register read cycle
   Major micro operations: (a) decode the instruction; (b) read the contents of the source registers; (c) compare the contents of registers (in preparation for certain instructions such as compare).
   Hardware sections involved: instruction decoder; registers; adder/comparator.

3. Clock cycle 3 - Execution (EX), plus effective address cycle
   Major micro operations: (a) for an ALU instruction, the specified operation is done by the ALU; (b) for a memory reference instruction (load/store), the effective address is calculated by the ALU by adding the base register contents and the offset; (c) for a branch instruction, the branch condition is tested.
   Hardware sections involved: ALU.

4. Clock cycle 4 - Memory Access (MEM), plus branch completion
   Major micro operations: (a) for a load instruction, a memory read from the effective address is done; (b) for a store instruction, a memory write at the effective address stores the contents of the source register; (c) for a branch instruction, the branch address is entered in the PC if the branch occurs.
   Hardware sections involved: cache memory.

5. Clock cycle 5 - Write-back (WB)
   Major micro operations: the result is stored in the destination register for load and ALU instructions.
   Hardware sections involved: registers.
2.3.3 Superscalar processor
In a scalar pipelined processor, though there are multiple
instructions simultaneously active in the pipeline, there is only one
execution unit/functional unit. Hence at a given time, only one instruction
can be in the execution unit. In a superscalar architecture, there are
multiple pipelines in the processor and hence two or more instructions can
be executed simultaneously. In other words, a superscalar processor can
execute the same type of operation (add, shift, etc.) simultaneously, in a
single clock cycle, on multiple pipelines for different instructions. Figure. 2.6
shows the organization of a superscalar processor with two pipelines [9]. In
some superscalar processors, instruction sequencing is static (done at
compilation time), but in the majority of superscalar processors it is
dynamic (done at run time). In a dynamic superscalar processor the control
unit is complex, whereas in a static superscalar processor the compiler is
complex.
2.3.4 Very Long Instruction Word (VLIW) Processor
The VLIW architecture exploits Instruction Level Parallelism (ILP)
with close cooperation between the compiler and the processor. The
processor has multiple functional units similar to a dynamic superscalar
processor but scheduling is done by the compiler that groups several
independent operations into a very long instruction word. Each VLIW has
multiple fields or slots, each slot containing one RISC-like operation. Each
operation corresponds to a functional unit. During the execution of a VLIW,
the processor performs all the operations in parallel in different functional
units. Figure. 2.7 illustrates the principle of a VLIW processor [9].
[Diagram: the IF unit fetches two instructions per cycle from a unified cache into an instruction queue; a decode and dispatch unit issues the odd instruction to execution unit EU-1 and the even instruction to EU-2, each with its own OF (operand fetch), EX (execute) and SR (store results) stages sharing the registers; results pass through write buffers to the cache memory, which connects to main memory over the system bus.]
IF - Instruction Fetch; OF - Operand Fetch; EX - Execute; SR - Store Results; EU - Execute Unit
Figure. 2.6: Superscalar Processor Organisation
[Diagram (a), inside the VLIW processor: a 256-bit VLIW fetched from the instruction cache memory fills the instruction register (IR); its slots (add, mul, load, store, cmp, branch, mulfl, addfl) drive the functional units (FUs) - integer ALUs, integer MUL/DIV, two memory addressing units (MAU 1, MAU 2), a branch unit, and floating-point MUL/DIV and adder - which share the integer and floating point register files (RFs) and reach memory through the bus interface, data cache and system bus. Diagram (b): one 32-bit operation, e.g. add R1, R2, occupies one slot of the 256-bit VLIW.]
IR - Instruction Register; FU - Functional Unit; RF - Register File; INT - Integer; MAU - Memory Addressing Unit
Figure. 2.7: VLIW Processor Organisation
2.3.5 Cache Memory
The cache memory is a small and fast intermediate buffer between
the processor and the main memory with the objective of reducing the
processor's waiting time during main memory access. The presence of
cache memory is not known to application programs. Figure. 2.8 illustrates
the use of cache memory.
Figure. 2.8: Use of Cache memory
The main memory is conceptually divided into many blocks, each
containing a fixed number of consecutive locations. The cache memory is
organized as a number of lines, and the size of each line is the same as the
capacity of a main memory block. The cache operation is based on locality
of reference [23], a property inherent in programs: most of the time, the
instructions or data needed next are in main memory locations physically
close to the location currently being accessed. There are two kinds of
behaviour pattern:
1. Temporal locality: A recently accessed memory location is
likely to be accessed again.
2. Spatial locality: A location neighbouring a recently
accessed memory location is likely to be accessed.
In view of these two properties, while reading a location from main
memory, the content of the entire block is transferred to and stored in cache
memory. There are more blocks in main memory than lines in cache
memory; hence the cache controller follows a mapping function to
systematically map any main memory block to one of the cache lines.
When the processor needs a memory operand, the cache controller checks
the cache memory to find out whether the current main memory address is
already mapped onto the cache. If it is, the required item is available in
cache memory; this condition is called a 'cache hit', and the required
information is read from cache memory.
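As a concrete sketch of such a mapping function, the following Python fragment models a direct-mapped cache, one common choice. The block size, number of lines, and function names are illustrative assumptions, not taken from any particular processor.

```python
BLOCK_SIZE = 64      # bytes per block/line (assumed)
NUM_LINES = 128      # lines in the cache (assumed)

def map_address(address):
    """Map a main-memory byte address to a cache line (direct mapping).

    The block number selects the line; the remaining high-order bits
    form the tag that is stored with the line and checked on access."""
    block_number = address // BLOCK_SIZE
    line_index = block_number % NUM_LINES
    tag = block_number // NUM_LINES
    return line_index, tag

def access(cache, address):
    """Return True on a cache hit, False on a miss (block then loaded)."""
    line, tag = map_address(address)
    if cache.get(line) == tag:
        return True          # cache hit
    cache[line] = tag        # miss: bring the whole block into the line
    return False
```

A first touch of a block misses, but a second access to any address in the same block hits, which is exactly the spatial-locality payoff of transferring whole blocks.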
On the other hand, if the current main memory address is not
mapped in the cache, the required information is not available there; this
situation is known as a 'cache miss'. In this case, the entire block
containing the main memory address is brought into the cache memory.
The time taken to bring the required item from the main memory and
supply it to the processor is known as the 'miss penalty'. The hit rate (also
known as the hit ratio) is the fraction of the total number of accesses that
resulted in a 'cache hit'.
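These quantities combine into the familiar average-access-time estimate. The Python sketch below is illustrative only (the function names are ours): every access pays the cache hit time, and the fraction that misses also pays the miss penalty.

```python
def hit_rate(hits, total_accesses):
    """Fraction of accesses that found the item already in cache."""
    return hits / total_accesses

def average_access_time(hit_time, miss_penalty, miss_rate):
    """Average memory access time in cycles: every access pays the
    hit time; the missing fraction also pays the miss penalty."""
    return hit_time + miss_rate * miss_penalty
```

For example, with a 1-cycle hit time, a 100-cycle miss penalty and a 95% hit rate, the average access works out to 6 cycles, which shows how strongly a small miss rate is amplified by a large miss penalty.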
The cache memory is of two types: Unified cache or common cache,
and Split cache. The unified cache stores both instructions and data. In
split cache, there is a separate instruction cache (also known as code
cache) and data cache. Some computers use a two level or three level
cache memory system. The cache immediately next to the processor is
known as level 1 cache or primary cache. The next level cache is called a
level 2 cache or secondary cache. Most microprocessors now incorporate
multi-level caches on-chip.
2.3.6 Virtual Memory
The virtual memory concept facilitates the execution of large programs in
systems with smaller physical memory. Virtual memory is desirable in the
following two cases:
1. The logical memory space of the processor is small
2. The physical main memory has to be kept small to reduce cost,
even though the processor has a large logical memory space.
Figure. 2.9 illustrates the concept of virtual memory. In a virtual
memory system, the OS automatically manages large programs by
storing the entire program on the hard disk. At a given time, only some
portions of the program are present in main memory. During execution,
different portions of the program are swapped between the main memory
and the hard disk as needed. The program does not address the physical
memory directly.
CM - Cache memory; optional unit.
Figure. 2.9: Virtual memory concept
While referring to an instruction or operand, the program provides a
logical address, and the virtual memory hardware (the memory
management unit, or MMU) in the processor translates it into the equivalent
physical memory address [9]. There are two popular methods of
implementing virtual memory: paging and segmentation. In paging, the
system software divides the program into pages of equal size. In
segmentation, the machine language programmer organizes the program
into segments which need not be of the same size. Figure. 2.10 illustrates
the mechanism of virtual memory.
Figure. 2.10: Virtual memory mechanism
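The translation step performed by the MMU can be sketched for the paging case. The Python fragment below is an illustrative model only (the page size, table layout and names are assumptions): the page number indexes a page table, the offset within the page passes through unchanged, and a reference to a non-resident page raises a page fault for the OS to service.

```python
PAGE_SIZE = 4096   # bytes per page (assumed)

def translate(logical_address, page_table):
    """Translate a logical address to a physical one, as an MMU would.

    The high-order bits select a page-table entry; the low-order bits
    (the offset within the page) are carried over unchanged. A missing
    entry means the page is not in main memory: a page fault, and the
    OS must bring the page in from the hard disk."""
    page_number = logical_address // PAGE_SIZE
    offset = logical_address % PAGE_SIZE
    if page_number not in page_table:
        raise LookupError("page fault: page %d not resident" % page_number)
    frame_number = page_table[page_number]
    return frame_number * PAGE_SIZE + offset
```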
2.3.7 Multicore CPU
Building a high performance computer system by linking together
several low performing computers is a standard technique of achieving
parallelism. This idea is the basis for development of multiprocessor
systems. Designing a microcomputer using multiple single-chip
microprocessors has been a cost-effective strategy for several years in the
past. The latest trend is the design of multicore microprocessors, resulting
in a fundamental change in the way multiprocessor systems are developed
and used for various applications [10]. Figure. 2.11 illustrates the concept of
multicore with four cores on a single die. Figure. 2.12 illustrates the
organization of the SPARC64 VII, a popular quad core CPU.
Figure. 2.11: A Quad-core CPU
Figure. 2.12: SPARC64 VII Processor
Chip Multiprocessing technology is an architecture in which multiple
physical cores are integrated on a single processor module. Each physical
core runs a single execution thread of a multithreaded application
independently from other cores at any given time. With this technology,
multi-core processors offer several times the performance of single-core
modules. The ability to process multiple instructions at each clock cycle
provides the performance advantage, but improvements also result from
the short distances and fast bus speeds between chips as compared to
traditional CPU to CPU communication in a multiprocessor system.
2.4 EMBEDDED PROCESSORS
Processors are the main functional units of an embedded system,
and are primarily responsible for processing instructions and data. An
embedded system contains at least one master processor, acting as the
central controlling device, and can have additional slave processors that
work with and are controlled by the master processor. These slave
processors may either extend the instruction set of the master processor or
act to manage buses and input/output (I/O) devices. The complexity of the
master processor usually determines whether it is classified as a
microprocessor or a microcontroller. Traditionally, microprocessors contain
a minimal set of integrated memory and I/O components, whereas the
microcontrollers have most of the system memory and I/O components
integrated on the chip. However, these traditional definitions are becoming
somewhat inaccurate in view of convergence taking place in recent
processor designs. There are literally hundreds of embedded processors
available and these can be grouped into various architectures [6]. What
differentiates one processor group's architecture from another is the set of
machine code instructions that the processors within the architecture group
can execute. Processors are considered to be of the same architecture
when they can execute the same set of machine code instructions. Table
2.6 lists some examples of real-world processors and the architecture
families they fall under. Table 2.7 lists the merits and demerits of different
types of processors that can be embedded in a complex embedded system [8].
Table 2.6: Typical Embedded Architectures and Processors
Architecture Processor Manufacturer
AMD Au1xx Advanced Micro Devices
ARM ARM7, ARM9, ... ARM, ....
ColdFire 5282, 5272, 5307, 5407, ... Motorola/Freescale, ...
M32/R 32170, 32180, 32182,
32192, ...
Renesas/Mitsubishi, ...
MIPS32 R3K, R4K, 5K, 16, ... MT14kx, IDT, MIPS
Technologies, ...
NEC Vr55xx, Vr54xx, Vr41xx NEC Corporation, ...
PowerPC 82xx, 74xx, 8xx, 7xx, 6xx,
5xx, 4xx
IBM, Motorola/Freescale, ...
SuperH (SH) SH3, SH4 Hitachi, ...
SHARC SHARC Analog Devices, Transtech
DSP, Radstone, ...
strongARM strongARM Intel, ...
SPARC UltraSPARC II Sun Microsystems, ...
TMS320C6xxx TMS320C6xxx Texas Instruments
x86 X86 [386, 486, Pentium(II,
III, IV)...]
Intel, Transmeta, National
Semiconductor, Atlas, ...
Tricore Tricore1, Tricore2, ... Infineon, ...
Table 2.7: Processor types in Complex Embedded Systems
1. General purpose microprocessor
   Application: when intensive computations are required and large embedded software is located in external memory cores or chips.
   Advantage: no engineering cost in designing the processor.
   Disadvantage: additional redundant execution units that are not needed in the given system design.

2. Microcontroller
   Application: used with internal memory, devices and peripherals, and when embedded software is located in the internal ROM or flash memory.
   Advantage: no engineering cost in designing the processor.
   Disadvantage: additional manufacturing costs and redundant application units which are not needed in the given system design.

3. DSP
   Application: used with signal processing-related instructions for filters, image, audio, video and CODEC applications.
   Advantage: no engineering cost involved in designing the signal processor.
   Disadvantage: manufacturing cost may be high.

4. Single purpose processors and application specific system processor
   Application: control I/O and bus operations, peripherals and devices.
   Advantage: they support other processing units in the system and execute specific hardware processes fast.
   Disadvantage: in-house engineering cost of development, royalty payments for an IP core of the processor, and time-to-market cost.

5. Multicore processor
   Application: to significantly enhance the performance of the system.
   Advantage: reduced engineering cost.
   Disadvantage: increased manufacturing cost.

6. Accelerator
   Application: to accelerate the execution of code; a floating point coprocessor accelerates mathematical operations and a Java accelerator accelerates Java code execution.
   Advantage: increases performance by co-processing with the main processor.
   Disadvantage: increased engineering cost of development or royalty payments for the IP core of the processor.
2.5 EMBEDDED SYSTEM ARCHITECTURES
Embedded computer systems range from everyday machines - most
of the microwaves and washing machines, printers, network switches, and
automobiles - to handheld digital devices (such as PDAs, cell phones, and
music players) to videogame consoles and digital set-top boxes. Except in
some applications such as PDAs, in many embedded applications, the only
programming occurs at developer's site in connection with the initial loading
of the application code or a later software upgrade of that application. Thus,
the application is carefully tuned for the processor and system [3].
Embedded systems often process information in different ways from
general-purpose processors. Typically these applications include deadline-
driven, so-called real-time, constraints. In these applications, a
particular computation must be completed by a certain time limit,
failing which the system malfunctions. A real-time performance requirement is
one where a segment of the application has an absolute maximum
execution time that is allowed. For example, in a digital set-top box the time
to process each video frame is limited, since the processor must accept
and process the frame before the next frame arrives (typically called hard
real-time systems). In some applications, a more liberal requirement exists:
the average time for a particular task is constrained, as is the
number of instances in which some maximum time is exceeded. Such
approaches (typically called soft real-time) arise when it is possible to
occasionally miss the time constraint on an event, as long as not too many
are missed. Real-time performance tends to be highly application
dependent.
Embedded system applications typically involve processing
information as signals that may be an image, a motion picture composed of
a series of images, a control sensor measurement, and so on. Signal
processing requires specific computation that many embedded processors
are optimized for.
Two other key characteristics exist in many embedded applications:
the need to minimize memory and the need to minimize power. The
importance of memory size translates to an emphasis on code size, since
data size is dictated by the application. Some architectures have special
instruction set capabilities to reduce code size. Larger memories also mean
more power, and optimizing power is often critical in embedded
applications. Although the emphasis on low power is frequently driven by
the use of batteries, the need to use less expensive packaging (plastic
versus ceramic) and the absence of a fan for cooling also demand reduced
power consumption.
Often an application’s functional and performance requirements are
met by combining a custom hardware solution together with software
running on a standardized embedded processor core, which is designed to
interface to such special-purpose hardware. In practice, embedded
problems are usually solved by one of three approaches:
1. The designer uses a combined hardware/software solution that
includes some custom hardware and an embedded processor
core that is integrated with the custom hardware, often on the
same chip.
2. The designer uses custom software running on an off-the-shelf
embedded processor.
3. The designer uses a digital signal processor and custom
software for the processor.
Embedded systems are a very broad category of computing devices.
For example, the TI 320C55 DSP is a relatively “RISC-like” processor
designed for embedded applications, with very fine-tuned capabilities. On
the other end of the spectrum, the TI 320C64x is a very high-performance,
eight-issue VLIW processor for very demanding tasks. Media extensions
attempt to merge DSPs with some more general-purpose processing
abilities to make these processors usable for signal processing
applications. Hennessy and Patterson have examined [3] several case
studies, including the Sony PlayStation 2, digital cameras, and cell phones.
The PlayStation2 performs detailed three-dimensional graphics, whereas a
cell phone encodes and decodes signals according to elaborate
communication standards. But both have system architectures that are very
different from general-purpose desktop or server platforms. In general,
architectural decisions that seem practical for general-purpose applications,
such as multiple levels of caching or out-of-order superscalar execution,
are much less desirable in embedded applications. This is due to chip area,
cost, power, and real-time constraints. The programming model that these
systems present places more demands on both the programmer and the
compiler for extracting parallelism.
2.5.1 Digital Signal Processor
A digital signal processor (DSP) is a special-purpose processor
optimized for executing digital signal processing algorithms [5]. Most of
these algorithms, from time-domain filtering (e.g., infinite impulse response
and finite impulse response filtering), to convolution, to transforms (e.g.,
fast Fourier transform, discrete cosine transform), to even forward error
correction (FEC) encodings, have as their kernel the same operation: a
multiply-accumulate. Each has at its core a sum of
products. To accelerate this, DSPs typically feature special-purpose
hardware to perform multiply-accumulate (MAC). A MAC instruction of
“MAC A, B, C” has the semantics of “A = A + B * C.” In some situations, the
performance of this operation is so critical that a DSP is selected for an
application based solely upon its MAC operation throughput. DSPs often
employ fixed-point arithmetic. In addition to MAC operations, DSPs often
also have operations to accelerate portions of communications algorithms.
An important class of these algorithms revolve around encoding and
decoding forward error correction codes—codes in which extra information
is added to the digital bit stream to guard against errors in transmission. At
one end of the DSP spectrum is the TI 320C55 architecture optimized for
low-power, embedded applications with a seven-stage pipelined CPU.
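The MAC kernel shared by these algorithms can be made concrete with a small sketch. The Python fragment below is illustrative only (a DSP would do this in dedicated hardware, typically with fixed-point arithmetic, and the function names are ours): it expresses the MAC semantics A = A + B * C and uses it as the inner loop of a finite impulse response (FIR) filter.

```python
def mac(a, b, c):
    """The DSP multiply-accumulate primitive: A = A + B * C."""
    return a + b * c

def fir_filter(samples, coefficients):
    """FIR filtering as a chain of MAC operations: each output is the
    dot product of the coefficient vector with a sliding window of
    input samples -- the kernel a DSP's MAC hardware accelerates."""
    n = len(coefficients)
    outputs = []
    for i in range(len(samples) - n + 1):
        acc = 0
        for j in range(n):
            acc = mac(acc, coefficients[j], samples[i + j])
        outputs.append(acc)
    return outputs
```

The inner loop is one MAC per coefficient per output sample, which is why MAC throughput alone can decide the choice of DSP for an application.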
The source of input data to a DSP is some form of digitized signal, such as
a photo image captured by a digital camera, a voice packet going through a
network router, or an audio clip played by a digital keyboard. As with
microcontrollers, DSPs also tend to incorporate many peripherals that are
useful in signal processing on a single IC. For example, a DSP device may
contain a number of analog-to-digital and digital-to-analog converters,
pulse-width modulators, direct memory access controllers, timers, and
counters.
2.5.2 Media Extensions
Media extensions are a middle ground between DSPs and
microcontrollers. These extensions add DSP-like capabilities to
microcontroller architectures at relatively low cost. Because media
processing is judged by human perception, the data for multimedia
operations are often much narrower than the 64-bit data word of modern
desktop and server processors. For example, floating-point operations for
graphics are normally in single precision, not double precision, and often at
a precision less than specified by IEEE 754. Rather than waste the 64-bit
arithmetic-logical units (ALUs) when operating on 32-bit, 16-bit, or even 8-
bit integers, multimedia instructions can operate on several narrower data
items at the same time. Thus, a partitioned add operation on 16-bit data
with a 64-bit ALU would perform four 16-bit adds in a single clock cycle. The
extra hardware required only has to prevent carries between the four 16-bit
partitions of the ALU. For example, such instructions might be used for
graphical operations on pixels [10]. These operations are commonly called
single-instruction multiple data (SIMD) or vector instructions. Most graphics
multimedia applications use 32-bit floating-point operations.
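The partitioned add described above can be sketched in software. The following Python fragment is an illustrative model of the hardware behaviour, not a real SIMD instruction: it treats a 64-bit word as four 16-bit lanes and adds them independently, each lane wrapping modulo 2^16 so that no carry crosses a lane boundary.

```python
MASK16 = 0xFFFF

def partitioned_add(x, y):
    """Add two 64-bit words as four independent 16-bit lanes.

    Hardware does this in one cycle by cutting the carry chain at the
    lane boundaries; here each lane is added separately and wraps
    (modulo 2**16), so a carry never leaks into the neighbouring lane."""
    result = 0
    for lane in range(4):
        shift = 16 * lane
        a = (x >> shift) & MASK16
        b = (y >> shift) & MASK16
        result |= ((a + b) & MASK16) << shift
    return result
```

A single ordinary 64-bit add of the same operands would let a carry out of one lane corrupt its neighbour; suppressing exactly those carries is the small piece of extra hardware the partitioned instruction needs.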
2.5.3 Embedded Multiprocessors
In the embedded space, a number of special-purpose designs have
used customized multiprocessors, including the Sony PlayStation 2 [7].
Many special-purpose embedded designs consist of a general-purpose
programmable processor or DSP with special-purpose, finite-state
machines that are used for stream-oriented I/O. In applications ranging
from computer graphics and media processing to telecommunications, this
style of special-purpose multiprocessor is becoming common. The
inter-processor interactions in such designs are highly regimented and
relatively simple, consisting primarily of a simple communication channel;
nevertheless, because much of the design is committed to silicon, ensuring
that the communication protocols among the input/output processors and
the general-purpose processor are correct is a major challenge in such
designs. As a recent trend, embedded multiprocessors are built from
several general-purpose processors. These multiprocessors have been
focused primarily on the high-end telecommunications and networking
market, where scalability is critical. An example of such a design is the
MXP processor designed by empowerTel Networks for use in voice-over-IP
systems. The MXP processor consists of four main components:
1. An interface to serial voice streams, including support for
handling jitter
2. Support for fast packet routing and channel lookup
3. A complete Ethernet interface, including the MAC layer
4. Four MIPS32 R4000-class processors, each with its own cache
(a total of 48 KB or 12 KB per processor)
The MIPS processors are used to run the code responsible for
maintaining the voice-over-IP channels, including the assurance of quality
of service, echo cancellation, simple compression, and packet encoding.
Since the goal is to run as many independent voice streams as possible, a
multiprocessor is an ideal solution. Because of the small size of the MIPS
cores, the entire chip takes only 13.5 million transistors. Future generations of
the chip are expected to handle more voice channels, as well as do more
sophisticated echo cancellation, voice activity detection, and more
sophisticated compression.
Multiprocessing is becoming widespread in the embedded
computing arena for two primary reasons. First, the issues of binary
software compatibility, which plague desktop and server systems, are less
relevant in the embedded space. Often software in an embedded
application is written from scratch for an application or significantly
modified. Second, the applications often have natural parallelism,
especially at the high end of the embedded space. Examples of this natural
parallelism abound in applications such as a set-top box, a network switch,
a cell phone or a game system. The lower barriers to the use of thread-level
parallelism together with the greater sensitivity to die cost (and hence
efficient use of silicon) are leading to widespread adoption of
multiprocessing in the embedded space, as the application needs grow to
demand more performance.
Desktop computers and servers rely on the memory hierarchy to
reduce average access time to relatively static data, but there are
embedded applications where data are often a continuous stream. In such
applications there is still spatial locality, but temporal locality is much more
limited. The steady stream of graphics and audio demanded by electronic
games leads to a different approach to memory design: high
bandwidth via many dedicated independent memories.
2.6 MIPS32 Vs OTHER RISC PROCESSORS
Although the modern version of the RISC design dates to the 1980s,
a number of systems from the 1970s have been credited as the first RISC
architectures, partly based on their use of the load/store approach. For
example, the CDC 6600 designed by Seymour Cray in 1964 used a
load/store architecture with only two addressing modes (register+register,
and register+immediate constant) and 74 opcodes, with the basic clock
cycle/instruction issue rate being 10 times faster than the memory access
time [24,25].
The modern RISC revolution started with projects at Stanford
University, the University of California, Berkeley, and IBM. Stanford's design
led to the successful MIPS architecture, while Berkeley's RISC project was
commercialized as the SPARC. Another success from this era was
IBM's 801, which eventually led to the Power Architecture. As these projects
matured, a wide variety of similar designs flourished in the late 1980s and
early 1990s, representing a major force in the Unix workstation market as
well as embedded processors in laser printers, routers and similar
products. The Berkeley RISC project delivered the RISC-I processor in
1982. Compared with averages of about 100,000 transistors in newer CISC
designs of the era, the RISC-I consisted of only 44,420 transistors and had
only 32 instructions with three addressing modes, and yet it completely
outperformed any other single-chip design. It was followed in 1983 by the
40,760-transistor, 39-instruction RISC-II, which ran over three times as fast
as RISC-I. In 1986, Hewlett Packard started using an early implementation
of their PA-RISC in some of their computers. In the meantime, the Berkeley
RISC effort had become so well known that it eventually became the name
for the entire concept and in 1987 Sun Microsystems began shipping
systems with the SPARC processor, directly based on the Berkeley RISC-II
system.
Well-known RISC families include DEC Alpha, AMD 29k, ARC,
ARM, Atmel AVR, Blackfin, Intel i860 and i960, MIPS, Motorola 88000, PA-
RISC, Power (including PowerPC), SuperH, and SPARC. In the 21st
century, the use of ARM architecture processors in smart phones and
tablet computers such as the iPad, Android, and Windows RT tablets
provided a wide user base for RISC-based systems. RISC processors are
also used in supercomputers such as the K computer, the fastest on the
TOP500 list in 2011, and Sequoia, the fastest on the 2012 list.
Over the years, RISC instruction sets have grown in size, and today
many of them have a larger set of instructions than many CISC CPUs.
Some RISC processors, such as the PowerPC, have instruction sets as
large as that of the CISC IBM System/370; conversely, the DEC
PDP-8, clearly a CISC CPU because many of its instructions involve
multiple memory accesses, has only 8 basic instructions and a few
extended instructions. RISC architectures are now used across a wide
range of platforms, from cellular telephones and tablet computers to some
of the world's fastest supercomputers such as the K computer, the fastest
on the TOP500 list in 2011. As of 2014, a new research ISA, RISC-V, has been under development at the University of California, Berkeley, emphasizing features such as manycore and heterogeneous multiprocessing, virtualisability, and dense instruction encoding.
2.6.1 CISC and RISC Convergence
State of the art processor technology has changed significantly since
RISC chips were first introduced in the early '80s. Because a number of
advancements are used by both RISC and CISC processors, the lines
between the two architectures have begun to blur. In fact, the two
architectures almost seem to have adopted the strategies of the other.
Since the processor speeds have increased, CISC chips are now able to
execute more than one instruction within a single clock. This also allows
CISC chips to make use of pipelining. With other technological
improvements, it is now possible to fit many more transistors on a single
chip. This gives RISC processors enough space to incorporate more
complicated, CISC-like commands. RISC chips also employ more complicated hardware, with extra function units for superscalar execution. All of these factors have led some groups to conclude that now
in the present "post-RISC" era, the two architectures have become so
similar that distinguishing between them is no longer relevant. However, it
should be noted that RISC chips still retain some important traits. RISC
chips strictly utilize uniform, single-cycle instructions. They also retain the
register-to-register, load/store architecture. And despite their extended
instruction sets, RISC chips still have a large number of general purpose
registers.
The question of whether the ISA plays an intrinsic role in performance or energy efficiency is becoming important [26]. The traditionally low-power ARM ISA (a RISC) is entering the high-performance server market, while the traditionally high-performance x86 ISA (a CISC) is entering the low-power mobile device market.
The MIPS architecture, which grew out of a graduate course by John L. Hennessy at Stanford University in 1981, resulted in a functioning system in 1983 and could run simple programs by 1984. The MIPS approach
emphasized an aggressive clock cycle and the use of the pipeline, making
sure it could be run as "full" as possible. The MIPS system was followed by
the MIPS-X and in 1984 Hennessy and his colleagues formed MIPS
Computer Systems. The commercial venture resulted in the R2000
microprocessor in 1985, and was followed by the R3000 in 1988. The
company was purchased by Silicon Graphics, Inc. in 1992, and was spun
off as MIPS Technologies, Inc. in 1998. Subsequently, Imagination Technologies bought the company.
2.7 MIPS32 INSTRUCTIONS AND CODE WASTAGE
RISC processors generally have three types of instructions: ALU,
Load or store, and Branch and Jump. Though RISC processors have
a limited number of addressing modes, there are variations among processors. The MIPS processor has only two addressing modes: immediate and displacement, both with 16-bit fields [3].
Figure. 2.3 seen earlier in section 2.2.2 summarises the basic
formats of MIPS32 integer instructions [27] with examples. The length of
the fields in bits is indicated inside brackets. All the instructions are 32-bits
and the most significant six bits contain the opcode. In the I-type and J-type
instructions, the opcode itself indicates the exact operation. In the R-type
instructions, the op field identifies the instruction type and the fn field (least
significant bits 0-5) indicates the exact operation. For example, the six-bit pattern 000000 in op identifies all R-type instructions, and the fn pattern indicates the exact function, i.e., whether the instruction is add, and, sub, mul, div, a shift, etc. For the and instruction, fn is 0x24 whereas for the or instruction, fn is 0x25. The R-type is for register-to-register operations.
The I-type is for data transfers, branches, and immediate operations. In
load/store type instructions, the offset field is added to the contents of the
rs register, usually an address, to form the effective address for one of the
operands, either the source or destination.
The branch instructions use a signed 16-bit offset field, enabling a jump of up to 2^15 - 1 instructions forward or 2^15 instructions backward. In I-type
arithmetic instructions, the immediate field is sign-extended to 32-bits to
form one of the operands, and the other operand is available in the rs
register. In I-type logical instructions, the immediate field is zero-extended
to form the second operand and the rs register has the first operand. The
J-type is for jumps, and the target instruction is identified by the 26-bit target field. The actual byte address is formed by shifting the 26-bit target left by two bits and taking the upper four bits from the program counter. There are two more jump instructions, jr and jalr, which follow a different format: they have no target field and instead take the target address from the rs register.
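The field layout described above can be illustrated with a small sketch. The following Python fragment (an illustration, not part of the thesis toolchain) decodes an I-type word and shows the sign/zero extension rules and the branch offset range:

```python
# Field layout of a 32-bit MIPS32 I-type instruction:
# op(6) | rs(5) | rt(5) | immediate(16)

def decode_itype(word):
    """Split a 32-bit instruction word into I-type fields."""
    op = (word >> 26) & 0x3F
    rs = (word >> 21) & 0x1F
    rt = (word >> 16) & 0x1F
    imm = word & 0xFFFF
    return op, rs, rt, imm

def sign_extend16(imm):
    """Sign-extend a 16-bit field to 32 bits (arithmetic immediates, offsets)."""
    return imm - 0x10000 if imm & 0x8000 else imm

def zero_extend16(imm):
    """Zero-extend a 16-bit field (logical immediates such as andi, ori)."""
    return imm & 0xFFFF

# addi $t0, $t1, -4  ->  op=0x08, rs=9, rt=8, imm=0xFFFC
word = (0x08 << 26) | (9 << 21) | (8 << 16) | 0xFFFC
op, rs, rt, imm = decode_itype(word)
assert (op, rs, rt) == (0x08, 9, 8)
assert sign_extend16(imm) == -4
# Branch range: a signed 16-bit offset spans -2**15 .. 2**15 - 1 instructions.
assert (-(2**15), 2**15 - 1) == (-32768, 32767)
```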
The drawbacks of RISC instruction formats due to fixed instruction
size feature are as follows:
1. Several bits are unused in many instructions. Table 2.8 lists the
extent of unused bits in six integer instructions of the MIPS32 ISA,
since all instructions have to be 32 bits.
2. The R-type instructions use a total of 12 bits to specify the
operation, though there are at most 64 different R-type
operations in the MIPS32 ISA.
Table 2.8: Typical Wastage of Bits in MIPS32 Instructions

Instruction   Action                  No. of unused bits
rfe           Return from exception   19
syscall       System call             20
nop           No operation            20
addu          Addition                5
mult          Multiply                10
lui           Load upper immediate    5
3. In immediate-type instructions such as addi, 16 bits are used
to specify the immediate operand. In most cases, 8 bits are
sufficient, and the remaining 8 bits become redundant.
Similarly, in branch instructions such as beq, the offset field is
underutilized whenever the required offset can be specified
in 8 bits.
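A minimal sketch of the kind of measurement behind drawback 3, assuming a hypothetical sample of sign-extended immediate values (the actual analysis over real object codes is presented in chapter 3):

```python
def fits_signed(value, bits):
    """True if value is representable as a signed two's-complement field of `bits` bits."""
    lo, hi = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    return lo <= value <= hi

# Hypothetical sample of sign-extended immediates taken from I-type instructions.
immediates = [0, 4, -8, 100, 300, -200, 1, 16, -1, 5000]
small = sum(fits_signed(v, 8) for v in immediates)
ratio = small / len(immediates)
# Here 7 of the 10 immediates fit in a signed 8-bit field; for those,
# the upper 8 bits of the 16-bit immediate field carry no information.
```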
The impact of these drawbacks on the code size has been quantified
in chapter 3 by analysing typical embedded object codes with the help of a
custom built tool. The outcome of this analysis has formed the basis for the
architectural modifications proposed in chapter 4 and chapter 5.
2.8 CODE SIZE REDUCTION IN EMBEDDED SYSTEMS
In embedded applications, every bit of code counts since it directly
affects both the program memory size, and the amount of bit traffic
between the program memory and the processor. Static code size is
directly proportional to cost in terms of program ROM size in embedded
systems. Dynamic code size has repercussions on instruction cache
effectiveness and hence on performance. Depending on the complexity of the system, the code memory can constitute more than 50% of the embedded product.
The instruction fetches take 5 to 15% of the execution time for a typical 32-
bit embedded RISC processor [7]. Since embedded systems are not user
programmable, several techniques are available to the developers, both at
compiler level and hardware level for compressing the original code
generated by the compiler. However, most solutions reduce performance.
Although this thesis favours redesigning existing RISC processors, a review of the philosophy behind these code compression techniques and of the extent of code compression achieved is provided to help appreciate the benefits of the architectural solution proposed here.
Several techniques to reduce code size have been implemented
[28]. These are classified into three types [2]: Code compression, Compiler
techniques and Ad hoc ISA modification. The first two techniques retain the
original ISA whereas the third technique involves supporting a new
instruction set that is a subset of the original ISA. An overview of these
three techniques is given below.
2.8.1 Code Compression
Code compression, initially applied to single issue processors such
as CISC and RISC, is now used in VLIW processors also. The
compression methods [28] are based on traditional data compression
techniques including entropy encoding, such as Huffman encoding [29] and
arithmetic coding [30,31,32], dictionary-based compression [33], operand
factorization [34], and re-encoding the original RISC instructions, to name a
few. Code compression involves compressing the executable RISC object code offline and storing the compressed code in code memory. The
decompression is done on-the-fly, for each instruction, during program
execution. The decompression unit is placed between the processor core
and memory either as post-cache (between the cache and the processor),
or as pre-cache (between the code memory and the cache) [35]. In the pre-
cache architecture, the code memory contains compressed code but the
instruction cache memory contains uncompressed code. Decompression
occurs whenever there is a cache miss and hence it is not time critical. In
the post-cache architecture, both code memory and instruction cache
contain compressed code. Decompression occurs during every instruction
fetch and hence it is in the critical path of the instruction pipeline.
The criterion to measure the efficiency of a code compression
scheme is compression ratio, which is defined as the ratio of the size of the
compressed program over the size of the original program. A large body of
knowledge is available on lossless compression [36] and hardware for low
power and high performance compression and decompression has been
proposed [37]. However, there are some distinctive requirements [38]. First,
it must be possible to decompress a program during execution, ensuring
random access, starting from several points inside the program, since
branch, jump, and call instructions can alter the program execution.
Second, compression and decompression algorithms can be highly
asymmetric because compression can be performed once for all (offline)
when the executable is generated, while decompression is performed
during program execution; thus it should be fast and power efficient
because its hardware cost must be fully amortized by the corresponding
savings in memory size and power, without compromising performance.
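As a trivial illustration of the definition (smaller ratios mean better compression):

```python
def compression_ratio(compressed_size, original_size):
    """Compression ratio = compressed size / original size (smaller is better)."""
    return compressed_size / original_size

# e.g. a 60 KB compressed image of a 100 KB original program:
assert compression_ratio(60_000, 100_000) == 0.6
```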
The compression methods [28] result in either variable or fixed-width
instructions. Decompression is more complex with variable-width instructions, as the width of an instruction is not known before decompression. Normally, the code compression strategy does not require
any modification to the processor architecture. The instruction fetch unit
generates the next instruction address, which is normally the sum of the previous instruction address and the size of the previous instruction. On
encountering a branch, jump, or call instruction, the target address will be
calculated and the target instruction will be fetched from the memory or
cache. If the program memory contains the compressed code, a mapping
between the original address space and the compressed address space is
necessary. An alternate approach [33] requires a two-phase offline action after compilation: first, compress the whole program; then, in a second phase, patch the branch offsets to point into the compressed code. In this
approach, the processor needs to be modified to handle unaligned
(compressed) branch targets.
Wolfe and Chanin [30, 39] were the first to apply code compression
to embedded systems. Their scheme known as Compressed Code RISC
Processor (CCRP) uses Huffman coding to compress MIPS object codes,
and a Line Access Table (LAT) to map original program block addresses
and compressed code block addresses. The LAT is stored in program
memory. The code memory has compressed code and the code cache
holds the uncompressed code. Compression is done through a software
tool after linking, and the compressed program is placed into a special
memory area, identified by the linker as a compressed text segment that
also has a special section for decompression tables. A byte-based Huffman
coding algorithm was used with a cache line as the basic block to be
compressed. A TLB like buffer called Cache line address Lookaside Buffer
(CLB) is introduced to minimise LAT accesses and save time.
Decompression is slower since Huffman codes are variable-length codes.
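A minimal sketch of byte-based Huffman coding in the spirit of CCRP, using Python's standard heapq; the byte values and the tie-breaking index are illustrative assumptions, not the CCRP implementation:

```python
import heapq
from collections import Counter

def huffman_code(data):
    """Build a byte-level Huffman code table {byte: bitstring} from frequencies."""
    heap = [(freq, idx, sym) for idx, (sym, freq) in enumerate(Counter(data).items())]
    heapq.heapify(heap)
    if len(heap) == 1:  # degenerate case: only one distinct symbol
        return {heap[0][2]: "0"}
    idx = len(heap)  # unique tie-breaker so the heap never compares symbols
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, idx, (left, right)))
        idx += 1
    codes = {}
    def walk(node, prefix):
        if isinstance(node, tuple):      # internal node
            walk(node[0], prefix + "0")
            walk(node[1], prefix + "1")
        else:                            # leaf: a byte value
            codes[node] = prefix
    walk(heap[0][2], "")
    return codes

# Hypothetical stream of code bytes; frequent bytes get shorter codewords.
code_bytes = bytes([0x24, 0x24, 0x24, 0x25, 0x00, 0x24, 0x25, 0x00])
codes = huffman_code(code_bytes)
compressed_bits = sum(len(codes[b]) for b in code_bytes)
# Decoding must walk the tree bit by bit, which is why variable-length
# Huffman decompression is slower than a fixed-width table lookup.
assert compressed_bits < len(code_bytes) * 8
```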
The CCRP method established the foundation for the IBM Codepack
compression technology for the PowerPC 400 series [40]. Compressed
code is stored in the external memory and CodePack is placed between
the memory and the cache as illustrated in Figure. 2.13. Decompression is
triggered by an instruction cache miss. The translation between the
compressed and uncompressed lines is held in the LAT. The 32-bit PowerPC instructions are divided into two 16-bit parts, and a separate Huffman table is used for each part. The Huffman-like codewords are assigned
on a frequency distribution basis. Words are grouped in sets and words
belonging to the same set have been assigned codewords of the same
length. For each cache miss, CodePack fetches and decompresses two cache blocks instead of only the one requested. This approach does not
involve compiler modification or processor design change. The original
work of Wolfe and Chanin achieves 30 to 50% compression ratio whereas
IBM CodePack technique gives compression ratio between 36% and 47%.
2.8.2 Dictionary-based Compression
Dictionary-based compression is another compression method
[38,28,41]. It is based on the property that the same instructions with the
same operands reappear in the embedded object code repeatedly. The
compression algorithm creates a dictionary of distinct instructions, and
replaces each instruction in the original program with the corresponding
index to the dictionary as illustrated in Figure. 2.14. Thus, the instructions
are substituted by 'codewords'.
Figure. 2.13: IBM Codepack Code Compression for Power PC
As the codeword is smaller than the original instruction, the size of
the code is reduced. During program execution, the codeword (dictionary
index), fetched from the program memory, is used to fetch the original
uncompressed instructions in the dictionary. Figure. 2.15 illustrates the
decompression operation of the dictionary method of compression. Given a
program with N unique instructions, the length of the codeword is ⌈log2 N⌉ bits.
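The scheme can be sketched as follows; the instruction words are hypothetical, and a real implementation would emit packed codewords rather than Python integers:

```python
import math

def dictionary_compress(program):
    """Replace each 32-bit instruction with an index into a dictionary
    of the program's unique instructions (the 'codewords' above)."""
    dictionary = sorted(set(program))
    index = {ins: i for i, ins in enumerate(dictionary)}
    codewords = [index[ins] for ins in program]
    codeword_bits = max(1, math.ceil(math.log2(len(dictionary))))
    return dictionary, codewords, codeword_bits

# Hypothetical program in which the same instructions reappear.
program = [0x8C820000, 0x00431020, 0x8C820000, 0xAC820004, 0x00431020]
dictionary, codewords, bits = dictionary_compress(program)
assert len(dictionary) == 3          # N = 3 unique instructions
assert bits == 2                     # ceil(log2 3) = 2-bit codewords
# Decompression is a plain table lookup: dictionary[codeword].
assert [dictionary[c] for c in codewords] == program
```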
Figure. 2.14: Dictionary based compression
Figure. 2.15: Decompression procedure for the dictionary based
compression
The dictionary is usually implemented in ROM in the control path of
the processor. Dictionary-based compression is a simple scheme offering
fast decompression. The decompressor is actually a simple table; it can be
integrated with the instruction decoder into a single pipeline stage. Though
this scheme is a straightforward one, offering inexpensive address
translation and sizable reduction of memory fetch bandwidth (i.e., number
of bits transferred from code memory to execute a program), [7] argues that
'this approach is the least appealing for an embedded system'. On the
other hand, [39] establishes that the dictionary-based compression is
competitive with CodePack for static footprint compression, and achieves
superior results for bus traffic and energy reduction.
In the expression-tree-based algorithms [42] for code compression proposed by Guido et al., the encoded symbols are extracted from program expression trees, and dictionary-based decompression engines are implemented.
2.8.3 Compiler Techniques
Modern embedded compilers are often more complex than general
purpose compilers. A traditional compiler mainly aims to optimize a one-
dimensional cost function represented by the number of cycles needed to
execute a program. On the other hand, for an embedded compiler, code
size and energy are equally important as the speed of execution. Certain
scalar optimizations by a traditional compiler are relevant in embedded systems also. For example, transformations such as dead code elimination, common subexpression elimination, strength reduction, copy propagation, and constant folding reduce code size and power consumption apart from improving speed. However, certain ILP-oriented optimizations such as loop unrolling, tail duplication, procedure inlining and cloning, speculation, and global code motion offer better speed but may hurt code size and power
consumption [7]. Research on code compression has been very active in
the compiler community [11, 43] with the goal of finding compact program
representations. Pure software techniques [39] by compiler to reduce
program size and decompress instructions during execution have been
popular among embedded community. Compiler techniques for code
compression for RISC architectures, by Cooper and McIntosh [44] map
isomorphic instruction sequences into abstract routine calls or cross-
jumping. A profile-guided code compression to apply Huffman coding to
infrequently executed functions has been suggested by Debray and Evans
[45], [46]. A control flow graph centric software approach to reduce memory
space consumption has been proposed by Ozturk et al [47]. Their approach
involves on-the-fly compression/decompression of object codes of
embedded applications. A flexible decompressor approach, applicable to multiple platforms, was proposed by Shogan and Childers [48], who implemented IBM's CodePack algorithm within the fetch step of a Software Dynamic Translator (SDT) in a pure software infrastructure. Thus compiler techniques for code compression involve register renaming, interprocedural optimization, and procedural abstraction of repeated code fragments. Procedural abstraction is a program optimization technique
that replaces repeated sequences of common code with calls to a single
procedure. The above compiler techniques are attractive since they have
no runtime decompression overheads, do not require any hardware change
and the code generated can be directly executed by the processor.
However, there is a need to modify the software tools such as compilers
and linkers.
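Procedural abstraction can be sketched as follows; the instruction tuples and the label abs_0 are hypothetical, and real implementations must also account for register and control-flow constraints:

```python
def abstract_procedure(code, seq):
    """Replace each occurrence of the repeated sequence `seq` with a CALL
    to a single out-of-line copy (procedural abstraction)."""
    out, i, n = [], 0, len(seq)
    while i < len(code):
        if code[i:i + n] == seq:
            out.append(("CALL", "abs_0"))   # call the abstracted procedure
            i += n
        else:
            out.append(code[i])
            i += 1
    procedure = seq + [("RET",)]            # the single shared copy
    return out, procedure

# Hypothetical instruction stream in which a 3-instruction sequence repeats.
seq = [("lw", "r2"), ("add", "r3"), ("sw", "r2")]
code = seq + [("nop",)] + seq + [("j", "end")]
new_code, proc = abstract_procedure(code, seq)
# With two occurrences of a 3-instruction sequence this just breaks even
# (8 instructions -> 4 + 4); each additional occurrence saves two more.
assert new_code.count(("CALL", "abs_0")) == 2
assert len(new_code) + len(proc) <= len(code) + 1
```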
2.8.4 Ad hoc ISA Modification
This approach customizes the existing RISC instruction set
architecture with narrow instructions supporting fewer operations, smaller
operand fields, and fewer registers. For example, the Thumb [49]
instruction set is a modification of the original ARM instruction set (32-bit
instructions). It has 36 different 16-bit instructions which form a subset of the ARM instructions. Similarly, in MIPS16, a subset of the 32-bit MIPS instructions is mapped to 16-bit instructions that are translated in real time into 32-bit MIPS instructions. This approach involves a considerable effort
to design the new instruction set and requires a new instruction decoder, a
new set of software development tools, such as compiler, assembler, and
linker. A code saving of up to 40% has been reported. However, the dense instruction sets often cause performance penalties [39] due to the lack of instructions. Also, the processor hardware needs additional decoder/decompression logic to support both ISAs. Both ARM and MIPS have responded to the first criticism by introducing Thumb2 and microMIPS. The ISAs of these processors support two instruction sizes: 16-bit and 32-bit. Although the performance degradation has been addressed to a certain extent, the processors still need additional decoder/converter logic to detect 16-bit instructions and convert them into 32-bit instructions.
There have been attempts to develop tiny RISC processors [50].
The DMN-6 has 16 registers of 8 bits each, executes just 12 instructions, and has no cache memory. Known as a minimal RISC processor, it is meant exclusively for use in toys.
2.9 ISA LEVEL CODE SIZE REDUCTION
Instruction set architects have broadly used two techniques to reduce the relative energy cost of instruction stream delivery. One
approach is to increase the amount of work performed by a single
instruction. Vector machines, for example, reduce instruction bandwidth
demands by expressing a large amount of SIMD parallelism in a single
instruction [9]. CISC machines do so by combining multiple simple
operations into a single instruction and providing more addressing modes.
An alternate approach is to reduce the size of the instructions. CISC
instruction sets generally have been composed of variable-length
instructions: the simple and more common ones are usually encoded in
fewer bits than those that require more operands or occur less frequently.
RISC ISAs initially sacrificed the code density advantages of variable-
length instruction encodings in favour of simple, fixed length 32-bit
encodings. Subsequently, RISC instruction set extensions have provided
fixed-length 16-bit encodings (as in ARM Thumb and MIPS16), although
often at the expense of performance and limited access to some hardware
features. Next generation RISC ISAs (as in ARM Thumb2, micro MIPS and
RISC-V) partly resolve these drawbacks by encoding the most common
instructions densely, while maintaining most or all of the functionality of the
32-bit ISA. However, these ISAs have not fully resolved the issue of code density, since they continue to give importance to pipeline design complexity; hence they have only two different instruction sizes, two bytes and four bytes. These are still called variable-instruction-length ISAs, which is a misnomer; hybrid instruction length is the more appropriate term. In contrast, the hybrid length encoding proposed in this thesis recommends a new ISA with four different instruction sizes, reducing the average length of instructions with the goal of minimizing code memory size. It also improves energy per operation by reducing instruction fetch traffic.
Depending on the memory word size, with a stream of hybrid-length instructions, some instructions will span more than one memory word and will require more than one memory access to fetch. Figure. 2.16 illustrates a memory map of a sequence of x86
instructions [11]. The digits indicate the instruction number in the stream.
The eight instructions in the stream require seven memory cycles, giving
0.875 memory cycles per instruction. For this example, the average
number of bytes per instruction is 3.375. Published statistics on the IBM S/360 show that this CISC architecture averages approximately four bytes per instruction [11].
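These figures can be reproduced with a small sketch, assuming purely sequential fetch of a packed stream with one memory cycle per word; the individual instruction lengths below are hypothetical, chosen to total 27 bytes:

```python
import math

def fetch_stats(lengths, word_size=4):
    """Static fetch statistics for a packed variable-length instruction stream:
    (average bytes per instruction, memory cycles per instruction)."""
    total_bytes = sum(lengths)
    words = math.ceil(total_bytes / word_size)   # one memory cycle per word
    return total_bytes / len(lengths), words / len(lengths)

# Hypothetical byte lengths for an 8-instruction x86-like stream (27 bytes).
lengths = [3, 4, 2, 5, 3, 4, 2, 4]
avg_bytes, cycles_per_ins = fetch_stats(lengths)
assert avg_bytes == 3.375            # 27 / 8 bytes per instruction
assert cycles_per_ins == 0.875       # 7 memory words / 8 instructions
```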
2.10 CONCLUSIONS
This chapter provides an overview of various attributes of ISA and
different types of embedded processors. The cause of the increased code size in embedded processors is illustrated with the example of the MIPS32 ISA. Different techniques for code size reduction in embedded systems have been briefly reviewed in this chapter.
Figure. 2.16: Memory map of variable instruction stream
The next chapter analyses the behaviour of embedded object codes of MIPS32, and Chapter 4 discusses two different techniques of hybrid instruction encoding for the MIPS32 processor to minimise the code size.
3. BEHAVIOUR OF EMBEDDED CODES FOR RISC
The embedded domain has a wide range of applications, from sensor systems to smart cellular phones. In many cases in the embedded domain,
it is difficult to isolate the software of an embedded system from the system
itself. Unlike the SPEC [3] for the general-purpose domain, there is no
dominant benchmark suite for the embedded domain. However, certain
industrial and academic benchmark packages [7] are available for the
embedded domain. MediaBench [51], MiBench [52], Berkeley Design
Technology, Inc. (BDTI), and Embedded Microprocessor Benchmark
Consortium (EEMBC) [43, 53] are four popular benchmark suites that are
commonly used by the embedded community. Whereas the MiBench and
MediaBench are academic packages containing sets of publicly available
programs that cover several embedded applications, the other two are
commercial suites. The BDTI contains DSP benchmarks written in
assembly language, and is very specific for simple DSPs and has limited
applicability outside this domain.
The EEMBC contains several sub-domains of benchmarks, including
automotive, imaging, consumer, and telecommunications sections. This
research work has identified 23 embedded applications, to cover the entire
spectrum of BOPES, from two representative sets of embedded benchmarks, MiBench and MediaBench. These applications are cross
compiled for the MIPS32 processor prior to static program analysis. Static
program analysis is the analysis of a computer program without actually
executing the program (analysis performed on executing programs is
known as dynamic analysis). The analysis is usually performed by an
automated tool either on the source code or on the object code. Due to the need for flexibility, it was decided to develop a new stand-alone tool,
as part of the research work, for analysing the MIPS32 object codes. This
also enabled incorporation of additional features at later stages.
This chapter provides an analysis of the object codes of 23
embedded benchmarks from MiBench and MediaBench to understand the
behavior of embedded applications and determine the strategy for
minimising the code size. Initially, a description of the two benchmarks
suites is provided. This is followed by a discussion on the behavior of MIPS
object codes of the embedded benchmarks, using MIDA, the custom built
code analyzer for MIPS32 object codes. Apart from measuring the static
instruction frequencies, this tool estimates the extent of underutilization of the offset and immediate fields in the object codes.
3.1 MIBENCH BENCHMARKS
MiBench is a set of benchmark programs in C for six embedded application areas: Automotive and Industrial Control, Consumer Devices, Office Automation, Networking, Security, and Telecommunications.
Table 3.1 lists the MiBench programs used for evaluating the HIE for
MIPS32. For certain applications, there are two versions: a small data set
version and a large data set version. The small data set represents a lightweight, useful embedded application of the benchmark, while the large data set provides a more stressful, real-world application. Typical applications of
Automotive and Industrial Control are air bag controllers, engine
performance monitors and sensor systems. These benchmarks perform
mathematical calculations, bit counting, sorting and image recognition.
The automotive and industrial control category is representative of embedded control systems. The typical examples of consumer devices are
scanners, digital cameras and Personal Digital Assistants (PDAs). The
benchmarks mainly consist of multimedia applications with the
representative algorithms for jpeg encoding/decoding, image colour format
conversion, image dithering, colour palette reduction, MP3
encode/decoding and HTML typesetting. Most of the algorithms are taken
from SGI TIFF utilities. The typical examples of network devices are
switches and routers. The work done by the embedded processors in these
devices involves shortest path calculations, tree and table backups and
data input/output. The algorithms used in these benchmarks are finding a
shortest path in a graph and creating and searching a Patricia trie data
structure. The Telecommunications benchmarks have algorithms for voice
encoding / decoding, frequency analysis and checksum calculation. With
the popularity of the internet, the trend is to integrate wireless communication into many portable consumer devices. The Office applications are primarily text manipulation algorithms. The typical examples of office automation are printers, fax machines and word processors. The PDAs, though grouped under consumer devices, involve heavy manipulation of text for data
organization. The security benchmarks have algorithms for data encryption,
decryption and hashing. There are some benchmarks common to network,
security and telecommunication classes.
Table 3.1: MiBench Benchmarks
Auto/Industrial Domain
Program Functions
basicmath Simple mathematical calculations such as cubic function
solving, integer square root and angle conversions from
degrees to radians; these are needed for calculating
road speed or other vector values.
bitcount Tests the bit manipulation abilities of a processor by
counting the number of bits in an array of integers; five
methods are used by this program.
qsort Sorts a large array of strings into ascending order using
the quick sort algorithm
susan An image recognition package for recognizing corners
and edges, and typically used for a vision based quality
assurance application. It can smooth an image and has
adjustments for threshold, brightness, and spatial
control.
Consumer Domain
Program Functions
jpeg An algorithm for image compression and
decompression; commonly used to view images
embedded in documents. JPEG is a standard, lossy
compression image format.
lame An MP3 encoder that supports constant, average and
variable bit-rate encoding
typeset A general typesetting tool with a front-end processor for
HTML; representative of a core component of a web
browser that might be used in a consumer device. It
captures the processing required to typeset an HTML
document, without any rendering overheads.
Office Domain
Program Functions
stringsearch Searches for given words in phrases using a case
insensitive comparison algorithm
ispell A fast spelling checker supporting contextual spell
checking, correction suggestions, and languages other
than English; It is similar to Unix spell, but faster.
rsynth A text to speech synthesis program. It integrates several
pieces of public domain code into a single program.
Network Domain
Program Functions
dijkstra Constructs a large graph in an adjacency matrix
representation and then calculates the shortest path
between every pair of nodes using repeated applications
of Dijkstra's algorithm that is a well known solution to the
shortest path problem.
patricia Creates and searches a Patricia trie structure that is a
data structure used in place of full trees with very sparse
leaf nodes. Branches with only a single leaf are
collapsed upwards in the trie to reduce traversal time at
the expense of code complexity. Patricia tries are
commonly used in network applications to represent
routing tables.
CRC32 Same as CRC32 in Telecom
sha Same as sha in Security
blowfish Same as blowfish in Security
Security Domain
Program Functions
Blowfish
encrypt/
decrypt
Blowfish is a symmetric block cipher with a variable
length key. Since its key length can range from 32 to
448 bits, it is ideal for domestic and exportable
encryption.
sha A secure hash algorithm that produces a 160-bit
message digest for a given input; used in the secure
exchange of cryptographic keys and for generating
digital signatures. It is also used in the well-known MD4
and MD5 hashing functions.
Rijndael
encrypt/
decrypt
A block cipher with the option of 128-, 192-, and 256-bit
keys and blocks.
Telecommunications Domain
Program Functions
CRC32 Performs a 32-bit Cyclic Redundancy Check (CRC) on
a file. Useful to detect errors in data transmission.
FFT Performs a Fast Fourier Transform and its inverse
transform on an array of data. Fourier transforms are
useful in digital signal processing to find the frequencies
contained in a given input signal.
ADPCM
encode/
decode
Adaptive Differential Pulse Code Modulation; takes 16-
bit linear PCM samples and converts them into 4-bit
samples, yielding a compression rate of 4:1. ADPCM is
a variation of the well-known standard Pulse Code
Modulation (PCM).
GSM encode/
decode
Global System for Mobile communications. A standard
for voice encoding/decoding data streams. It uses a
combination of Time- and Frequency-Division Multiple
Access (TDMA/FDMA) to encode/decode data streams.
3.2 MEDIABENCH BENCHMARKS
The MediaBench suite is composed of multimedia applications
drawn from image processing, communications and DSP domains.
Introduced in 1997, MediaBench 1 was designed as a representative
workload for emerging multimedia and communications systems. It included
applications written in C, ranging from image and video coding, to audio
and speech processing, and even encryption and computer graphics.
The original MediaBench suite had 11 application packages
covering six media areas: video, image, graphics, audio, speech, and
security. Many of these applications are unoptimised versions derived from
open-source programs that were not designed for the embedded domain.
The video benchmark, MPEG-2, covered encoding and decoding of video
sequences. The audio area was covered by ADPCM for encoding and
decoding audio streams. The image media type was represented by three
applications: JPEG, EPIC and Ghostscript. The first two are for coding
standard colour images and Ghostscript is for PostScript transcoding.
The speech area had three applications: GSM, G.721 and Rasta.
The first two are for encoding speech and the third is a speech
recognition application. Security is covered by two applications, PGP
and pegwit, for encrypting and decrypting messages. Computer graphics
is covered by Mesa, a set of computer graphics libraries similar to
OpenGL, which included three demo programs as the graphics benchmarks.
MediaBench 2 is an upgrade of the MediaBench suite with some new
applications. These additions are needed only when evaluating the
performance of a processor or the dynamic behaviour of an application.
Since neither is an objective of this research work, the benchmarks of
MediaBench 1 meet the requirement. A brief description of the selected
applications in the MediaBench suite is given in Table 3.2.
Table 3.2: MediaBench Benchmarks
Program Functions
JPEG JPEG (pronounced "jay-peg") is a standardized compression
method for full-colour and gray-scale images. This package
contains C software to implement JPEG image compression
and decompression. JPEG is lossy, meaning that the output
image is not exactly identical to the input image. Two
applications are derived from the JPEG source code; cjpeg
does image compression and djpeg does decompression based
on the ISO JPEG standard for image compression. The source code was
produced by the Independent JPEG Group. JPEG is intended for
compressing "real-world" scenes; line drawings, cartoons and other
non-realistic images are not its strong suit.
MPEG A dominant standard for high quality digital video transmission.
The important computing kernel is a discrete cosine transform
for coding and the inverse transform for decoding. The two
applications used are mpeg2enc and mpeg2dec for encoding
and decoding respectively. mpeg2play is a player for MPEG-1
and MPEG-2 video bit streams. It is based on mpeg2decode by
the MPEG Software Simulation Group. In mpeg2decode, the
emphasis is on correct implementation of the MPEG standard
and comprehensive code structure. The latter is not always
easy to combine with high execution speed. Therefore a version
has been derived which is optimized for higher decoding and
display speed at the cost of a less straightforward
implementation and slightly non-compliant decoding. In
addition, all conformance checks and some fault recovery
procedures have been omitted from mpeg2play.
GSM An implementation of the European GSM 06.10 provisional
standard for full-rate speech transcoding, prI-ETS 300 036,
which uses RPE/LTP (residual pulse excitation/long term
prediction) coding at 13 kbit/s. GSM 06.10 compresses frames
of 160 13-bit samples (8 kHz sampling rate, i.e. a frame rate of
50 Hz) into 260 bits; for compatibility with typical UNIX
applications, this implementation turns frames of 160 16-bit
linear samples into 33-byte frames (1650 Bytes/s). The quality
of the algorithm is good enough for reliable speaker recognition;
even music often survives transcoding in recognizable form
(given the bandwidth limitations of 8 kHz sampling rate).
G.721 The files in this package comprise ANSI-C language reference
implementations of the CCITT (International Telegraph and
Telephone Consultative Committee) G.711, G.721 and G.723
voice compression algorithms. They have been tested on Sun
SPARCstations and passed 82 out of 84 test vectors published
by CCITT (Dec. 20, 1988) for G.721 and G.723. [The two
remaining test vectors, which the G.721 decoder
implementation for u-law samples did not pass, may be in error
because they are identical to two other vectors for G.723_40.]
This source code is released by Sun Microsystems, Inc. to the
public domain.
PEGWIT Pegwit is a program for performing public key encryption and
authentication. It uses an elliptic curve over GF(2^255), SHA-1
for hashing, and the symmetric block cipher Square.
EPIC EPIC (Efficient Pyramid Image Coder) is an experimental image
data compression utility written in the C programming language.
The compression algorithms are based on a biorthogonal
critically-sampled dyadic wavelet decomposition and a
combined run-length/Huffman entropy coder. The filters have
been designed to allow extremely fast decoding on conventional
(i.e., non-floating point) hardware, at the expense of slower
encoding and a slight degradation in compression quality (as
compared to a good orthogonal wavelet decomposition).
ADPCM Adaptive Differential Pulse Code Modulation (ADPCM) is one of
the simplest and oldest forms of audio coding. It is a family of
speech compression and decompression algorithms. A
common implementation takes 16-bit linear PCM samples and
converts them to 4-bit samples, yielding a compression rate of
4:1. The ADPCM code used is the Intel/DVI ADPCM code
which is being recommended by the IMA Digital Audio
Technical Working Group. But this is NOT a CCITT G722 coder.
The CCITT ADPCM standard is much more complicated,
probably resulting in better quality sound but also in much more
computational overhead.
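The GSM 06.10 figures quoted in Table 3.2 can be cross-checked with a short calculation; the constants below are taken directly from the table entry itself (8 kHz sampling, 160-sample frames, 33-byte packed output, 260 coded bits per frame).

```c
/* Rates implied by the GSM 06.10 figures quoted in Table 3.2. */
static int gsm_frame_rate(void) { return 8000 / 160; }         /* 50 frames/s  */
static int gsm_byte_rate(void)  { return (8000 / 160) * 33; }  /* 1650 bytes/s */
static int gsm_bit_rate(void)   { return (8000 / 160) * 260; } /* 13000 bit/s  */
```

The 13 kbit/s coding rate and the 1650 bytes/s figure in the table both follow from the 50 Hz frame rate.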
3.3 MIMEDIA BENCHMARK SUITE
In order to explore the strengths and weaknesses of the MIPS32 ISA
for embedded applications, a composite benchmark package named
MiMedia has been created with selected benchmarks from MiBench and
MediaBench suites. Certain benchmarks such as jpeg and gsm are
common to MiBench and MediaBench suites. Certain other benchmarks
such as mad, sphinx, PGP, Ghostscript, Rasta and Mesa have been
dropped because of errors encountered during downloading or
cross-compilation.
Since the goal of the research work is reducing memory size
occupied by the programs and not execution of the programs, there is no
need for finding dynamic instruction distribution or execution time of the
benchmarks. For the same reason, all the subprograms of a benchmark
application can be grouped into a single package. Hence a
composite suite has been created from the benchmarks of MiBench and
MediaBench avoiding duplication but including a variety of applications.
However, it has been decided to drop the small data set versions and
include only big data set versions. Table 3.3 presents the 23 applications
that have been grouped under the MiMedia suite. These have been
mapped into eight domains: Automotive and industrial control, Network,
Video, Audio, Image, Speech, Security, and Text. The susan benchmark
has been included in two application areas: Automotive and Industrial
Control, and Image.
Table 3.3: Embedded Applications for MiMedia Suite
(Columns: Embedded Domain; Application Name; Source Benchmark Suite;
Programs; Object code size in bytes)
Automotive and Industrial Control
basicmath MiBench basicmath (large) 4984
bitcount MiBench bitcnts 4268
Qsort MiBench Qsort (large) 1944
susan MiBench susan 51000
Network dijkstra MiBench Dijkstra (large) 463144
patricia MiBench Patricia 463744
CRC32 MiBench CRC32 461500
Video MPEG2 MediaBench 1.Mpeg2encode
2.Mpeg2decode
1115208
Audio
ADPCM MediaBench 1. rawcaudio:
coder (encoder)
2. rawdaudio:
decoder
3. timing: test timer
for both coder and
decoder
1384008
lame MiBench lame 223892
Image JPEG MediaBench 1. Cjpeg; coder
2. djpeg; decoder
3. jpegtran;
lossless transcoding
between different
JPEG file formats.
4. rdjpgcom;
displays the text in
COM (comment)
markers in a JFIF
file
5. wrjpgcom;
inserts user-
supplied text as a
COM (comment)
marker in a JFIF file
225744
Image EPIC MediaBench 1. epic; does
compression
2. unepic; does
decompression
972100
fft MiBench fft 498640
(susan) MiBench (susan) 51000
Speech
GSM MiBench 1. Toast; encoder
2. untoast;
decoder
1019216
G721 MediaBench 1. encode; Voice
encoder
2. decode; Voice
decoder
942232
rsynth MiBench say 26224
Security
pegwit MediaBench Pegwit hashing 510632
sha MiBench sha 4160
blowfish MiBench bf 466604
rijndael MiBench rijndael 476464
Text
typeset MiBench lout 505252
stringsearch MiBench stringsearch 462296
ispell MiBench ispell 48320
3.4 TYPICAL BEHAVIOUR OF EMBEDDED APPLICATIONS
The MiMedia benchmarks were cross-compiled on an Intel PC and the
compiler output was analysed using the custom-built tool suite MIDACC,
an offline code analyser and converter tool suite. Executable
binaries for MIPS processors were generated using a cross compiler
running under Linux on an Intel platform. Among the many cross
compilers available on the internet, Sourcery CodeBench was chosen
because its 'Lite' version is free for developers and academic use.
It consists of a set of tools: compiler, linker, object dumping
tools, library archiver, etc. Although the GNU C Compiler (GCC) can
be used for cross compilation, the Sourcery CodeBench tool chain,
which is specially built for embedded systems, produces optimized
object code and executables. The compiler can produce object code
and executables for all MIPS ISA revisions (MIPS I, MIPS II, etc.),
including MIPS32 and MIPS64 for 32-bit and 64-bit processors, and
can also cross compile C programs for other RISC processors such as
R1000 and mk4. The MIPS32 option has been chosen for this research
work.
The MIDACC suite has two tools: the MIDA, a MIPS code analyser
and the MICC, a code converter. The results from MICC are discussed in
Chapters 4 and 5. This chapter focuses on MIDA. Given a MIPS32 object
code, the MIDA profiles the code and produces various statistics for the
given application program as follows:
1. Object code size
2. Frequency of each instruction class
3. Frequency of the 66 integer instructions
4. Usage pattern for offset and immediate values
5. Number of bytes wasted in underutilized fields of offset and
immediate
6. Frequency of usage of branch instructions
7. Usage pattern for GPRs
8. Usage pattern for shift amount in shift instructions
9. Number of bytes wasted due to redundant zeroes in the
instructions
Apart from the above nine aspects covered by MIDA, it was
decided at a later stage to carry out additional analysis of the
embedded codes to estimate the scope for introducing composite
instructions and eliminating avoidable duplicate information
within certain instructions. An extension to
MIDACC was developed for this purpose and it is discussed in Chapter 6.
This tool, named MIDACC Extender, also performs certain other
functions discussed in Chapter 6.
Appendix 1 describes the structure of the tool MIDA and Appendix 2
lists the sample outputs of MIDA for selected embedded applications.
Analysis of MIPS object codes using MIDA reveals several interesting
behaviours of embedded applications, as discussed below. Though
experiments have been carried out with all the subprograms of the
benchmarks, the discussion below omits certain unimportant and short
programs. Similarly, when two or more similar subprograms are included
in an application, only one of them is discussed.
1. None of the embedded applications uses all 66 integer instructions.
The number of unused instructions varies from 8 to 40. The following
eight instructions are not used by any application: ADD, ADDI, SUB,
BGEZAL, BLTZAL, BLTZ, MTHI and REF. Some benchmarks use barely half the
number of instructions. For instance, qsort uses only 26 instructions. On
the other extreme, programs such as mpeg2 and fft, use 58 instructions.
Figure. 3.1 depicts the extent of unused and used instructions by the
embedded codes of the 23 benchmarks. As a majority trend, eleven
benchmarks use 55 or 56 instructions. Figure. 3.2 presents typical
distribution of utilized and unutilized instructions in the eight embedded
domains. The video segment uses highest number of instructions and the
Automotive segment uses the lowest number of instructions.
Figure. 3.1: Utilized and unutilized instructions in Embedded codes
Figure. 3.2: Distribution of utilized and unutilized instructions in
Embedded domains
2. Every application uses only a limited set of instruction categories.
Several instructions are used very sparingly, and any given program is
mostly made up of only 5 to 7 types of instructions. Five common
instructions used liberally by the 23 programs are LW, SW, ADDIU,
ADDU and BEQ. LW is the only instruction used more than 10% in all
programs. Some benchmarks use certain specific instructions in plenty due
to the nature of operations. For instance, only susan uses LBU more than
5%. Likewise, only sha uses SB more than 10% and qsort uses JR more
than 5%. Figure. 3.3 illustrates how the 66 integer instructions are
populated in the 23 benchmarks. All three benchmarks in the Network
segment follow the same pattern. On the other hand, each of the three
benchmarks in the Text segment exhibits a distinct behaviour. There is
a near uniform figure for the three highly used instruction groups,
whereas for the other two groups of 0% and 1%, the figures are
scattered. Figure. 3.4
presents typical distribution of instruction density in the eight embedded
domains. There is uniformity among all the eight embedded segments
when it comes to the 5% and above cases.
Figure. 3.3: Frequency of integer instructions in Embedded codes
Figure. 3.4: Frequency of instructions usage in Embedded domains
3. A glance at the instruction counts in the 23 benchmarks gives
interesting information. In the majority of the benchmarks, the same
sets of instructions are heavily used. Four instructions - addu, addiu,
lw and sw - dominate all the benchmarks, and together they form a
major portion of each benchmark. These Frequently used Top Four
Instructions (FTFI) consume as much as 67% of the embedded codes.
Seventeen benchmarks have an FTFI of around 60%; even the lowest
figure is 42%. Figure. 3.5 shows the variation of FTFI in the 23
benchmarks. Only two benchmarks, basicmath and sha, have FTFI below
50%. Figure. 3.6 shows the typical behaviour of FTFI for the eight
embedded segments using the geometric mean values of FTFI. The FTFI is
lowest for the Security segment and highest for the Image segment.
Applying the 80-20 rule, any technique to improve the density of these
four instructions will drastically reduce the code sizes of embedded
programs.
Figure. 3.5: Population of FTFI in Embedded codes
Figure. 3.6: Distribution of FTFI in Embedded domains
4. Distribution of immediate values: The size of immediate values
affects instruction length. The majority of the immediate values are positive
as reported by Hennessy and Patterson [3] for the SPEC benchmarks. As
per their study, small immediate values are heavily used and large
immediate values are sometimes used mostly in addressing calculations.
Further, an 8-bit immediate can capture about 50% of the cases and 16
bits about 80%. The experiments with embedded benchmarks on the MIPS
processor show interesting behaviour. The 16-bit immediate field is
heavily underutilized by embedded benchmarks, as shown in Figure. 3.7.
Except for two benchmarks, rsynth and typeset, the other 21 programs
need the full 16 bits in less than 10% of the cases. Further, most
benchmarks need the full 16 bits in only 0% to 5% of the cases. The
benchmarks of the Automotive
applications and the Speech segments are in the two extreme ends but
within a short range as shown in Figure. 3.8.
Figure. 3.7: Usage of full 16 bit immediate by Embedded codes
Figure. 3.8: Trends in usage of 16 bit immediate in Embedded
domains
5. As per Hennessy and Patterson's analysis [3] with SPEC
benchmarks, displacement values are widely distributed. There are both a
large number of small values and a fair number of large values. The factors
contributing to the wide distribution of displacement values are multiple
storage areas for variables and different displacements to access them
apart from the overall addressing scheme used by the compiler. The
analysis shows that embedded applications use the full 16 bits of the
offset field very rarely, as shown in Figure. 3.9. In fact, 16
benchmarks never need more than 15 bits. Even the remaining programs
need the full 16 bits in at most 3% of the cases. The overall
behaviour of the embedded segments in this aspect is shown in
Figure. 3.10. The Network and Speech segments are satisfied with 15
bits of offset. The worst case is the Image segment, with 2% of cases
using more than a 15-bit offset.
Figure. 3.9: Extent of usage of full 16 bit offset by embedded codes
Figure. 3.10: Trends in usage of 16 bit offset field in Embedded
domains
6. The extent of memory Wastage In Immediate and Offset fields
(WASTIO) in embedded object codes amounts to a significant figure. The
underutilization of these two fields due to redundant 0's is classified
into four types, a, b, c and d, according to the four combinations of
wastage in the object code as defined in Table 3.4, giving
WASTIO = 2a + b + c (in bytes). Programs with higher values of d and
lower values of a, b and c waste less memory on redundant zeroes. The
WASTIO percentage is calculated as: WASTIO percentage = 100 x
(WASTIO / object code size). The extent of wastage due to
underutilization of the offset and immediate fields varies from 8% to
16% of the code size for the embedded applications as shown in
Figure. 3.11.
Table 3.4: Four types of offset / immediate byte patterns
Type Offset / immediate bytes
a All 16 bits are 0's
b One byte wastage due to all zeroes in the least significant byte
c One byte wastage due to all zeroes in the most significant byte
d No wastage; both bytes have non zero value
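The classification of Table 3.4 and the WASTIO = 2a + b + c total can be sketched as follows; the array-of-fields input is an assumption made for the example, not MIDA's internal representation.

```c
#include <stdint.h>
#include <stddef.h>

/* Classifies each 16-bit offset/immediate field per Table 3.4 and totals
   the wasted bytes: type a wastes 2 bytes, types b and c waste 1 each. */
static unsigned wastio_bytes(const uint16_t *fields, size_t n)
{
    unsigned a = 0, b = 0, c = 0;
    for (size_t i = 0; i < n; i++) {
        uint8_t lo = fields[i] & 0xFF;
        uint8_t hi = fields[i] >> 8;
        if (lo == 0 && hi == 0) a++;   /* type a: all 16 bits are 0      */
        else if (lo == 0)       b++;   /* type b: least significant byte 0 */
        else if (hi == 0)       c++;   /* type c: most significant byte 0  */
        /* type d: both bytes non-zero, no wastage */
    }
    return 2 * a + b + c;              /* WASTIO in bytes */
}
```

The WASTIO percentage is then 100 * wastio_bytes(...) / object_code_size.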
Twenty benchmarks have a WASTIO percentage of either 11 or 12.
Figure. 3.12 shows the typical behaviour of WASTIO for the eight
embedded segments using the geometric mean values of WASTIO. Four
segments have an equal WASTIO percentage, and the variation among the
eight embedded segments is only from 10 to 13. In order to give an
idea of the number of bytes wasted in the immediate and offset
fields, Table 3.5 compares the largest programs in each application domain
of MiMedia. Though the video program mpeg2 is the largest embedded
benchmark of MiMedia, the number of bytes wasted in offset and
immediate fields is higher for the Text benchmark, typeset. Figure. 3.13
depicts the distribution of WASTIO percentage in the three types a, b and
c. Except for the Text, the b component of WASTIO is zero for the other
embedded domains.
Figure. 3.11: WASTIO Percentages in Embedded applications
Figure. 3.12: Extent of WASTIO in Embedded domains
Table 3.5: Trends in Embedded Applications: WASTIO components
Embedded Domain | Largest Application | Code size (bytes) | WASTIO % | Number of bytes wasted | a % | b % | c %
Automotive and Industrial Control | susan | 51000 | 11 | 5610 | 4 | 0 | 7
Network | patricia | 463744 | 12 | 55649 | 3 | 0 | 9
Video | MPEG2 | 578880 | 11 | 63677 | 2 | 0 | 9
Audio | ADPCM | 460884 | 12 | 55306 | 3 | 0 | 9
Image | fft | 498640 | 11 | 54850 | 2 | 0 | 9
Speech | GSM | 509608 | 11 | 56057 | 2 | 0 | 9
Security | pegwit | 510632 | 12 | 61276 | 3 | 0 | 9
Text | typeset | 505252 | 15 | 75788 | 3 | 1 | 11
Figure. 3.13: WASTIO distribution in Embedded domains
7. Usage of general-purpose registers (GPRs): MIPS32 has 32
32-bit GPRs that are also known as integer registers. Reducing the
number of registers may pose problems for the compiler during register
allocation, thereby impacting code speed. It has been reported by
Hennessy and Patterson [3] that at least 16 registers are essential
for the graph colouring technique used by register allocation
algorithms. The analysis of embedded codes establishes that having
just 16 registers would be inefficient. The
frequency of usage of more than 16 registers by the embedded codes is
shown in Figure. 3.14. Only one benchmark, susan, needs more than 16
registers in just 5% of the cases. The highest requirement is by JPEG that
needs more than 16 registers in 59% of the cases.
Figure. 3.14: Usage of more than 16 registers by Embedded
applications
The overall behavior of embedded segments in this aspect is shown
in Figure. 3.15. The Audio, Speech and Security segments have worst case
requirements at the level of either 45% or 46%. The Automotive segment
has the minimum requirement at 24%.
Figure. 3.15: Usage of more than 16 registers in Embedded domains
8. Shift amount: MIPS32 uses 5 bits for specifying the shift amount,
allowing shifts of up to 31 bit positions in one operation. The
analysis of embedded codes reveals that in the majority of cases the
shift amount is less than or equal to 16 bits, as illustrated in
Figure. 3.16. However, about 13 benchmarks need more-than-16-bit
shifts in 10% to 13% of the cases. While qsort does not need more than
16-bit shifts, bitcnts needs them in 36% of the cases. The frequency
of usage of
more than 16-bit shifts by the embedded codes is shown in Figure. 3.17.
The Security segment has maximum requirement whereas Automotive and
Image segments have very low requirements for more than 16 bit shifts.
Figure. 3.16: Usage of more than 16 bit shifts in Embedded
applications
Figure. 3.17: Frequency of more than 16 bit shifts in Embedded
domains
9. The extent of redundant LOAD and STORE instructions is
estimated by RMA analysis of the code. Further discussion on this is
presented in Chapter 5.
10. The extent of branch instructions is estimated by the tool MIDA,
even though such data is useful mainly for dynamic simulation.
However, the objective of obtaining this data is to get some idea of
the code space occupied by the branch instructions. The analysis of
embedded codes reveals that branch instructions together take up 3% to
11% of the code space as illustrated in Figure. 3.18. In thirteen
benchmarks, the branch instructions occupy 11% of the code. The segment
wise code space required by branch instructions is shown in Figure. 3.19.
The Automotive segment has the lowest share, with branch instructions
occupying 4% of the code, whereas the Network segment has the highest
at 11%.
Figure. 3.18: Extent of branch instructions in Embedded applications
Figure. 3.19: Usage of branch instructions in Embedded domains
11. The wastage of code space due to presence of Redundant
Zeroes (RZ) in the instructions is estimated and found to vary from 5% to
13% as illustrated in Table 3.6. Three benchmarks, lame, sha and lout,
have the least RZ of 5%. The qsort has the highest RZ of 13% of code. The
RZ and WASTIO together give an idea of the extent of unused bits in the
embedded codes. In addition, the ratio between Load/Store and ALU type
instructions, LSI/ALUI, is an important measure of code bloating in
embedded codes. This aspect will be discussed in Chapter 5. The three
major parameters contributing to code bloat factor (CBF) are WASTIO,
FTFI, and RZ. Table 3.6 compares these parameters of 23 benchmarks.
One may conclude that the CBF is a measure of the extent of code
reduction possible. A discussion on this is presented in Chapter 4 along
with the analysis of the results for code size reduction.
Table 3.6: Code bloat factors for Embedded object codes for MIPS32
Embedded Domain | Benchmark | % RZ | % WASTIO | FTFI
Automotive and Industrial Control | basicmath | 12 | 9 | 46
 | bitcount | 12 | 14 | 59
 | qsort | 13 | 11 | 56
 | susan | 8 | 17 | 67
Network | dijkstra | 7 | 13 | 60
 | patricia | 7 | 13 | 60
 | CRC32 | 7 | 13 | 60
Video | MPEG2 | 7 | 13 | 59
Audio | ADPCM | 7 | 13 | 60
 | lame | 5 | 12 | 57
Image | JPEG | 11 | 14 | 58
 | EPIC | 7 | 13 | 60
 | fft | 7 | 13 | 59
 | (susan) | 8 | 17 | 67
Speech | GSM | 7 | 13 | 59
 | G721 | 7 | 13 | 60
 | rsynth | 11 | 11 | 50
Security | pegwit | 7 | 13 | 59
 | sha | 5 | 17 | 42
 | blowfish | 7 | 13 | 60
 | rijndael | 7 | 13 | 60
Text | typeset | 5 | 16 | 63
 | stringsearch | 7 | 13 | 60
 | ispell | 7 | 12 | 53
In addition to the above-mentioned 11 aspects, there are certain
other code analysis functions provided by the MIDACC Extender, as
mentioned earlier. These are discussed in Chapter 6.
3.5 CONCLUSIONS
This chapter provides an analysis of the MIPS object codes of 23
embedded benchmarks from MiBench and MediaBench to understand the
behavior of embedded applications and determine the strategy for
minimising the code size using MIDA, the custom built code analyzer for
MIPS32 object codes.
As already mentioned, the MIDACC developed as part of the
research work acts as a standalone software tool for both MIPS32 code
analysis and for evaluating the new ISA for MIPS32 and measuring the
code size reduction. Since simulation of a new ISA is involved, using
any existing simulator for this purpose would be a complex process, as
extensive modifications would be required to conduct the desired type
of code analysis. The objective of this research is not execution of
embedded programs but only analysing the behaviour of embedded
applications and measuring static code sizes of HIE-MIPS for various
embedded applications and comparing with static code sizes of MIPS32.
Hence a decision was taken to develop an offline tool suite that can do
both code analysis and conversion of the object codes of MIPS32 into
object codes of new ISA. The architecture of the tool and the strategies
followed in implementation are discussed in Appendix 1. Appendix 2
provides a user guide to MIDACC and sample results of MIDACC.
In the next chapter, two different design strategies for minimizing the
WASTIO and reducing unused zeroes in the opcodes are discussed. Both
the options are evaluated using the MIDACC.
4. HYBRID INSTRUCTION ENCODING FOR RISC CORES
This chapter proposes an ISA-level technique for reducing the
average instruction size so as to minimize the embedded object code
generated by the RISC compiler. This approach relieves embedded system
developers of the burden of incorporating external static code
compression and dynamic decompression mechanisms in each new product,
thereby saving on product development cost and reducing
time-to-market. Achieving this requires developing a new type of RISC
processor as well as the entire tool chain, but the investment is
worthwhile in view of the growing embedded processor market.
The Fixed Instruction Encoding (FIE) used in RISC processors
enables simpler instruction decoding and easy pipeline design. But FIE
increases the object code size, as some fields are either unused or
underutilized in several instructions. Since all instructions have to
be of uniform length, many redundant zeroes are inserted in several
instructions to maintain the 32-bit instruction length. Further,
considerable memory wastage occurs due to underutilization of the
immediate and offset fields in
the instructions. This chapter proposes replacement of FIE with Hybrid
Instruction Encoding (HIE) with two modifications to RISC Architecture:
multiple instruction sizes, and hybrid lengths for the offset and immediate
fields. The provision for multiple instruction sizes minimizes unused fields in
most instructions thereby reducing code size. Similarly, allowing hybrid
lengths for the offset and immediate fields minimizes wastage of bits in
these fields.
This chapter deals with the design of two different versions of HIE
for the MIPS processor and estimates the resulting code size
reduction. The HIE1 version limits the number of general-purpose
registers to 16, enabling several instructions to be shortened by one
byte. The HIE2 version, on the other hand, reduces the maximum length
of the offset/immediate fields to 15 bits. To help estimate the code
saving in the proposed architecture, both
the HIE versions have been designed as a modification to MIPS32 ISA. For
each of the 66 integer instructions of MIPS32, an equivalent HIE instruction
has been designed for both versions. This chapter discusses the designs of
both versions and the code size reduction achieved. Further, the
modifications required in the processor micro architecture to support the
HIE versions are reviewed.
4.1 MIPS ISA AND CODE WASTAGE
The early MIPS architectures were 32-bit, with 64-bit versions added
later. Multiple revisions of the MIPS instruction set exist, including MIPS I,
MIPS II, MIPS III, MIPS IV, MIPS V, MIPS32, and MIPS64. The current
versions are MIPS32 and MIPS64. MIPS32 supports only 32-bit data and
addresses, whereas MIPS64 supports 64-bit data and addresses. There
are also
two special versions, MIPS16 and microMIPS, targeting embedded
applications. The term MIPS is used liberally to mean MIPS32, the target
processor chosen for the research work. The term MIPS R2000 is used
when referring to specific features of the MIPS version.
4.1.1 MIPS Instruction Set
The instruction set of MIPS [11, 27] consists of a variety of basic
instructions such as
21 arithmetic instructions
8 logic instructions
8 bit manipulation instructions
12 comparison instructions
25 branch/jump instructions
15 load instructions
10 store instructions
8 move instructions
4 miscellaneous instructions
4.1.2 MIPS Instruction Format
The instruction formats of MIPS can be classified into three broad
categories as R-Type (Register), I-Type (Immediate) and J-Type (Jump) as
shown in Table 4.1. The R-type instructions perform ALU operations with
two register sources and one register destination address. The I-type
instructions perform load, store, and ALU operations with an immediate
operand.
Table 4.1: MIPS Instruction Formats
Format type | Bits 31-26 | Bits 25-21 | Bits 20-16 | Bits 15-11 | Bits 10-6 | Bits 5-0 | Nature of operations
R | op | rs | rt | rd | sa | opx (fn) | arithmetic operations
I | op | rs | rt | offset / immediate (bits 15-0) | transfer, branch, immediate operations
J | op | target (bits 25-0) | jump operation
Table 4.2 defines the different fields in the instructions. The J-type
instructions perform unconditional branching to the target address.
There are also conditional branch instructions among the I-type
instructions; these use a signed 16-bit instruction offset field.
Hence they can jump 2^15 - 1 instructions (not bytes) forward or 2^15
instructions backwards.
Table 4.2: MIPS Instruction Fields
Field Purpose
Op a 6-bit operation code
Rs a 5-bit source register specifier
Rt a 5-bit target (source/destination) register or branch
condition
immediate a 16-bit immediate, branch displacement or address
displacement
target a 26-bit jump target address
Rd a 5-bit destination register specifier
Sa a 5-bit shift amount
opx/fn a 6-bit operation code extension (function) field
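The fields of Table 4.2 can be extracted from a raw 32-bit instruction word with simple shifts and masks; the struct layout below is just one convenient way to hold them.

```c
#include <stdint.h>

/* Extracts the Table 4.2 fields from a raw 32-bit MIPS instruction word. */
struct mips_fields {
    uint8_t  op, rs, rt, rd, sa, fn;
    uint16_t imm;       /* offset / immediate (I-type) */
    uint32_t target;    /* 26-bit jump target (J-type) */
};

static struct mips_fields decode(uint32_t w)
{
    struct mips_fields f;
    f.op     = (w >> 26) & 0x3F;       /* bits 31-26 */
    f.rs     = (w >> 21) & 0x1F;       /* bits 25-21 */
    f.rt     = (w >> 16) & 0x1F;       /* bits 20-16 */
    f.rd     = (w >> 11) & 0x1F;       /* bits 15-11 */
    f.sa     = (w >>  6) & 0x1F;       /* bits 10-6  */
    f.fn     =  w        & 0x3F;       /* bits 5-0   */
    f.imm    =  w        & 0xFFFF;     /* bits 15-0  */
    f.target =  w        & 0x3FFFFFF;  /* bits 25-0  */
    return f;
}
```

For example, the word 0x24620064 encodes ADDIU $2, $3, 100 (op 9, rs 3, rt 2, immediate 100).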
MIPS ISA uses a fixed instruction length of 32 bits with 6 bits allotted
to the opcode. This provides only 64 opcodes, which is insufficient
for the number of desired instructions. To resolve this problem, MIPS
supports variable-length operation codes within a fixed-length
instruction by using
expanded opcodes. The most frequently used operations are directly
encoded in the 6-bit opcode, while a small set of the 64 possible codes are
reserved as escape codes that require decoding of more bits in the
instruction to obtain the full opcode. This technique of expanded opcodes
enriches the instruction set but limits the length of the instruction. Figure.
4.1 provides instruction map of MIPS R2000. Operation codes marked with
a dagger cause reserved instruction exceptions and these are reserved for
later versions of MIPS architecture. The opcode bits are 26-31 and the
initial decoding of the opcode is shown at the top of the figure. As an
example, for the opcode 000011, the instruction is JAL, a jump and link,
instruction. If the opcode is 000000, the special instructions are invoked
with bits 0-5 of the instruction. These 6 bits, found in the R format only, are
decoded in the SPECIAL map. BCOND is expanded with bits 16-20 of the
instruction while COP0 is expanded with bits 0-4. COP1, 2, and 3 are
expanded with bits 16-25. Table 4.3 identifies the actions performed by the
133
integer instructions of MIPS R2000 [27]. In addition, there are floating-point
instructions that are not included here since it is beyond the scope of this
thesis. The MIPS has a floating-point coprocessor (numbered 1) that
operates on single precision and double precision floating-point numbers.
The coprocessor has its own registers.
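The two-level escape-code decoding described above can be sketched as a pair of table lookups. The dictionaries below are illustrative and cover only a handful of entries of Figure. 4.1; the function name is our own.

```python
# Primary opcode map (bits 31-26); opcode 0 escapes to the SPECIAL
# map, which is indexed by the 6-bit function field (bits 5-0).
OPCODES = {0x02: "j", 0x03: "jal", 0x04: "beq", 0x08: "addi"}
SPECIAL = {0x20: "add", 0x22: "sub", 0x08: "jr", 0x0C: "syscall"}

def mnemonic(word):
    """Resolve a 32-bit word to a mnemonic using expanded opcodes."""
    op = (word >> 26) & 0x3F
    if op == 0:                      # escape code: decode bits 5-0 too
        return SPECIAL.get(word & 0x3F, "reserved")
    return OPCODES.get(op, "reserved")
```

Unmapped entries fall through to "reserved", mirroring the daggered slots of the instruction map.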
Opcode map (rows: bits 31-29; columns: bits 28-26):

         0        1       2      3      4      5      6      7
  0   SPECIAL  BCOND    J      JAL    BEQ    BNE    BLEZ   BGTZ
  1   ADDI     ADDIU    SLTI   SLTIU  ANDI   ORI    XORI   LUI
  2   COP0     COP1     COP2   COP3   +      +      +      +
  3   +        +        +      +      +      +      +      +
  4   LB       LH       LWL    LW     LBU    LHU    LWR    +
  5   SB       SH       SWL    SW     +      +      SWR    +
  6   LWC0     LWC1     LWC2   LWC3   +      +      +      +
  7   SWC0     SWC1     SWC2   SWC3   +      +      +      +

SPECIAL map (rows: bits 5-3; columns: bits 2-0):

         0      1      2      3      4        5      6      7
  0   SLL    +      SRL    SRA    SLLV     +      SRLV   SRAV
  1   JR     JALR   +      +      SYSCALL  BREAK  +      +
  2   MFHI   MTHI   MFLO   MTLO   +        +      +      +
  3   MULT   MULTU  DIV    DIVU   +        +      +      +
  4   ADD    ADDU   SUB    SUBU   AND      OR     XOR    NOR
  5   +      +      SLT    SLTU   +        +      +      +
  6   +      +      +      +      +        +      +      +
  7   +      +      +      +      +        +      +      +

BCOND map (rows: bits 20-19; columns: bits 18-16):

         0        1
  0   BLTZ     BGEZ
  1
  2   BLTZAL   BGEZAL
  3

COPz map (rows: bits 22, 21, 16; columns: bits 25-23):

           0     1     2
  0,0,0   MF    MT    BCF
  0,0,1               BCT
  0,1,0
  0,1,1
  1,0,0   CF    CT

(+ marks operation codes that cause reserved instruction exceptions.)

Figure. 4.1: MIPS R2000 instruction map
Table 4.3: MIPS32 Integer instructions and actions
(Abbreviations used: R-Register; I-Immediate; O-Offset; T-Target address)

  No.  Name     Type                 Format  Action
  1    add      ALU                  R       Addition with overflow
  2    addu     ALU                  R       Addition without overflow
  3    addi     ALU                  I       Addition immediate with overflow
  4    addiu    ALU                  I       Addition immediate without overflow
  5    and      ALU                  R       Logical AND
  6    andi     ALU                  I       Logical AND of rs with zero-extended immediate
  7    div      ALU                  R       Division with overflow; quotient in lo, remainder in hi
  8    divu     ALU                  R       Division without overflow; results stored as for div
  9    mult     ALU                  R       Multiply; low- and high-order words of product in lo and hi
  10   multu    ALU                  R       Unsigned multiply; results stored as for mult
  11   nor      ALU                  R       Logical NOR
  12   or       ALU                  R       Logical OR
  13   ori      ALU                  I       Logical OR of rs with zero-extended immediate
  14   sll      ALU                  R       Shift left logical by the number of positions in sa
  15   sllv     ALU                  R       Shift left logical variable; count in rs
  16   sra      ALU                  R       Shift right arithmetic by the number of positions in sa
  17   srav     ALU                  R       Shift right arithmetic variable; count in rs
  18   srl      ALU                  R       Shift right logical by the number of positions in sa
  19   srlv     ALU                  R       Shift right logical variable; count in rs
  20   sub      ALU                  R       Subtract with overflow
  21   subu     ALU                  R       Subtract without overflow
  22   xor      ALU                  R       Logical exclusive OR
  23   xori     ALU                  I       Logical exclusive OR of rs with zero-extended immediate
  24   lui      CONMANIP             I       Load lower halfword of immediate into upper halfword of rt; clear the other bits of rt
  25   slt      Compare              R       Set less than
  26   sltu     Compare              R       Set less than unsigned
  27   slti     Compare              I       Set less than immediate
  28   sltiu    Compare              I       Set less than unsigned immediate
  29   bczt     Branch               O       Branch coprocessor z true
  30   bczf     Branch               O       Branch coprocessor z false
  31   beq      Branch               O       Branch on equal
  32   bgez     Branch               O       Branch if rs is greater than or equal to 0
  33   bgezal   Branch               O       Branch if rs is greater than or equal to 0; in addition, save (link) the address of the next instruction in R31
  34   bgtz     Branch               O       Branch on greater than 0
  35   blez     Branch               O       Branch if rs is less than or equal to 0
  36   bltzal   Branch               O       Branch if rs is less than 0; in addition, save (link) the address of the next instruction in R31
  37   bltz     Branch               O       Branch on less than 0
  38   bne      Branch               O       Branch on not equal
  39   j        Jump                 T       Unconditionally jump to the instruction at target
  40   jal      Jump                 T       Unconditionally jump to the instruction at target; link in R31
  41   jalr     Jump                 R       Unconditionally jump to the address in rs; link in rd
  42   jr       Jump                 R       Unconditionally jump to the address in rs
  43   lb       Load                 O       Load byte with sign extension
  44   lbu      Load                 O       Load byte without sign extension
  45   lh       Load                 O       Load halfword with sign extension
  46   lhu      Load                 O       Load halfword without sign extension
  47   lw       Load                 O       Load word
  48   lwcz     Load                 O       Load word into coprocessor register
  49   lwl      Load                 O       Load the left bytes of the word, at the possibly unaligned address, into rt
  50   lwr      Load                 O       Load the right bytes of the word, at the possibly unaligned address, into rt
  51   sb       Store                O       Store the low byte from rt
  52   sh       Store                O       Store the low halfword from rt
  53   sw       Store                O       Store the word from rt
  54   swcz     Store                O       Store the word from the coprocessor register
  55   swl      Store                O       Store the left bytes from rt at the possibly unaligned address
  56   swr      Store                O       Store the right bytes from rt at the possibly unaligned address
  57   mfhi     Data move            R       Transfer hi to rd
  58   mflo     Data move            R       Transfer lo to rd
  59   mthi     Data move            R       Transfer rs to hi
  60   mtlo     Data move            R       Transfer rs to lo
  61   mfcz     Data move            R       Transfer coprocessor register to rt
  62   mtcz     Data move            R       Transfer rt to coprocessor register
  63   syscall  Exception/Interrupt  R       System call
  64   break    Exception/Interrupt  R       Cause exception
  65   nop      Exception/Interrupt  -       Do nothing
  66   rfe      Exception/Interrupt  R       Return from exception
The register file of the MIPS R2000 architecture consists of thirty-two
32-bit registers, as shown in Figure. 4.2. These registers are used for
operands, results (both integer and floating point), and index registers. One
of the registers, R0, is always set to zero by the hardware, for use in
clearing a register, providing a zero constant, and supporting address
arithmetic. Two additional 32-bit registers support multiplication
(holding the double-length product) and division (holding the quotient and
the remainder). The program counter is a separately architected register,
unlike in certain processors such as ARM wherein one of the GPRs is
dedicated as the PC.
Like all other RISC processors, the MIPS R2000 follows a load/store
architecture. The load and store instructions use two memory accesses
(instruction fetch and operand fetch) and operate on signed and unsigned
bytes, halfwords (2 bytes) and words (4 bytes). Most instructions use the
three-address register-to-register format. The data types consist of the following:
single- and double-precision IEEE floating point
signed, 2's complement 8-, 16-, and 32-bit integers
unsigned 8-, 16-, and 32-bit integers
Figure. 4.2: MIPS R2000 registers
The MIPS R2000 memory is byte addressable. Thus the address for
a 16-bit integer ignores the LSB of the address. The 2 LSBs of the address
are ignored for 32-bit integers and single-precision floating-point data
types. The three LSBs of the address are ignored for double precision
floating-point data types. Binding of the addresses to opcodes is
accomplished in the instruction decoding hardware.
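The address-bit masking described above can be sketched in a few lines. This is an illustrative helper of our own, not part of the R2000 specification; it simply clears the low-order bits the hardware ignores.

```python
def effective_address(addr, size_bytes):
    """Clear the low-order address bits that MIPS R2000 ignores
    for an aligned access of the given size (1, 2, 4 or 8 bytes).

    A halfword access ignores the LSB, a word access the two LSBs,
    and a double-precision access the three LSBs.
    """
    return addr & ~(size_bytes - 1)
```

The mask works because each access size is a power of two, so `size_bytes - 1` has exactly the ignored low-order bits set.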
4.1.3 Wastage in MIPS32 Code
The general drawback of the fixed instruction size of RISC
architectures has been discussed in Chapters 1 and 2. The following
discussion is specific to MIPS32.
1. Several bits are unused in many instructions, as pointed out in
Chapter 2. Table 4.4 lists the opcodes of the 66 integer instructions, indicating
their formats and the number of redundant zeros. The R-type instructions
use a total of 12 bits (the OP and OPX fields) to specify the operation, though
there are at most 64 different R-type operations in the MIPS32 ISA. Since
MIPS32 has 32 GPRs, five bits are used to specify each register operand.
Leaving out the 12 bits (op and fn) for the operation, 20 bits remain
for the operand fields, whereas only 15 bits are needed. Hence five bits are
unused in R-type instructions, as illustrated in Figure. 4.3 for the and
instruction. If three more bits are eliminated from the operand fields, the instruction
length can be reduced to 24 bits. HIE1 recovers these three
bits by reducing each register field to four bits, as discussed in the next
section. In HIE2, the OPX field is either eliminated or replaced by a shorter
field.
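The bit accounting behind Figure. 4.3 can be verified with a few lines of arithmetic; the variable names below are our own illustration.

```python
# Bit budget of a three-register R-type MIPS32 instruction,
# mirroring Figure. 4.3 for the `and` instruction.
OP_BITS, FN_BITS = 6, 6              # 12 bits of operation encoding
REG_FIELD_BITS, NUM_REG_OPERANDS = 5, 3

used = OP_BITS + FN_BITS + REG_FIELD_BITS * NUM_REG_OPERANDS
unused = 32 - used                   # the sa field carries only zeros

# Shrinking each register field to 4 bits (the HIE1 approach) frees
# three more bits, allowing a byte-aligned 24-bit instruction.
hie1_length = OP_BITS + FN_BITS + 4 * NUM_REG_OPERANDS
```

The arithmetic confirms both figures quoted in the text: five redundant bits in the 32-bit form, and a 24-bit HIE1 form once the register fields shrink.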
Table 4.4: MIPS32 Instructions, opcodes and redundant zeros
(Abbreviations used: R-Register; I-Immediate; O-Offset; T-Target address;
RZ-Redundant zeros)

  No.  Name     OP (31-26)  OPX (5-0)  Type                 Format  RZ
  1    add      000000      100000     ALU                  R       5
  2    addu     000000      100001     ALU                  R       5
  3    addi     001000      -          ALU                  I       0
  4    addiu    001001      -          ALU                  I       0
  5    and      000000      100100     ALU                  R       5
  6    andi     001100      -          ALU                  I       0
  7    div      000000      011010     ALU                  R       10
  8    divu     000000      011011     ALU                  R       10
  9    mult     000000      011000     ALU                  R       10
  10   multu    000000      011001     ALU                  R       10
  11   nor      000000      100111     ALU                  R       5
  12   or       000000      100101     ALU                  R       5
  13   ori      001101      -          ALU                  I       0
  14   sll      000000      000000     ALU                  R       0
  15   sllv     000000      000100     ALU                  R       5
  16   sra      000000      000011     ALU                  R       0
  17   srav     000000      000111     ALU                  R       5
  18   srl      000000      000010     ALU                  R       0
  19   srlv     000000      000110     ALU                  R       5
  20   sub      000000      100010     ALU                  R       5
  21   subu     000000      100011     ALU                  R       5
  22   xor      000000      100110     ALU                  R       5
  23   xori     001110      -          ALU                  I       0
  24   lui      001111      -          CONMANIP             I       5
  25   slt      000000      101010     Compare              R       5
  26   sltu     000000      101011     Compare              R       5
  27   slti     001010      -          Compare              I       0
  28   sltiu    001011      -          Compare              I       0
  29   bczt     -           -          Branch               O       4
  30   bczf     -           -          Branch               O       4
  31   beq      000100      -          Branch               O       0
  32   bgez     000001      -          Branch               O       4
  33   bgezal   000001      -          Branch               O       0
  34   bgtz     000111      -          Branch               O       5
  35   blez     000110      -          Branch               O       5
  36   bltzal   000001      -          Branch               O       0
  37   bltz     000001      -          Branch               O       5
  38   bne      000101      -          Branch               O       0
  39   j        000010      -          Jump                 T       0
  40   jal      000011      -          Jump                 T       0
  41   jalr     000000      001001     Jump                 R       10
  42   jr       000000      001000     Jump                 R       16
  43   lb       100000      -          Load                 O       0
  44   lbu      100100      -          Load                 O       0
  45   lh       100001      -          Load                 O       0
  46   lhu      100101      -          Load                 O       0
  47   lw       100011      -          Load                 O       0
  48   lwcz     -           -          Load                 O       0
  49   lwl      100010      -          Load                 O       0
  50   lwr      100110      -          Load                 O       0
  51   sb       101000      -          Store                O       0
  52   sh       101001      -          Store                O       0
  53   sw       101011      -          Store                O       0
  54   swcz     -           -          Store                O       0
  55   swl      101010      -          Store                O       0
  56   swr      101110      -          Store                O       0
  57   mfhi     000000      010000     Data move            R       15
  58   mflo     000000      010010     Data move            R       15
  59   mthi     000000      010001     Data move            R       15
  60   mtlo     000000      010011     Data move            R       15
  61   mfcz     -           -          Data move            R-O     11
  62   mtcz     -           -          Data move            R-O     11
  63   syscall  000000      001100     Exception/Interrupt  R-O     20
  64   break    000000      001101     Exception/Interrupt  R-O     0
  65   nop      000000      -          Exception/Interrupt  -       26
  66   rfe      010000      100000     Exception/Interrupt  -       19
  000000 | rs | rt | rd | 00000 | 100100
     6      5    5    5     5       6
Figure. 4.3: Format of the and instruction in MIPS32 ISA

  001001 | rs | rt | 0's
     6      5    5    16
Figure. 4.4: addiu instruction with immediate field containing zero value

  001001 | rs | rt | 0000000000101100
     6      5    5         16
Figure. 4.5: addiu instruction with only most significant byte of
immediate as zero

  001001 | rs | rt | 0000001100000000
     6      5    5         16
Figure. 4.6: addiu instruction with only least significant byte of
immediate as zero

  001001 | rs | rt | 0001010100001001
     6      5    5         16
Figure. 4.7: addiu instruction with both bytes of immediate field as
non-zero value
2. In immediate-type instructions such as addi, 16 bits are used for
the immediate operand. In most cases, eight bits are sufficient
and the remaining 8 bits are redundant. Figures. 4.4 to 4.7 illustrate
the four different patterns of the immediate field; in only one of them are
both bytes non-zero. Thus, in the other three cases there is a wastage
of one or two bytes. It was seen in Chapter 3 that two of the 23 benchmarks
need at most 8 bits for the immediate field, and even among the other
benchmarks more than 8 bits are required in at most 10% of the cases.
This behaviour of embedded code is exploited in the hybrid encoding we
designed for the immediate field, as discussed in the next section.
3. In branch instructions such as beq, the offset field is underutilized
whenever the required offset can be specified with 8 bits. It was
seen in Chapter 3 that 16 of the 23 benchmarks need at most 8 bits
for the offset field, and even among the other benchmarks more than
8 bits are required in at most 3% of the cases. The hybrid
encoding technique used for the immediate field is applied to the offset
field as well.
The impact of these drawbacks on the code size has been studied in
Chapter 2. Sections 4.2 and 4.4 deal with two different HIE techniques for
MIPS32 that aim to minimise unused fields within instructions and to
improve the utilization of the offset and immediate fields. When a
new processor is designed, the computer architect has greater flexibility in
choosing ISA attributes such as instruction formats, opcodes,
addressing modes and the number of registers. However, developing a new
processor is an involved process requiring appropriate tools and consuming
several man-years, and to evaluate such a processor an entire tool chain
has to be created. This research work therefore develops the HIE designs for
MIPS by modifying certain features of MIPS32. This approach
helps to verify and prove the concept, though the extent of code size
reduction achievable is slightly less than what is possible with
a newly designed HIERISC processor. In HIE1, the solution involves
reducing the number of GPRs to 16, whereas in HIE2, the maximum length
of the offset/immediate is reduced by 1 bit. In HIE1, the lengths of the OP and
fn fields are retained as in MIPS32. In HIE2, the six-bit fn field is eliminated;
instead, an iid field of two or three bits serves the purpose of
instruction identification. Both design approaches share certain common
aspects such as hybrid lengths for the offset and immediate fields. HIE1 is
discussed in detail in Section 4.2, and the design of HIE2 is taken up in
Section 4.4.
4.2. HIE1 METHODOLOGY FOR MIPS32
To evaluate the effectiveness of our proposed HIERISC ISA, it is
designed as a piggyback on the MIPS32 ISA: for every integer
instruction of the MIPS32 ISA, an equivalent HIE instruction is provided. In the
HIE1 ISA, the integer instructions fall into five groups: three 8-bit, seven
16-bit, twenty-one 24-bit and three 32-bit instructions, plus thirty-two
instructions with three length options (16/24/32 bits).
4.2.1 HIE1 RISC Instructions
The HIE1 design supports nine different types of integer instructions.
Figure. 4.8 shows the proposed instruction formats for HIE1. Out of the 66
integer instructions, j, jal, and break are retained as 32 bits due to system-software
implications. The remaining instructions are translated into one of
the HIE1 types. In several ALU instructions, there are five redundant zeros.
As pointed out earlier, the register fields are reduced by one bit each so
that these instructions can be reduced to 24 bits, as shown in Figure. 4.9.
This restricts the number of GPRs to 16; however, it will not strain the
compiler, as the graph-colouring technique for register allocation works
satisfactorily with 16 GPRs [3]. Popular RISC processors such as ARM and
SH4 have only 16 registers.
Figure. 4.8: HIE1 RISC Instruction Formats
op rs rt rd fn
6 4 4 4 6
Figure. 4.9: R Type instruction in HIE1
The nop, rfe and syscall are 8-bit instructions with a common
opcode and a 2-bit iid field to identify the instruction. The 16-bit instructions
are jr, mfhi, mflo, mthi, mtlo, mfcz and mtcz. In mfcz and mtcz, the rd field is
retained as 5 bits since it refers to coprocessor registers; an iid bit
differentiates between mfcz and mtcz. The mfhi, mflo, mthi and mtlo have a
common format in which a single register field is shared between rd and rs:
it denotes rd for mfhi and mflo, and rs for mthi and mtlo.
The 24-bit instructions, which form three different R-types, are add,
addu, and, div, divu, mult, multu, nor, or, sll, sllv, sra, srav, srl, srlv, sub,
subu, xor, slt, sltu, and jalr. In type 1, there is no sa field. In type 2, there is
no rs field. In type 3, there are four zeros to maintain byte alignment. The
remaining 32 instructions have three length options: 16, 24, or 32 bits. The
offset and immediate fields are encoded in a unique way in our proposal.
Table 4.5 shows a typical example using hexadecimal notation. If the value
of the offset/immediate is zero, the field is omitted entirely. When one of the
bytes of the offset/immediate is zero, that byte is omitted and the hybrid
identifier hl is formed accordingly. All four cases share a common
opcode.
Table 4.5: Sample Encoding of Offset/Immediate Field in HIE1-MIPS

  MIPS32 encoding  HIE1-MIPS encoding  hl bits  HIE1 instruction size (bits)
  0000             nil                 00       16
  000F             0F                  01       24
  0F00             0F                  10       24
  0F0F             0F0F                11       32
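The hybrid encoding of Table 4.5 can be sketched as a small function. `encode_hybrid` is our own illustrative name; it encodes only the offset/immediate field, not the whole instruction.

```python
def encode_hybrid(imm16):
    """Return (hl_bits, emitted_bytes) for a 16-bit offset/immediate
    under the HIE1 hybrid scheme of Table 4.5: an all-zero byte is
    omitted, and the two hl bits record which bytes are present
    (bit 1 = high byte present, bit 0 = low byte present)."""
    hi, lo = (imm16 >> 8) & 0xFF, imm16 & 0xFF
    hl = ((1 if hi else 0) << 1) | (1 if lo else 0)
    data = bytes(b for b in (hi, lo) if b)
    return hl, data
```

Each omitted byte shortens the instruction by 8 bits, which is exactly how the 16/24/32-bit sizes of Table 4.5 arise.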
4.2.2 Mapping MIPS32 ISA to HIE1
MIPS instructions are converted into HIE1 RISC instructions of
different types, as illustrated in Table 4.6. As indicated earlier, the three
instructions j, jal and break are unchanged. For all others, the conversion
depends on the opcode and the immediate/offset fields. All unconverted
instructions are retained as 32 bits. For some instructions, more than one
type of conversion is possible; for example, the addi instruction has three
cases, addi-a, addi-b and addi-c, where a means the converted length is
16 bits and b and c mean it is 24 bits. Identifying certain MIPS
instructions involves multiple match conditions. For example, for the bczt
instruction, the first byte may be any one of four values: 41, 45,
49, 4D. In addition, the second byte has 16 possible values:
01, 03, 05, 07, 09, 0B, 0D, 0F, 11, 13, 15, 17, 19, 1B, 1D, 1F. For the NOP
instruction, all 32 bits are 0's.
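The multiple match conditions just listed can be sketched as a predicate on the first two instruction bytes. `is_bczt` is an illustrative helper checking only this one pattern, not a full MIPS32 decoder.

```python
def is_bczt(byte0, byte1):
    """Match the bczt pattern described above: the first byte is one
    of 0x41/0x45/0x49/0x4D (one value per coprocessor z) and the
    second byte is one of the 16 odd values 0x01..0x1F."""
    return byte0 in (0x41, 0x45, 0x49, 0x4D) and byte1 in range(0x01, 0x20, 2)
```

A converter would run such predicates over the MIPS32 object code before deciding which HIE1 group a word belongs to.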
Table 4.6: MIPS32 ISA to HIE1 RISC ISA Mapping

Group A - 3 instructions, 8 bits, 0 RZ: rfe, syscall, nop (exception and
interrupt). Common OP field; the iid field differentiates the instructions.

Group B - 2 instructions, 16 bits, 0 RZ: mfcz, mtcz (data movement with
coprocessor). Common OP field; a one-bit iid differentiates. The rt field is
four bits but rd is five bits.

Group C - 5 instructions, 16 bits, 0 RZ: jr, mfhi, mflo, mthi, mtlo (jr is the
jump register instruction; the others are data movement). The OP and fn
fields are similar to MIPS32. The 4-bit register field is rs for jr, mthi and
mtlo; for mfhi and mflo it is rd.

Group D - 13 instructions, 24 bits, 0 RZ: add, addu, and, nor, or, sllv, srav,
srlv, sub, subu, xor, slt, sltu (R-type; slt and sltu are comparisons, the
others arithmetic and logical). HIE1 R-Type1: all fields are similar to
MIPS32 except that the unused zeros are deleted and the register fields
are 4 bits.

Group E - 3 instructions, 24 bits, 0 RZ: sll, sra, srl (R-type; shift). HIE1
R-Type2: all fields are similar to MIPS32 except that the unused rs field is
deleted and the register fields are 4 bits.

Group F - 5 instructions, 24 bits, 4 RZ: jalr, div, divu, mult, multu (R-type;
arithmetic). HIE1 R-Type3: all fields are similar to MIPS32 except that 6
unused zeros are deleted and the register fields are 4 bits; 4 zeros
maintain byte alignment. In jalr, the register fields are rs and rd; in the
other instructions, they are rs and rt.

Group G - 32 instructions, 16/24/32 bits, 0 RZ: addi, addiu, andi, ori, xori,
lui, slti, sltiu, bczt, bczf, beq, bgez, bgezal, bgtz, blez, bltzal, bltz, bne, lb,
lbu, lh, lhu, lw, lwcz, lwl, lwr, sb, sh, sw, swcz, swl, swr (a mixture of
arithmetic/logical, constant manipulation, compare, branch, load and store
instructions, mostly I-type; the branch/load/store instructions carry an
offset). HIE1 I-Type: all fields are similar to MIPS32 except that the
immediate/offset field can take three different lengths: 0/8/16 bits. In lui,
the rs field contains 4 zeros. All register fields are 4 bits.

Group H - 2 instructions, 32 bits, 0 RZ: j, jal (jump). Similar to MIPS32.

Group I - 1 instruction, 32 bits, 0 RZ: break (exception and interrupt).
Similar to MIPS32.
4.3 HIE1 EXPERIMENTAL RESULTS
There is wide variation in the sizes of the benchmark programs.
Out of the 23 embedded applications, four are small (<= 10KB), four are
medium (10KB-100KB) and fifteen are large (>= 100KB). The code size
reduction for the individual benchmarks in each category is shown in Figures.
4.10 to 4.16. The reduction for embedded programs varies from 18% to 27%.
Three programs - susan, bitcount and JPEG - get the maximum reduction,
and the lame program gets the least. There is very little or no variation in the
reduction percentages of the different applications in the network, speech and
security segments. The Automotive and Industrial Control benchmarks
(Figure. 4.10) show reductions varying from 21% to 27%. All three network
benchmarks get around 21.5%, as shown in Figure. 4.11. As shown in
Figure. 4.12, the video program, MPEG2, gets a 21% reduction, whereas the
two audio benchmarks - ADPCM and lame - differ noticeably.
In the image segment, JPEG and susan achieve reductions exceeding 26%,
but EPIC and fft give only 21%, as shown in Figure. 4.13.
Figure. 4.10: Effect of HIE1 on Automotive and Industrial Control
Benchmarks
Figure. 4.11: Effect of HIE1 on Network Benchmarks
Figure. 4.12: Effect of HIE1 on Video and Audio Benchmarks
Figure. 4.13: Effect of HIE1 on Image Benchmarks
Figure. 4.14: Effect of HIE1 on Speech Benchmarks
Figure. 4.15: Effect of HIE1 on Security Benchmarks
Figure. 4.16: Effect of HIE1 on Text Benchmarks
Figure. 4.17: Effect of HIE1 on Embedded Segments
In the speech segment, all three benchmarks have reduction ratios
between 21% and 22%, as shown in Figure. 4.14. The four benchmarks of the
security segment have reduction ratios between 21% and 22%, as shown in
Figure. 4.15. The three benchmarks of the text segment show reductions from
19% to 21%, as in Figure. 4.16. A comparison of the reduction percentages
across the segments is shown in Figure. 4.17. Since many segments
contain multiple benchmarks, geometric means of the reduction percentages
are used. It is observed that the Automotive and Consumer segments gain
the most from the reduction, and the Audio segment gains the least.
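The segment-level aggregation can be sketched as follows, using the HIE1 figures for the Automotive and Industrial Control benchmarks from Table 4.8; `geomean` is an illustrative helper, not a named tool from the thesis.

```python
from math import prod

def geomean(values):
    """Geometric mean, as used to aggregate per-benchmark code
    reduction percentages into a single figure per segment."""
    return prod(values) ** (1.0 / len(values))

# HIE1 reduction percentages for basicmath, bitcount, qsort, susan
automotive = [20.29, 26.64, 23.67, 26.80]
segment_reduction = geomean(automotive)   # roughly 24.2
```

The geometric mean is less sensitive than the arithmetic mean to a single outlying benchmark, which is why it is preferred for cross-segment comparison here.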
4.3.1 Drawback of register size reduction
Though there are many processors, such as ARM, that manage well
with just 16 registers, experiments with the MiMedia object codes for
MIPS32 reveal a different picture. Out of the 23 benchmarks, only susan is not
affected significantly, as seen in Figure. 3.14: susan needs more than 16
registers in only 5% of the cases, which can easily be handled by the
compiler. But all the other benchmarks use more than 16 registers in not less
than 35% of the register accesses. The worst-case behaviour is that of jpeg,
which needs more than 16 registers in 59% of the cases. Eliminating these
accesses by having the compiler restructure the code will certainly increase
the number of instructions, apart from reducing performance. As per
Figure. 3.15, only the benchmarks of two segments, Automotive and Industrial
Control and Image, may tolerate a reduction of registers from 32 to 16.
However, if susan is excluded from these two segments, this model will
fare as badly as MIPS16, if not worse.
Another issue to be addressed at this point is the impact of reducing the
shift amount field sa from 5 bits to 4. As seen in Figure. 3.16, only two
programs, bitcount and sha, make frequent use of shifts of more than 16 bit
positions. Figure. 3.17 confirms that the requirement of most embedded
segments lies between 5% and 11%. Hence HIE1 will not noticeably affect
the shift operations. However, HIE1 is not attractive if register usage is heavy.
Hence this technique is recommended only for the toy market, wherein
performance is not an issue.
4.4 DESIGN OF HIE2
The HIE2 follows a more aggressive reduction policy than HIE1 in
the following respects:
1. The number of redundant 0's in the OPX/fn field is also reduced. In
HIE1, 24 instructions have redundant 0's, whereas HIE2 has
only 9 instructions with redundant 0's.
2. HIE2 has 12 different instruction formats, whereas HIE1 has
nine.
In HIE2, MIPS32 instructions are converted into new HIE Plus
instructions of 12 different types by retaining the length of the register field
as 5 bits, while the maximum length of the offset/immediate fields is reduced
to 15 bits.
4.4.1 Impact of Reduction of immediate and offset lengths to 15 bits
The 16-bit immediate field is heavily underutilized by embedded
benchmarks, as shown in Figure. 3.7. Except for two benchmarks, rsynth
and typeset, the other 21 programs need the full 16 bits in less than 10% of the
cases. The average requirement is only 5%. As seen in Figure. 3.8, the
benchmarks of most embedded segments need the full 16 bits in only 0% to
5% of the cases. The benchmarks of the Automotive and Speech
segments are at the two extremes, but within a short range, as shown in
Figure. 3.8. Our analysis shows that embedded applications use the full 16 bits
of the offset field very rarely, as shown in Figure. 3.9. The average requirement
is less than 1%. In fact, 16 benchmarks never use more than 15 bits. Even the
remaining programs need 16 bits in at most 3% of the cases. The overall
behaviour of the embedded segments in this respect is shown in Figure. 3.10.
The Network and Speech segments are satisfied with 15 bits of offset. The
worst-case requirement is that of the Image segment, which has 2% of cases
using more than a 15-bit offset. Combining the usage requirements of both the
immediate and offset fields, the average figure is less than 6%. Hence the
decision to reduce the immediate and offset fields by 1 bit in HIE2 is a
better choice than reducing the register and shift amount fields as in HIE1.
4.4.2 HIE2 Design for MIPS32
Since the HIE1 design has already been discussed in depth, this
section focuses on the essential differences of the HIE2 design. In
HIE2, the instructions are of 12 types, as shown in Figure. 4.18. The need
to assign 66 integer opcodes and to reserve some opcodes for future
expansion is the main reason for this large number of types. The
HIE2 features are as follows.
1. The HIE2 supports the following sizes for integer instructions:
(a) Three 8-bit instructions of type A
(b) Twelve 16-bit instructions: 2 type B, 5 type C and 5 type F
(c) Sixteen 24-bit instructions: 8 type D1, 5 type D2 and 3 type E
(d) Three 32-bit instructions: 2 type H and 1 type I, and
(e) Thirty-two instructions with multiple options: 26 type G1
instructions with 24/32 bits; 2 type G2 instructions with
8/16/24 bits; and 4 type G3 instructions with 16/24/32 bits
2. In all instructions, the most significant bit (IT) indicates the instruction
type. For types G1, G2 and G3, the IT bit is 0, indicating hybrid-length
fields; for all other types, the IT bit is 1, indicating fixed-length fields.
3. The OP is only 5 bits for all instructions in HIE2. However, the IT bit
is an addition.
4. The instruction identifier (iid) field indicates the exact instructions
within a group of instructions with a common OP. The length of iid is
either 2 or 3 bits.
5. The hybrid length (hl) field indicates the length of offset / immediate
fields as in HIE1.
6. Out of 66 integer instructions, j, jal, and break, are retained as 32
bits as in HIE1.
7. Unlike HIE1, the register fields are retained as 5-bits by HIE2.
8. The syscall, nop, and rfe are 8-bit instructions with a common
opcode (OP = 00001) and a 2-bit iid to identify the instruction. The
iid patterns 00, 01, 10 represent syscall, nop and rfe respectively.
The 11 combination is reserved for future addition.
9. The 16-bit instructions are of 3 types: B, C and F. In type B,
mfcz and mtcz have two different opcodes. In type C, all five
instructions have a common opcode, and a three-bit iid field identifies
the exact instruction. The mfhi, mflo, mthi and mtlo have a common
format in which the register field is shared between rd and rs: in the
mfhi/mflo/mthi/mtlo format, the rd/rs field denotes rd for mfhi and
mflo, and rs for mthi and mtlo. In type F, each instruction has a
separate opcode.
10. In the 24-bit instructions, the formats of types D1 and D2 are
similar but have different opcodes. The eight instructions of type D1
share a common opcode, and a three-bit iid field identifies the exact
instruction. Similarly, type D2 has five instructions sharing a
common opcode, distinguished by a three-bit iid field.
11. The 32-bit instructions are of two types. In type H, each of the two
instructions has a separate opcode. In type I, there is only one
instruction.
12. In type G1, there are 26 instructions, each with a separate opcode.
The instruction length is either 24 or 32 bits. A one-bit hl field
indicates the actual length of the immediate/offset field, which can be
either 7 or 15 bits.
13. In type G2, there are two instructions with separate opcodes. The
instruction length can be 8/16/24 bits and the offset length
0/8/16 bits; a two-bit hl field identifies the length.
14. In type G3, there are four instructions with a common opcode. A two-bit
iid field identifies the exact instruction. The instruction length can
be 16/24/32 bits and the offset length 0/8/16 bits; a two-bit
hl field indicates this as 00, 01, 10 or 11.
The mapping between the MIPS32 ISA and HIE2 ISA is illustrated in
Table 4.7.
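The type-A format at the head of the list can be sketched as a one-byte encoder. `encode_hie2_type_a` is our own illustrative name, built directly from the field layout stated above (IT = 1, OP = 00001, 2-bit iid).

```python
def encode_hie2_type_a(iid):
    """Encode an 8-bit HIE2 type-A instruction (syscall/nop/rfe):
    IT bit = 1 (fixed-length), OP = 00001, then the 2-bit iid.
    iid 00 = syscall, 01 = nop, 10 = rfe; 11 is reserved."""
    assert iid in (0b00, 0b01, 0b10), "iid 11 is reserved"
    return (1 << 7) | (0b00001 << 2) | iid

# syscall -> 1 00001 00; nop -> 1 00001 01; rfe -> 1 00001 10
```

Packing the IT bit, opcode and iid into one byte shows how HIE2 reaches an 8-bit instruction where MIPS32 needs 32 bits.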
A   8-bit: nop, syscall, rfe
    it(1) opcode(5) iid(2)

B   16-bit: mfcz, mtcz
    it(1) opcode(5) rt(5) rs(5)

C   16-bit: mfhi, mflo, mthi, mtlo, jr
    it(1) opcode(5) iid(3) rd/rs(5) 00(2)

D1  24-bit: add, addu, and, nor, or, sub, subu, xor
    it(1) opcode(5) iid(3) rs(5) rt(5) rd(5)

D2  24-bit: sllv, srav, srlv, slt, sltu
    it(1) opcode(5) iid(3) rs(5) rt(5) sa(5)

E   24-bit: sll, sra, srl
    it(1) opcode(5) iid(2) rt(5) rd(5) sa(5) 0(1)

F   16-bit: jalr, div, divu, mult, multu
    it(1) opcode(5) rs(5) rt/rd(5)

G1  24/32-bit: addi, addiu, lui, slti, sltiu, beq, bgezal, bltzal, bne, lb, lbu ...
    it(1) opcode(5) hl(1) rs(5) rt(5) immediate/offset(7/15)

G2  8/16/24-bit: bczt, bczf
    it(1) opcode(5) hl(2) offset(0/8/16)

G3  16/24/32-bit: bgez, bgtz, blez, bltz
    it(1) opcode(5) iid(2) hl(2) rs(5) 0(1) offset(0/8/16)

H   32-bit: j, jal
    it(1) opcode(5) target(26)

I   32-bit: break
    it(1) opcode(5) code(20) 0's(6)

Figure. 4.18: HIE2 instruction formats
Table 4.7: Mapping MIPS32 ISA to HIE2 ISA

  HIE2   Length    Main     Instructions  Free   Allotted instructions            RZ
  type   (bits)    opcodes  allotted      slots
  A      8         1        3             1      syscall, nop, rfe                0
  B      16        2        2             -      mfcz, mtcz                       0
  C      16        1        5             3      mfhi, mflo, mthi, mtlo, jr       2
  F      16        5        5             -      jalr, div, divu, mult, multu     0
  D1     24        1        8             -      add, addu, and, nor, or, sub,    0
                                                 subu, xor
  D2     24        1        5             3      sllv, srav, srlv, slt, sltu      0
  E      24        1        3             1      sll, sra, srl                    1
  H      32        2        2             -      j, jal                           0
  I      32        1        1             -      break                            0
  G1     24/32     26       26            -      addi, addiu, andi, ori, xori,    0
                                                 lui, slti, sltiu, beq, bgezal,
                                                 bltzal, bne, lb, lbu, lh, lhu,
                                                 lw, lwcz, lwl, lwr, sb, sh, sw,
                                                 swcz, swl, swr
  G2     8/16/24   2        2             -      bczt, bczf                       0
  G3     16/24/32  1        4             -      bgez, bgtz, blez, bltz           1
4.5 DISCUSSION ON HIE2 RESULTS
Table 4.8 compares the results of both HIE versions. It is observed
that in HIE2, the MiMedia application programs get reductions ranging from
18% to 27%, the same range as in HIE1. Further, except for two programs,
the improvement of HIE2 over HIE1 is less than 0.5%.
Table 4.8: Comparison of Code reduction schemes HIE1 and HIE2
Application Area | Application Name | HIE1 code reduction % | HIE2 code reduction % | HIE2 improvement over HIE1
Automotive and Industrial Control
basicmath 20.29 21.55 1.26
bitcount 26.64 26.80 0.16
Qsort 23.67 24.07 0.40
susan 26.80 27.03 0.23
Network
dijkstra 21.48 21.59 0.11
patricia 21.52 21.63 0.11
CRC32 21.49 21.60 0.11
Video MPEG2 20.95 21.30 0.35
Audio
ADPCM 21.49 21.60 0.11
lame 17.77 18.08 0.31
Image
JPEG 26.43 26.97 0.54
EPIC 21.40 21.55 0.15
fft 21.08 21.45 0.37
(susan) 26.80 27.03 0.23
Speech
GSM 21.44 21.54 0.10
G721 21.59 21.70 0.11
rsynth 22.23 22.46 0.23
Security pegwit 21.60 21.70 0.10
sha 22.72 22.83 0.11
blowfish 21.54 21.65 0.11
rjindael 21.65 21.77 0.12
Text
typeset 20.67 20.81 0.14
stringsearch 21.49 21.60 0.11
ispell 19.06 19.08 0.02
Thus HIE2 gives marginally better results with regard to the
percentage of code reduction. Moreover, HIE2 is more compiler-friendly
than HIE1 and should incur minimal performance loss. It was expected
that HIE2 would give poorer results than HIE1, yet the reduction
achieved by HIE2 is either the same as that of HIE1 or marginally
higher, by 0.02% to 1.26%, as seen in Table 4.8. The domain-wise
comparison of code size reductions by the two HIE versions is
presented in Figure 4.19. HIE2 offers better code reduction for all
the embedded segments, though the improvement is negligible in most
cases and the maximum gain is less than 1%. Hence, HIE2 is preferable
for embedded SoCs developed for most segments of embedded applications
requiring performance.
Figure. 4.19: Code Reduction Comparison between HIE1 and HIE2
4.5.1 Reduction in Memory Accesses in HIE
Greater code density reduces static code size. This is particularly
important for many embedded systems, especially microcontrollers, since
code memory can be a large fraction of the system cost and influences
the system's physical size, which in turn affects fitness for purpose
and manufacturing cost. Improving dynamic code size reduces the amount of bandwidth used
to fetch instructions. This can reduce cost and energy use and can improve
performance. Smaller dynamic code size also reduces the size of caches
needed for a given hit rate; smaller caches can use less energy and less
chip area and can have lower access latency.
Reduction of switching activity per instruction lowers bus energy
consumption by reducing bit toggles per instruction fetch. However,
total fetch energy must also account for memory access energy and the
energy of the additional HIE fetch logic and buffers.
Due to the code size reduction by HIE, the average instruction size
has been reduced from 4 bytes to between 2.93 and 3.18 bytes, as shown
in Table 4.9; across the benchmarks it works out to 3.12 bytes. This
indicates the extent of energy saving that HIE can achieve in embedded
systems. The number of instruction fetches is reduced by the same
percentage as the code size. The number of memory cycles per
instruction varies from 0.73 to 0.80, as shown in Table 4.10, and the
switching power reduces proportionately. However, the static
instruction count is not directly related to bus traffic reduction, as
there is always a difference between the number of instructions in the
code (static count) and the number of instructions fetched and
executed by the processor (dynamic count); depending on the program,
the dynamic count may or may not track the static count. The study by
Benini [38] has established the existence of a trade-off between
static code footprint and fetch bandwidth reduction.
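The derived columns of Tables 4.9 and 4.10 follow directly from the fetch counts, treating each HIE fetch as one 32-bit memory cycle. A short sketch (function name ours) reproduces the basicmath row:

```python
# Derive the HIE columns of Tables 4.9/4.10 from the fetch counts.
# MIPS32 needs one 32-bit fetch per instruction; HIE needs fewer
# fetches because instructions are 8/16/24/32 bits wide.

def hie_metrics(risc_fetches, hie_fetches):
    """Return (avg bytes/instruction, memory cycles/instruction, % fetch reduction)."""
    avg_bytes = 4.0 * hie_fetches / risc_fetches       # HIE code bytes / instruction count
    cycles_per_instr = hie_fetches / risc_fetches      # 32-bit memory cycles per instruction
    reduction = 100.0 * (risc_fetches - hie_fetches) / risc_fetches
    return avg_bytes, cycles_per_instr, reduction

# basicmath row of Table 4.9: 1246 RISC fetches vs 994 HIE fetches
avg, cpi, red = hie_metrics(1246, 994)
print(round(avg, 2), round(cpi, 2), round(red, 2))    # → 3.19 0.8 20.22
```

The 20.22% fetch reduction matches Table 4.10, and the per-instruction figures agree with the tabulated values to within rounding.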
Table 4.9: Average Instruction Size in HIE
Embedded Domain | Application | No. of RISC instruction fetches | No. of HIE instruction fetches | Reduction in instruction fetches in HIE | HIE average no. of bytes per instruction
Automotive and Industrial Control
basicmath 1246 994 252 3.18
bitcount 1067 783 284 2.93
Qsort 486 371 115 3.05
susan 12750 9334 3416 2.93
Network dijkstra 115786 90924 24862 3.14
patricia 115936 90995 24941 3.14
CRC32 115375 90584 24791 3.14
Video MPEG2 134082 106005 28077 3.16
Audio ADPCM 115221 90469 24752 3.14
lame 55973 46027 9946 3.29
Image JPEG 17596 12946 4650 2.94
EPIC 123807 97313 26494 3.14
fft 124660 98385 26275 3.16
(susan) 12750 9334 3416 2.93
Speech GSM 127402 100098 27304 3.14
G721 117785 92364 25421 3.14
rsynth 6556 5099 1457 3.11
Security pegwit 127658 100084 27574 3.14
sha 1015 804 211 3.17
blowfish 116651 91532 25119 3.14
rjindael 119116 93333 25783 3.13
Text
typeset 126313 100252 26061 3.17
stringsearch 115574 90747 24827 3.14
ispell 12080 9778 2302 3.24
Table 4.10: Memory Cycles for Instruction Fetch in HIE
Embedded Domain | Application | % reduction in HIE instruction fetches | No. of memory cycles per instruction
Automotive and Industrial Control
basicmath 20.22 0.80
bitcount 20.62 0.73
Qsort 23.66 0.76
susan 26.79 0.73
Network dijkstra 21.47 0.79
patricia 21.51 0.78
CRC32 21.49 0.79
Video MPEG2 20.94 0.79
Audio
ADPCM 21.48 0.79
lame 17.77 0.82
Image JPEG 26.43 0.74
EPIC 21.40 0.79
fft 21.08 0.79
(susan) 26.79 0.73
Speech
GSM 21.43 0.79
G721 21.58 0.78
rsynth 22.22 0.78
Security pegwit 21.60 0.78
sha 20.79 0.79
blowfish 21.53 0.78
rjindael 21.65 0.78
Text
typeset 20.63 0.79
stringsearch 21.48 0.79
ispell 19.05 0.81
4.5.2 Reduction in Redundant zeros in HIE
A major parameter contributing to code reduction in HIE is the
elimination of redundant zeros relative to the MIPS32 code. Among the
MIPS32 integer instructions, 35 have redundant zeros, whereas the HIE1
and HIE2 codes have only five and nine instructions, respectively,
with redundant zeros. Table 4.11 compares the net RZs of the HIE1 and
HIE2 codes with the RZ of the MIPS32 code.
Table 4.12 summarizes the Percentage of Code Reduction (PCR) achieved
by the HIE technique for the 23 benchmark programs, classified by
size. Table 4.13 compares the HIE PCR with the total wastage,
including RZ and WASTIO. Interestingly, the extent of code reduction
by HIE is approximately equal to the extent of code wastage estimated
in Chapter 3: the PCR is either equal to the total wastage percentage
or higher by one or two percentage points. In Chapter 3 we suspected a
relationship between the code size reduction in HIE and three
properties of MIPS32 object codes: FTFI, RZ and WASTIO. Though the
FTFI gives an idea of the scope for code reduction, it is not the only
deciding factor. In most cases, the code size reduction is higher for
programs that have a higher proportion of the four major instructions
and greater underutilization of the immediate and offset fields. This
behaviour forms the backbone of our HIE methodology, although a few
programs deviate marginally from it.
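The comparison in Table 4.13 is simple arithmetic: the predicted wastage A = %RZ + %WASTIO is set against the measured PCR, B. A one-line sketch (function name ours):

```python
def wastage_gap(rz, wastio, pcr):
    """A = %RZ + %WASTIO (predicted wastage); returns (A, B - A) with B = HIE PCR."""
    a = rz + wastio
    return a, pcr - a

# basicmath row of Table 4.13: RZ=12, WASTIO=9, PCR=22
print(wastage_gap(12, 9, 22))   # → (21, 1)
```

For every benchmark in Table 4.13 the gap B - A lies between 0 and 2 percentage points.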
Table 4.11: Comparison of RZs of Embedded Applications
Embedded Domain | Benchmark | % RZ MIPS | % RZ HIE1 | % RZ HIE2
Automotive and Industrial Control
basicmath 12 0.06 0.17
bitcount 12 0.10 0.37
Qsort 13 0.17 0.41
susan 8 0.02 0.06
Network dijkstra 7 0.07 0.18
patricia 7 0.07 0.18
CRC32 7 0.07 0.18
Video MPEG2 7 0.06 0.17
Audio ADPCM 7 0.07 0.18
lame 5 0.02 0.11
Image JPEG 11 0.26 0.27
EPIC 7 0.07 0.18
fft 7 0.06 0.18
(susan) 8 0.02 0.06
Speech GSM 7 0.06 0.18
G721 7 0.07 0.18
rsynth 11 0.06 0.22
Security pegwit 7 0.06 0.17
sha 5 0.08 0.22
blowfish 7 0.07 0.18
rjindael 7 0.07 0.18
Text typeset 5 0.02 0.06
stringsearch 7 0.07 0.18
ispell 7 0.01 0.22
Table 4.12: Typical Code Size Reduction of Embedded Applications
in HIE
PCR | Small (< 10 KB) | Medium (10 KB - 100 KB) | Large (> 100 KB)
Below 20 | - | ispell | lame
20-25 | basicmath, qsort, sha | rsynth | typeset, fft, CRC32, dijkstra, patricia, blowfish, rijndael, adpcm, gsm, pegwit, mpeg2, g721, epic, stringsearch
Above 25 | bitcount | susan | jpeg
For instance, sha has only 42% of the four major instructions, and
only 16% of its code is wasted through underutilization of the
immediate and offset fields. In spite of this, HIE achieves 23% code
size reduction for sha. This could be due to the increased number of
R-type instructions in the MIPS32 code for sha; these instructions
have been reduced to 24 bits in the HIE-MIPS code.
Table 4.13: Relationship between RZ, WASTIO AND HIE PCR
Embedded Domain | Benchmark | % RZ | % WASTIO | A = % RZ + % WASTIO | B = HIE PCR | Difference B - A
Automotive and Industrial Control
basicmath 12 9 21 22 1
bitcount 12 14 26 27 1
Qsort 13 11 24 24 0
susan 8 17 25 27 2
Network
dijkstra 7 13 20 22 2
patricia 7 13 20 22 2
CRC32 7 13 20 22 2
Video MPEG2 7 13 20 21 1
Audio ADPCM 7 13 20 22 2
lame 5 12 17 18 1
Image JPEG 11 14 25 27 2
EPIC 7 13 20 22 2
fft 7 13 20 21 1
(susan) 8 17 25 27 2
Speech
GSM 7 13 20 22 2
G721 7 13 20 22 2
rsynth 11 11 22 22 0
Security pegwit 7 13 20 22 2
sha 5 17 22 23 1
blowfish 7 13 20 22 2
rjindael 7 13 20 22 2
Text typeset 5 16 21 21 0
stringsearch 7 13 20 22 2
ispell 7 12 19 19 0
4.6 PROCESSOR MODIFICATIONS TO SUPPORT HIE
The processor has to manage the non-uniform instruction sizes of HIE.
The instruction fetch logic requires dual instruction buffers, and the
instruction decoder needs to be more exhaustive. With fixed
instruction encoding, the instruction length is known to the processor
in advance, so during sequential program execution the PC is simply
incremented by the instruction length in bytes. In HIE, instructions
have different lengths; hence the instruction being executed must be
decoded to determine its width. This is somewhat less of a problem for
implementations that only decode one instruction per cycle. For wider
decode (superscalar) several tricks are available to reduce the cost of
parsing out individual instructions from a block of instruction memory. One
technique is to use marker bits to indicate the start or end of an instruction.
Such marker bits would be set for each parcel of instruction encoding and
stored in the instruction cache. Several AMD x86 implementations have
used marker bit techniques. Alternatively, marker bits can be included
in the instruction encoding, as done in HIE. This places some
constraints on opcode assignment and placement since the marker bits
effectively become part of the opcode. Another technique, used by the
IBM zSeries (S/360 and its descendants), is to encode the instruction
length in a simple way in the opcode of the first parcel. The zSeries
uses two bits to encode three instruction lengths (16, 32 and 48
bits), with two of the four encodings mapping to the 32-bit length. By
placing this in a fixed position, it is relatively easy
to quickly determine where the next sequential instruction begins. With
hybrid/variable length instructions, part of the opcode must often be
decoded before the basic parsing of the instruction can be started as
discussed in Chapter 6. This tends to delay the availability of register
names and other, less critical information. With more complex
implementations (deeper pipelines, out-of-order execution, etc.), the extra
relative complexity of handling variable length instructions is reduced. After
instruction decode, a sophisticated implementation of an ISA with variable
length instructions tends to look very similar to one of an ISA with fixed
length instructions.
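For illustration, the zSeries-style length rule can be sketched as follows. The two high bits of the first 16-bit parcel select the length (the S/360 convention: 00 means 16 bits, 01 and 10 mean 32 bits, 11 means 48 bits), so the next sequential PC is known before full decode. Function names are ours:

```python
def zseries_ilen_bits(first_parcel):
    """Instruction length in bits from the two msbs of the first 16-bit parcel
    (S/360 rule: 00 -> 16, 01/10 -> 32, 11 -> 48)."""
    return {0b00: 16, 0b01: 32, 0b10: 32, 0b11: 48}[(first_parcel >> 14) & 0b11]

def next_pc(pc, first_parcel):
    """Sequential next-instruction byte address, computable before full decode."""
    return pc + zseries_ilen_bits(first_parcel) // 8
```

The point of the fixed position is visible here: `next_pc` needs only the first parcel, not the rest of the instruction.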
In order to handle interrupts, the PC should have an auxiliary
register. The Motorola MC680X0 is a relevant example. The MC680X0 uses
variable-length instructions: some use a second word to complete the
instruction specification, whereas others use extension words to
complete certain address specifications. There are two PCs: a
conventional PC register, and a scanPC register that is not visible to
the programmer. The scanPC keeps track of the words that have been
interpreted. For instruction retry, and for returns after interrupts
that occur within delay slots, the conventional PC is used; for
instruction fetches, the scanPC is used. Clearly, any processor
supporting variable-length instructions has to check whether each
instruction straddles a cache line or virtual memory page boundary.
This aspect is further discussed in Chapter 6.
In HIE, the instruction decoder should expand the immediate and
offset fields to 16 bits based on the hl bit pattern. The four actions
needed are as follows:

1. If hl=00, 16 zeros should be forced into the offset/immediate
register.

2. If hl=01, eight zeros should be entered in the most significant
byte position of the offset/immediate register, and the eight-bit
contents of the offset/immediate field in the instruction should be
copied to the least significant byte position.

3. If hl=10, eight zeros should be entered in the least significant
byte position of the offset/immediate register, and the eight-bit
contents of the offset/immediate field in the instruction should be
copied to the most significant byte position.

4. If hl=11, the offset/immediate field should be copied to the
offset/immediate register unchanged.
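The four actions amount to a small multiplexer in front of the offset/immediate register. A behavioural sketch for the two-bit hl pattern (function name ours):

```python
def expand_offset(hl, field):
    """Expand the HIE offset/immediate field to 16 bits from the 2-bit hl code."""
    if hl == 0b00:
        return 0                        # action 1: force 16 zeros
    if hl == 0b01:
        return field & 0xFF             # action 2: zeros in the high byte
    if hl == 0b10:
        return (field & 0xFF) << 8      # action 3: zeros in the low byte
    return field & 0xFFFF               # action 4: copy the full 16-bit field
```

For example, `expand_offset(0b10, 0x34)` yields `0x3400`: the eight-bit field lands in the most significant byte, with zeros below it.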
The above changes are minor compared to the gains in code size
reduction, since the processor typically occupies only around 5% of a
SoC's area, whereas the memory occupies several times that, depending
on the nature of the embedded application.
4.7 CONCLUSIONS
This chapter has proposed Hybrid Instruction Encoding in place of
Fixed Instruction Encoding so as to reduce the code memory size in SoCs.
An HIE-ISA has been proposed for RISC processors supporting multiple
instruction sizes, and four options for immediate and offset fields.
Simulation of HIE has been done with four instruction sizes for MIPS32
processor and the results show code size reduction up to 27%.
Experiments have been conducted on twenty-three benchmark programs
collected from the MiBench and MediaBench suites, using the
custom-built static simulator. All but two programs were reduced by
more than 20%. Among the large programs, one was reduced by more than
25%, another by less than 20%, and the remaining 14 by between 20% and
25%. Considering the significant savings in code memory and chip space
in SoCs, the development of dedicated HIE-RISC processor cores for the
embedded market is recommended.
The instruction fetch and decode logic needs to manage hybrid
instruction lengths and multiple sizes of offset and immediate fields.
These hardware changes require little additional space in the
processor. The processor itself occupies less area than the on-chip
memory in embedded SoCs, and hence HIE reduces the overall chip area
of a SoC. In HIE2, the immediate/offset field has been reduced to 15
bits from the 16 bits of MIPS32; it has been established that the
impact on most embedded programs is negligible. This research work has
estimated the static code size reduction for an HIE-based ISA; dynamic
simulation for evaluating performance and power consumption has not
been done. A marginal performance reduction can be tolerated for BOPES
in view of the savings in chip space and power consumption, and the
use of parallelism through superscalar architecture or multicore SoCs
can compensate for the performance loss of a single processor core.
Additional code size reduction techniques can be integrated with HIE
to increase the extent of reduction; the next two chapters deal with
such techniques.
5. REGISTER MEMORY ARCHITECTURE FOR RISC CORES
The Load-Store Architecture (LSA) of RISC processors is one of the
factors that increase code memory requirements in embedded systems.
This chapter explores code size reduction by incorporating a
Register-Memory Architecture (RMA) in embedded RISC processors.
Modifications required in an existing RISC processor to incorporate
RMA arithmetic/logical instructions are discussed. As a case study,
the MIPS32 instruction set is enhanced with 12 new instructions
supporting memory operands. Experiments on object codes of the MiMedia
benchmarks yield varying extents of code size reduction. The rest of
the chapter is organized as follows. Section 5.1 discusses the
motivation for RMA. Section 5.2 explains the methodology used for
introducing RMA in the MIPS processor. Section 5.3 presents
measurements estimating the resulting code space reduction due to RMA
and discusses the results. Section 5.4 discusses the hardware changes
required in RISC processors to support RMA. Section 5.5 presents the
conclusions.
5.1 MOTIVATION FOR RMA
The original objective of choosing LSA for RISC architectures was to
simplify the instruction pipeline and increase processor performance.
In a load-store ISA, only load and store instructions can access
memory operands; arithmetic/logical instructions operate only on
register operands. Since arithmetic and logical operations on memory
operands are not permitted, the compiler must emit a load instruction
before an add instruction to move the data from memory to a register.
Similarly, the result of an add instruction is placed by the processor
in a register only, so the compiler must emit a store instruction
after the add to move the result to main memory. This restriction
results in a 50% to 100% increase in data transfer instructions in
RISC processors, as pointed out in Chapter 1. The extent of usage of
data transfer and arithmetic instructions in the MIPS object codes of
the MiMedia benchmarks has been estimated using MIDA.
Table 5.1: Data transfer Vs Arithmetic Instructions in Embedded
Applications
Embedded Domain | Benchmark | No. of integer instructions | % of LSI class | % of ALUI class
Automotive and Industrial Control
basicmath 905 18 30
bitcount 1047 36 42
Qsort 463 24 49
susan 12317 50 35
Network dijkstra 114034 35 40
patricia 114181 35 40
CRC32 113623 35 40
Video MPEG2 137957 32 40
Audio ADPCM 113469 35 40
lame 44416 27 37
Image JPEG 17267 39 36
EPIC 121390 34 40
fft 119767 33 39
(susan) 12317 50 35
Speech GSM 124978 34 41
G721 116002 35 40
rsynth 6202 25 43
Security pegwit 125631 34 41
sha 1038 59 31
blowfish 114810 35 40
rjindael 117170 35 41
Text typeset 123805 47 33
stringsearch 113827 35 40
ispell 12021 31 43
Table 5.1 lists the percentage distribution of data transfer and
arithmetic instructions in the object codes of 23 benchmark programs.
For the ALU instructions (ALUI), the operands are in registers and the
results are stored in registers. In load and store instructions (LSI),
one operand is in a register and the other is in memory. The address
of the memory operand is generally specified as the sum of two parts:
the base register contents and an offset in the immediate field.
The proposal in this thesis is to support one memory operand in ALU
instructions, thereby adding a new class of register-memory
instructions alongside the existing register-register ALU
instructions; the instruction set is only marginally enhanced. Though
compiler and processor modifications are required, these are one-time
efforts by processor manufacturers and compiler developers, and there
is no additional burden on embedded system developers. It is also a
program-independent solution for embedded applications. This strategy
can be combined with other methods of code size reduction to achieve
additional reduction.
5.2 METHODOLOGY FOR RMA ALU INSTRUCTIONS
Most arithmetic operations can be carried out in MIPS in both R-type
and I-type formats. In R-type addition, both source operands are in
registers and the result is placed in the destination register. In
I-type addition, one source operand is in a register and the other is
an immediate operand within the instruction. There are four different
add instructions in MIPS, as defined in Table 5.2. The first two, ADD
and ADDU, follow the R-type format; in both, two register contents are
added. ADD can cause an overflow exception, in which case there is no
result; for ADDU, overflow cannot occur. ADDI and ADDIU follow the
I-type format, with no wastage of instruction length; these two
instructions add a register content to an immediate operand present in
the instruction itself.
Table 5.2: MIPS ADD Instructions
MIPS Instruction | Meaning | Purpose | Description
ADD | Add Word | To add 32-bit integers. | rt is added to rs. If no overflow, the result is stored in rd.
ADDU | Add Unsigned Word | To add 32-bit integers; modulo arithmetic. Applicable for address calculation, or for integer arithmetic environments that ignore overflow. | rt is added to rs and the arithmetic result is stored in rd.
ADDI | Add Immediate Word | To add a constant to a 32-bit integer. | The signed immediate is added to rs. Storing the result is as for ADD.
ADDIU | Add Immediate Unsigned Word | To add a constant to a 32-bit integer; functionally similar to ADDU. | The signed immediate is added to rs and the arithmetic result is stored in rd.
5.2.1 Formats for ADDrm Instruction for MIPS
Both R-Type and I-Type instruction formats have been designed for
the ADDrm instruction for MIPS as shown in Figures. 5.1 and 5.2. Table 5.3
lists the four types of RMA-add instructions supporting a memory operand.
In the RM-Type Format, the register Rs-l is used as the base register and
178
an eight bit offset specifies the memory operand. The register operand is in
Rt-rm and the Rd-r field gives the destination register operand retaining the
three address format.
Table 5.3: Proposed RMA ADD Instructions for MIPS
Instruction | Meaning | Purpose | Description
ADD-RM | Add Word reg,mem | To add two 32-bit integers, one in a register and the other in memory. | The signed offset is added to the base, Rs-l, to get the address of the memory operand. The memory word is added to Rt-rm. If no overflow, the result is stored in Rd-r.
ADDU-RM | Add Unsigned Word reg,mem | To add 32-bit integers, one in a register and the other in memory; functionally similar to ADDU. | The signed offset is added to the base, Rs-l, to get the address of the memory operand. The memory word is added to Rt-rm and the arithmetic result is stored in Rd-r.
ADDI-RM | Add Immediate Word reg,mem | To add a constant to a 32-bit integer in memory. | The signed immediate is added to the memory word at the address formed by Rs-l and the offset. Functionally similar to ADD-RM. If no overflow, the result is stored in Rt-i.
ADDIU-RM | Add Immediate Unsigned Word reg,mem | To add a constant to a 32-bit integer in memory; functionally similar to ADDU-RM. | The signed immediate is added to the memory word at the address formed by Rs-l and the offset, and the arithmetic result is stored in Rt-i.
Op Rs-l Rt-rm Rd-r Offset-l Opx-rm
31-26 25-21 20-16 15-11 10-3 2-0
Figure. 5.1: RMA Instruction Format – RM Type
Op Rt-i Rs-l Offset-l Immediate
31-26 25-21 20-16 15-8 7-0
Figure. 5.2: RMA Instruction Format – IM Type
In the IM Format, Rs-l is used as a base register and an eight bit
offset specifies the memory operand. The immediate field gives the other
operand. The Rt-i gives the destination register. For generating memory
address, the datapath already present for load type instructions can be
used. Hence only additional control signals have to be generated by the
opcode decoder.
5.2.2 Proposed RMA opcodes
The RMA instructions introduce two new formats. For the RM format, the
3 lsbs (bits 2-0) define the nature of the operation, whereas the 6
msbs (bits 31-26) give the operation type; for the IM format, the 6
msbs alone indicate the operation. The offset and immediate fields are
only 8 bits wide, which may pose a challenge to the compiler; MIDACC
converts LSA instructions into RMA instructions only if this length
limitation is satisfied. The strategy used by the RMA simulator to
merge the R-type and I-type ADD instructions into a composite RMA
instruction is as follows. MIDACC scans the MIPS object code and
estimates the scope for RMA instructions in the given program by
searching for appropriate sequences of load word (LW)/store word (SW)
and ALU instructions. The formats of the LW and SW instructions are
shown in Figures 5.3(a) and 5.3(b) respectively.
OP Rs-l Rt-l Offset-l
31-26 25-21 20-16 15-0
Figure. 5.3 (a): Format of LW Instruction
OP Rs-s Rt-s Offset-s
31-26 25-21 20-16 15-0
Figure. 5.3 (b): Format of SW Instruction
MIDACC inserts RMA instructions by performing the following actions
(a)-(e):

(a) Scans the object code for an 'LSA sequence': a load preceding an
ALU-type instruction, or a store following an ALU-type instruction.

(b) Determines whether the 'LSA sequence' qualifies for RMA
conversion.

(c) If it is a qualifying sequence, the 'LSA sequence' is deleted and
the appropriate RMA instruction is inserted; all subsequent
instruction addresses are decremented by 4.

(d) An RMA table is created indicating the original instruction
addresses at which RMA instructions have been inserted (for dynamic
simulation assistance).

(e) A 'Step-1 jump address table' is created indicating the original
JUMP instruction address, the step-1 compressed JUMP instruction
address and the new jump target address.
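Steps (a)-(d) amount to a peephole pass over the decoded instruction stream. A schematic sketch follows; the function names and the `qualifies`/`fuse` callbacks are illustrative placeholders, not MIDACC's actual interfaces:

```python
def rma_peephole(instrs, qualifies, fuse):
    """One LSA->RMA conversion pass over a list of decoded instructions.
    `qualifies(a, b)` tests a load/ALU or ALU/store pair; `fuse(a, b)`
    builds the composite RMA instruction. Returns the new stream plus a
    table of original byte addresses of the fused pairs (step (d))."""
    out, rma_table, i = [], [], 0
    while i < len(instrs):
        if i + 1 < len(instrs) and qualifies(instrs[i], instrs[i + 1]):
            out.append(fuse(instrs[i], instrs[i + 1]))
            rma_table.append(4 * i)   # original byte address of the pair
            i += 2                    # pair consumed; later addresses drop by 4
        else:
            out.append(instrs[i])
            i += 1
    return out, rma_table
```

Jump-target fix-up (step (e)) is omitted here; it would rescan `out` and adjust each branch/jump offset by 4 per fused pair between the branch and its target.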
Table 5.4: Types of ALU instruction for the RMA load sequence

MIPS instruction | Format | Opcode, op (bits 31-26) | Opx (Function) (bits 5-0)
ADD | R | 000000 | 100000
ADDU | R | 000000 | 100001
ADDI | I | 001000 | NA
ADDIU | I | 001001 | NA
SUB | R | 000000 | 100010
SUBU | R | 000000 | 100011
AND | R | 000000 | 100100
ANDI | I | 001100 | NA
OR | R | 000000 | 100101
ORI | I | 001101 | NA
XOR | R | 000000 | 100110
NOR | R | 000000 | 100111
Table 5.5: Types of ALU instruction for the RMA store sequence

MIPS instruction | Format | Opcode, op (bits 31-26) | Opx/Function (bits 5-0)
ADD | R | 000000 | 100000
ADDU | R | 000000 | 100001
SUB | R | 000000 | 100010
Table 5.4 is used to identify the ALU-type instruction during RMA code
conversion for the sequence of a load preceding an ALU-type
instruction; Table 5.5 is used for the sequence of a store following
an ALU-type instruction. The qualifying conditions for an LW preceding
an R-type ALU instruction (Figure 5.4) are as follows:

1. The numeric value of 'Offset-l' should not exceed 8 bits.

2. Rt-l should be equal to either Rs-r or Rt-r. If Rt-l equals Rs-r,
Rs-r is dropped and Rt-r is renamed Rt-rm. If Rt-l equals Rt-r, Rt-r
is dropped and Rs-r is renamed Rt-rm.
000000 Rs-r Rt-r Rd-r 00000 100000
Figure. 5.4: R-Type ADD instruction in MIPS
The qualifying conditions for an LW preceding an I-type ALU
instruction (Figure 5.5) are as follows:

1. The numeric value of 'Immediate' should not exceed 8 bits.

2. Rt-l should be equal to Rs-i. If it is equal, Rt-l is dropped.

001000 rs-i rt-i Immediate

Figure. 5.5: I-Type ADD instruction in MIPS

The qualifying conditions for an SW following an R-type ALU
instruction are as follows:

1. The numeric value of 'Offset-s' should not exceed 8 bits.

2. Rd-r should be equal to Rt-s. If Rt-s equals Rd-r, both Rt-s and
Rd-r are dropped.
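The load-side qualifying test can be sketched as a predicate over decoded fields. The field names follow Figures 5.3(a) and 5.4, but the dictionary representation is an illustrative assumption (condition 1 is shown as an unsigned 8-bit check; MIDACC may treat the offset as signed):

```python
def qualifies_lw_rtype(lw, alu):
    """Test the two conditions for fusing an LW with a following R-type ALU op.
    `lw` has fields rs (base), rt (loaded reg), offset; `alu` has rs, rt, rd.
    Returns the fused RMA fields, or None if the pair does not qualify."""
    if not (0 <= lw["offset"] <= 0xFF):           # condition 1: offset fits in 8 bits
        return None
    if lw["rt"] == alu["rs"]:                     # condition 2: loaded reg is a source
        reg_operand = alu["rt"]                   # Rs-r dropped, Rt-r becomes Rt-rm
    elif lw["rt"] == alu["rt"]:
        reg_operand = alu["rs"]                   # Rt-r dropped, Rs-r becomes Rt-rm
    else:
        return None
    return {"rt_rm": reg_operand, "rd": alu["rd"],
            "base": lw["rs"], "offset": lw["offset"]}
```

For the loop of Figure 5.6, `LW r6, 0(r7)` followed by `ADD r1, r1, r6` qualifies: Rt-l (r6) matches Rt-r, so Rs-r (r1) becomes Rt-rm.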
Table 5.6: RMA Instructions Corresponding to MIPS32 Instructions for
Load Sequence
MIPS32 instruction | New RMA instruction
Name | Type | Opcode, op (bits 31-26) | Opx/Function (bits 5-0) | Name | Type | OP (bits 31-26) | Ox-rm (bits 2-0)
ADD | R | 000000 | 100000 | ADD-rm | RM | 101101 | 000
ADDU | R | 000000 | 100001 | ADDU-rm | RM | 101101 | 001
ADDI | I | 001000 | NA | ADDI-rm | IM | 101111 | NA
ADDIU | I | 001001 | NA | ADDIU-rm | IM | 110101 | NA
SUB | R | 000000 | 100010 | SUB-rm | RM | 101101 | 010
SUBU | R | 000000 | 100011 | SUBU-rm | RM | 101101 | 011
AND | R | 000000 | 100100 | AND-rm | RM | 101101 | 100
ANDI | I | 001100 | NA | ANDI-rm | IM | 110110 | NA
OR | R | 000000 | 100101 | OR-rm | RM | 101101 | 101
ORI | I | 001101 | NA | ORI-rm | IM | 111110 | NA
XOR | R | 000000 | 100110 | XOR-rm | RM | 101101 | 110
NOR | R | 000000 | 100111 | NOR-rm | RM | 101101 | 111
Table 5.7: RMA Instructions Corresponding to MIPS32
Instructions for Store Sequence
MIPS32 instruction | New RMA-st instruction
Name | Type | Opcode, op (bits 31-26) | Opx/Function (bits 5-0) | Name | Type | OP (bits 31-26) | Ox-rmst (bits 2-0)
ADD | R | 000000 | 100000 | ADD-rmst | RM | 011111 | 101
ADDU | R | 000000 | 100001 | ADDU-rmst | RM | 011111 | 110
SUB | R | 000000 | 100010 | SUB-rmst | RM | 011111 | 111
Tables 5.6 and 5.7 give the RMA instructions corresponding to the
MIPS instructions, for a load sequence and a store sequence
respectively.
5.2.3 Estimates on Code Size Reduction
The extent of code size reduction with RMA instructions varies with
the nature of the embedded program. Static simulation has been
conducted on the MiMedia source programs using MIDACC, and the results
are discussed in Section 5.3. A sample illustration is given in
Figure 5.6 with C code for a loop operating on an array of 100
elements [27]. The assembly and object codes for both LSA and RMA
were generated manually.

C code for a loop with a variable array index [12]:

Loop: g = g + A[i];
      i = i + j;
      if (i != h) goto Loop;
Assembly Code for LSA Object Code for LSA (All numbers in decimals)
Comments Address object code
Loop : Add r7, r3, r3 ; r7 = 2 * i 8000 0 3 3 7 0 32
Add r7, r7, r7 ; r7 = 4 * i 8004 0 7 7 7 0 32
Add r7, r7, r5 ; r7 = address of A[i] 8008 0 7 5 7 0 32
Lw r6, 0 (r7) ; r6 = A[i] 8012 35 7 6 0
Add r1, r1, r6 ; g = g + A [i] 8016 0 1 6 1 0 32
Add r3, r3, r4 ; i = i + j 8020 0 3 4 3 0 32
Bne r3, r2, Loop ; go to loop if i ≠ h 8024 5 3 2 -24
Assembly Code for RMA Object Code for RMA (All numbers in decimals)
Comments Address object code
Loop : Add r7, r3, r3 ; r7 = 2 * i 8000 0 3 3 7 0 32
Add r7, r7, r7 ; r7 = 4 * i 8004 0 7 7 7 0 32
Add r7, r7, r5 ; r7 = address of A[i] 8008 0 7 5 7 0 32
Addrm r1, r1, 0 (r7) ; g = g + A [i] 8012 0 7 1 1 0 40
Add r3, r3, r4 ; i = i + j 8016 0 3 4 3 0 32
Bne r3, r2, Loop ; go to loop if i ≠ h 8020 5 3 2 -20
Figure. 5.6: Comparison of object codes of LSA and RMA
The array's base is in register r5; registers r1, r2, r3 and r4 are
allotted to variables g, h, i and j; r6 and r7 are used as temporary
registers. The LSA assembly code needs seven statements, whereas the
RMA assembly code needs only six. The LSA object code occupies 28
bytes and the RMA object code 24 bytes, roughly a 14% reduction in
code space. During execution, the loop body runs 100 times, so in
absolute terms this is a considerable reduction in memory I/O
bandwidth, resulting in a large power reduction.
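The savings in Figure 5.6 are easy to tabulate; a one-off calculation (variable names ours):

```python
# Figure 5.6: the loop body shrinks from 7 LSA instructions to 6 RMA
# instructions, all 4 bytes wide, and the body executes 100 times.
lsa_bytes, rma_bytes, iterations = 7 * 4, 6 * 4, 100
static_saving_pct = 100.0 * (lsa_bytes - rma_bytes) / lsa_bytes
fetch_bytes_saved = (lsa_bytes - rma_bytes) * iterations
print(round(static_saving_pct, 1), fetch_bytes_saved)   # → 14.3 400
```

The static saving is one instruction in seven; dynamically, 400 fewer bytes are fetched over the 100 iterations.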
5.3 RESULTS AND DISCUSSION
MIDACC has been used to simulate the RMA environment for the object
codes of the MiMedia suite. The experiments were conducted on an Intel
PC under the Linux operating system. The object codes are converted
from LSA to RMA by MIDACC to estimate the code size reduction due to
RMA for various embedded applications. Besides inserting the RMA
instructions, it also generates the compressed code that can be used
as input to the linker. Further work is necessary for dynamic
simulation. Figures 5.7 to 5.13 depict the static simulation results
for the 23 selected embedded programs, and Figure 5.14 compares RMA
code reduction across embedded domains.
Figure. 5.7: Code size Reduction due to RMA for Automotive Domain
Figure. 5.8: Code size Reduction due to RMA for Network Domain
Figure. 5.9: Code size Reduction due to RMA for Video and Audio
domains
Figure. 5.10: Code size Reduction due to RMA for Image Domain
Figure. 5.11: Code size Reduction due to RMA for Speech Domain
Figure. 5.12: Code size Reduction due to RMA for Security Domain
Figure. 5.13: Code size Reduction due to RMA for Text Domain
Figure. 5.14: Comparison of Code size Reduction due to RMA for
Embedded Domains
It is surprising that, except for two programs (susan and bitcnts),
all others get less than 5% reduction; the maximum is 18%, for susan.
Compared to the reduction percentages obtained by HIE, the reduction
with RMA is almost negligible for most applications. Hence, viewed in
isolation, it is not advantageous to incorporate RMA in RISC
processors. However, RMA can be combined with HIE to increase the
effective code size reduction, and, as discussed in Chapter 6, the use
of compound instructions gives additional reduction. The final
conclusion is presented in Section 5.5, after the discussion of the
required hardware modifications in Section 5.4.
5.4 PIPELINE MODIFICATIONS FOR RMA
The RMA will impact the entire processor architecture with regard
to opcodes, addressing modes, instruction length, instruction encoding,
datapath, control signals, etc. This involves both hardware changes in
processor cores and modifications to the compilers and associated
software tools.
The traditional RISC pipeline sequence discussed in Chapter 2 has
to be rearranged to suit both LSA and RMA instructions by interchanging
the data memory access and execute stages. Figure. 5.15 shows the
proposed 6-stage pipeline for RISC processors that supports the RMA
register-memory instructions. The action taken by the pipeline for the
ADD-RM instruction is as follows. The first two stages are similar to the
RISC pipeline. In the third stage (AC), the memory address for the memory
operand is calculated by a small address adder (as in ARM processors),
and in the fourth stage (MEM), the memory operand is fetched from the
data memory. In the fifth stage (EX), the addition is carried out, and in the
last stage (WB), the result is stored in the destination register.
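The stage timing just described can be sketched as a small model (an idealised schedule with no stalls; stage names follow the proposed 6-stage pipeline):

```python
# Stage sequence of the proposed 6-stage RMA pipeline (IF ID AC MEM EX WB)
# for an ADD-RM instruction: address calculation in AC, memory-operand
# fetch in MEM, addition in EX, write-back in WB.
STAGES = ['IF', 'ID', 'AC', 'MEM', 'EX', 'WB']

def schedule(n_instructions):
    """Clock cycle in which instruction i occupies each stage
    (ideal flow, no stalls): cycle = issue slot + stage index + 1."""
    return [{s: i + k + 1 for k, s in enumerate(STAGES)}
            for i in range(n_instructions)]

timeline = schedule(2)
print(timeline[0]['WB'])   # first ADD-RM completes in cycle 6
print(timeline[1]['MEM'])  # next instruction reaches MEM in cycle 5
```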
Figure. 5.15: Proposed 6-Stage RMA Pipeline
Figure. 5.16 shows the execution of LSA instructions in the six-
stage pipeline. For the LSA ADD instruction, the AC and MEM cycles are
unused. Compared to a 5-stage RISC pipeline, one additional clock cycle is
consumed by LSA instructions in the 6-stage RMA pipeline. The
unused internal cycles do not directly affect performance, since they do not
cause pipeline stalls [54]. However, a slight performance decrease is
expected due to the increase in pipeline length, in view of the increased
frequency of dependencies between successive instructions. It is a question of
choice between performance and code size. For embedded systems, code size is
more significant, and hence the increase in execution time by one cycle is
tolerable.
Figure. 5.16: Execution of LSA Instructions in 6-Stage RMA Pipeline
An alternative approach is possible for the RMA pipeline with 5 stages,
as shown in Figure. 5.17, in which the EX and MEM stages are combined
into a single MEM/EX stage to improve the efficiency of the RMA pipeline for
the LSA instructions. In the 6-stage pipeline, the EX stage is unused by the
LOAD and STORE instructions of LSA, and the MEM stage is unused by the
LSA ADD instruction, as shown in Figure. 5.16. Hence, in the 5-stage RMA
pipeline, the EX and MEM stages can be combined into a single MEM/EX
stage. Therefore, no performance penalty is incurred for LSA instructions.
Figure. 5.17: Execution of LSA Instruction in 5-stage RMA pipeline
Figure. 5.18: Execution of RMA ADDrm Instruction in 5-Stage RMA
pipeline
For the ADD-rm instruction, the MEM/EX stage is recycled, as shown
in Figure. 5.18. In the first MEM/EX cycle, the memory operand is fetched
from the data cache and, in the second MEM/EX cycle, the addition is
performed. As a matter of fact, such a 5-stage approach has been used in
several processors, including the Pentium, R8000 and PA 7100 [54]. The
hardware changes for the RMA arithmetic/logical instructions will
reduce the performance of the processors to some extent, due
to pipeline stalls during memory access. However, since cache
memory is common in present-day embedded systems as well,
pipeline stalls will be rare.
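A minimal sketch of this recycling behaviour, assuming a single shared MEM/EX stage and in-order issue (the cycle-accounting model is illustrative, not a description of any real core):

```python
# In the 5-stage alternative (IF ID AC MEM/EX WB), an RMA ALU instruction
# recycles the combined MEM/EX stage: one cycle to fetch the memory
# operand, one to execute.  A trailing instruction therefore sees a
# one-cycle structural stall at MEM/EX, while pure LSA streams flow freely.

def cycles(kinds):
    """Total cycles to drain an in-order instruction stream through the
    5 stages.  'rma' instructions hold MEM/EX for 2 cycles, others for 1."""
    free_at = 0                            # cycle after which MEM/EX is free
    for i, kind in enumerate(kinds):
        entry = max(i + 4, free_at + 1)    # MEM/EX is the 4th stage
        free_at = entry + (1 if kind == 'rma' else 0)
    return free_at + 1                     # +1 for the final WB cycle

print(cycles(['lsa', 'lsa']))              # two LSA instructions: no stalls
print(cycles(['rma', 'lsa']))              # trailing LSA stalls one cycle
```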
5.5 CONCLUSIONS
This chapter investigates implementation of register-memory
architecture for embedded processors for low power embedded systems in
view of the need for reduced chip size and lower power consumption.
Encoding of appropriate new instructions and pipeline modifications of
existing RISC processors to support RMA arithmetic/logical instructions have
been demonstrated by adding 12 RMA instructions to
MIPS32. In view of the pipeline changes required, one may conclude that the
RMA idea is not cost effective. However, it is essential to view the SoC as a
whole and not the processor alone in isolation. Additional hardware
components required for incorporating the processor changes are almost
free of cost and generation of control signals is not that complex. But
increasing the memory size is costlier. However, performance degradation
due to pipeline stalls during memory access for RMA ALU instructions is a
matter of concern. With multicore and superscalar architectures,
performance degradation can be compensated by appropriate scheduling
of instructions. The integration of RMA with HIE is discussed in Chapter 6.
Similarly, combining the idea of compound instructions with HIE and RMA
offers additional code size reduction as shown in Chapter 6.
6. HYBRID PROCESSOR FOR PORTABLE EMBEDDED
SYSTEMS
The emergence of the Internet of Things (IoT) and the insatiable
demand for smart devices in every aspect of life is driving a complete
overhaul of traditional wisdom in the embedded processor and embedded
memory industry. As electronic devices become smarter, the software code
becomes larger and needs to be processed faster to handle the
communication protocols, authentication, message generation, and so on.
The reality now dawning on the industry is that RISC architecture cannot
meet this new generation of code storage requirements, with embedded
software growing quickly from a few kilobytes to several megabytes.
Core-based design with predefined and preverified modules in
modern SoCs is the state-of-the-art design strategy. In this thesis, the term
processor is used in a broad sense. It includes both core IP in a SoC and
traditional single-chip form since the basic concept of the processor is the
same. Both are the same architecturally irrespective of the different
implementations.
The main goal of the research work is to provide an enhanced ISA
for RISC processor cores so as to produce minimum object code for SoC
based BOPES. This Chapter proposes designing a hybrid processor
incorporating both HIE and RMA along with certain other ISA level
enhancements to meet the code size requirements of portable embedded
systems. To estimate the code size reduction possible with such a
processor, MIPS32 is taken as the target processor and the embedded
object codes of MIPS32 are recoded to HIE-RMA-MIPS so that the overall
code size reduction is measured. The MIDACC has been suitably extended
to work on HIE object codes and to perform HIE on RMA codes. In
addition, certain compound and composite instructions are proposed in this
chapter to further reduce the code size.
6.1 SoC DESIGN AND EMBEDDED SYSTEMS
Typical present-day deep-submicron IC design poses challenges to
SoC design teams. In a generic 0.13 µm standard-cell foundry process,
silicon density routinely exceeds 100,000 usable gates per square
millimetre. Consequently, a low-cost chip with a core area of 50 square
millimetres can carry 5 million logic gates. All embedded systems now
contain significant amounts of software. Three examples for the SoC based
battery powered portable embedded systems, namely smart watch,
scanner and smartphones, are briefly discussed in the following
subsections.
6.1.1 Smart watch
While early models of the smart watch, a computerized wristwatch,
performed basic tasks, such as calculations, translations, and game-
playing, modern smart watches are effectively wearable computers. Many
smart watches run mobile apps, while some models run a mobile operating
system and function as portable media players, offering playback of FM
radio, audio, and video files to the user via a Bluetooth headset. Some
smart watch models, also called watch phones, feature full mobile phone
capability, and can make or answer phone calls. A smart watch may collect
information from internal or external sensors. It may control, or retrieve data
from, other instruments or computers. It may support wireless technologies
like Bluetooth, Wi-Fi, and GPS. However, it is also possible that a
wristwatch computer may merely serve as a front end for a remote system,
as in the case of watches utilizing cellular technology or Wi-Fi. Figure. 6.1
illustrates the block diagram of a smart watch.
Figure. 6.1: Block diagram of smart watch
A smart watch may include features such as a camera, accelerometer,
thermometer, altimeter, barometer, compass, chronograph, calculator, cell
phone, touch screen, GPS navigation, map display, graphical display,
speaker, scheduler, watch, Secure Digital (SD) cards as a mass storage
device, and rechargeable battery. It may communicate with a wireless
headset, heads-up display, insulin pump, microphone, modem, or other
devices. Some also have "sport watch" functionality with activity tracker
features (also known as fitness tracker) as seen in GPS watches made for
training, diving, and outdoor sports. Functions may include training
programs (such as intervals), lap times, speed display, GPS tracking unit,
route tracking, dive computer, heart rate monitor compatibility, cadence
sensor compatibility, and compatibility with sport transitions.
6.1.2 Scanner
The basic function of a scanner is to analyze an image and process
it. Image and text capture allows saving information to a file on the
computer. Handheld 3D scanners are used in industrial design, reverse
engineering, inspection and analysis, digital manufacturing and medical
applications. Colour scanners typically read Red-Green-Blue (RGB) colour
data from the array. This data is then processed with some proprietary
algorithm to correct for different exposure conditions, and sent to the
computer via the device's input/output interface. By combining full-colour
imagery with 3D models, modern hand-held scanners are able to
completely reproduce objects electronically. The addition of 3D colour
printers enables accurate miniaturization of these objects, with applications
across many industries and professions. The size of the file created
increases with the square of the resolution. The file size can be reduced for
a given resolution by using "lossy" compression methods such as JPEG, at
some cost in quality. If the best possible quality is required, lossless
compression should be used; reduced-quality files of smaller size can be
produced from such an image when required. A scanning utility and some
type of image-editing application (such as Photoshop), and optical
character recognition (OCR) software are required. The OCR software
converts graphical images of text into standard text that can be edited
using common word-processing and text-editing software. Figure. 6.2
shows the block diagram of a typical hand held scanner designed around a
SoC.
Figure. 6.2: Block diagram of a scanner
6.1.3 Smartphones
A smartphone (or smart phone) is a mobile phone with more
advanced computing capability and connectivity than basic feature phones.
Early smartphones typically combined the features of a mobile phone with
those of another popular consumer device, such as a personal digital
assistant (PDA), a media player, a digital camera, and/or a GPS navigation
unit. Later smartphones include all of these plus the features of a
touchscreen computer, including web browsing, Wi-Fi, and third-party
applications.
The software for smartphones can be visualized as a software stack.
The stack consists of the following layers:
1. kernel -- management systems for processes and drivers for
hardware
2. middleware -- software libraries that enable smartphone
applications (such as security, Web browsing and messaging)
3. application execution environment (AEE) -- application
programming interfaces, which allow developers to create their
own programs
4. user interface framework -- the graphics and layouts seen on
the screen
5. application suite -- the basic applications users access
regularly such as menu screens, calendars and message
inboxes
Figure. 6.3 gives a functional view of the Snapdragon, a popular SoC
used in many smartphones.
Figure. 6.3: Block diagram for the Snapdragon S4 SoC
using Krait CPUs
Inside the Snapdragons, not only are the processing cores, graphics
chip and media accelerators present in a single SoC, but full wireless
radios, GPS and RAM are also present in the package. MediaTek has
recently launched the MT6795, a 64-bit octa-core SoC for high-end
smartphones.
6.2 ENHANCEMENT TO HIE AND RMA CODES
The following additional techniques were applied to obtain higher code
reduction for the MIPS object codes through suitable extensions to MIDACC.
1. Introducing 16-bit two-address compound instructions in
HIE2 to replace instructions with the same two source registers,
and instructions with the same source and destination register.
2. Introducing two composite instructions - loadmultiple and
storemultiple - to replace consecutive load/store instructions.
3. Performing HIE simulation on the RMA codes to estimate
combined effects of HIE and RMA.
The results of all three techniques are shown in Table 6.1. It is
interesting to note that the two benchmarks - susan and bitcount - that gave
excellent code reduction both for HIE and RMA suffered the most from the
cumulative effect of HIE on RMA codes (A-B). This is because the RMA
conversion has eliminated part of the HIE scope, since the RMA has
better code density compared to RISC. The least affected benchmark
is sha, the application that got the minimum code reduction for RMA. The
MIDACC extender estimates the code size reduction due to the use of
composite and compound instructions. The overall code reduction including
all three techniques in addition to HIE2 varies from 21.95% to 44.44%.
Interestingly, as in the case of reduction due to HIE, lame gets the lowest
overall code reduction and susan gets the maximum.

Table 6.1: Integration of code reduction schemes and compound/composite
instructions
(A = HIE PCR + RMA PCR; B = actual HIE-RMA PCR; C = PCR due to
compound instructions; D = PCR due to composite instructions;
Overall PCR = B + C + D)

Benchmark        A       B      C      D    Overall
basicmath      24.12   22.02   1.16   1.47   24.65
bitcount       34.11   30.62   3.49   3.37   37.48
qsort          27.16   26.08   3.46   3.38   32.92
susan          44.72   36.30   6.91   1.23   44.44
dijkstra       25.13   23.72   2.39   4.18   30.29
patricia       25.16   23.73   2.39   4.20   30.32
CRC32          25.13   23.72   2.40   4.19   30.31
MPEG2          24.57   23.27   2.37   4.07   29.71
ADPCM          25.12   23.71   2.40   4.19   30.30
lame           20.02   19.10   0.79   2.06   21.95
JPEG           30.64   28.75   2.79   5.34   36.88
EPIC           25.01   23.62   2.40   4.11   30.13
fft            24.75   23.43   2.29   3.97   29.69
GSM            24.87   23.54   2.51   4.03   30.08
G721           25.18   23.78   2.40   4.21   30.39
rsynth         23.63   23.20   1.89   3.00   28.09
pegwit         25.05   23.71   2.70   4.13   30.54
sha            23.79   23.56   2.09   6.15   31.80
blowfish       25.18   23.76   2.49   4.14   30.39
rijndael       25.21   23.84   2.73   4.11   30.68
typeset        22.49   21.82   1.53   2.10   25.45
stringsearch   25.12   23.71   2.41   4.19   30.31
ispell         21.46   20.60   1.73   3.78   26.11
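The column relationship Overall PCR = B + C + D can be checked directly against rows of Table 6.1:

```python
# Consistency check of Table 6.1: the overall percentage code reduction
# is the sum of the actual HIE-RMA reduction (B), the compound-instruction
# gain (C) and the composite-instruction gain (D).
rows = {                 # benchmark: (B, C, D, overall) from Table 6.1
    'susan':    (36.30, 6.91, 1.23, 44.44),
    'lame':     (19.10, 0.79, 2.06, 21.95),
    'bitcount': (30.62, 3.49, 3.37, 37.48),
}
for name, (b, c, d, overall) in rows.items():
    assert round(b + c + d, 2) == overall, name
print('all rows consistent')
```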
6.3 FUTURE ENHANCEMENT TO HIE-RMA
The following additional techniques can be adopted to achieve
increased code reduction for the MIPS object codes, and the extent of code
reduction can be estimated by suitable modifications to MIDACC.
1. The maximum size of immediate/ offset can be further reduced.
2. ALU and shift can be combined in a single instruction.
3. A 40-bit instruction can be introduced so that the RMA logic can
be enhanced to cover 16-bit offsets in the load instructions as
well. Presently such cases are not considered for RMA
conversion. This amounts to having five different lengths in the
hybrid instruction encoding, which will give scope for new types
of composite instructions and, in turn, additional code size
reduction.
6.4 HIE AND ILP
Usually, RISC processors are considered more suitable for
superscalar architecture than CISC processors, though Intel has
successfully implemented superscalar architecture in its CISC processors
from the Pentium onwards. In these days of multicore SoCs, many
ideas are being practiced that can also be used in HIE-RISC processors. Some of
these are discussed below.
6.4.1 Hybrid-Length Instructions and Instruction Fetch
A severe criticism against hybrid / variable-length instructions is the
complexity of instruction fetch. However, this has been effectively
implemented in Intel 80486 [10]. Because the processor does not know the
length of the next instruction to be fetched, a typical strategy is to fetch a
number of bytes or words equal to at least the longest possible instruction.
This means that sometimes multiple instructions are fetched. The Intel 80486
is a processor with a five-stage pipeline and supports instructions of variable
length (from 1 to 11 bytes not counting prefixes). It has two 16-byte prefetch
buffers for the instructions. The status of the prefetcher relative to the other
pipeline stages varies from instruction to instruction. On an average, about five
instructions are fetched with each 16-byte load [10]. The instruction fetch
stage operates independently of other stages to keep the prefetch buffers full.
The Itanium processor [10] follows IA-64 architecture that supports
instruction-level parallelism. IA-64 defines a 128-bit bundle that contains
three instructions. The processor can fetch instructions one or more
bundles at a time; each bundle fetch brings in three instructions.
Instructions are fetched through an L1 instruction cache and fed into a
buffer that holds up to eight bundles of instructions. When deciding on
functional units for instruction dispersal, the processor views at most two
instruction bundles at a time.
Parallel fetch and decode is complicated by the need to examine
multiple bytes of an instruction before the start address of the next
sequential instruction is known. The Intel P6 microarchitecture can decode
three variable-length x86 instructions in parallel, but the second and third
instructions must be simple [55]. The P6 performs speculative decodes at
each byte position, and then muxes out the correctly decoded instructions once
the lengths of the first and second instructions are known. The AMD Athlon
predecodes instructions during cache refill to mark the boundaries between
instructions and the locations of opcodes, and scans and aligns multiple
variable-length instructions [56]. The Pentium-4 design [57] improves on
the P6 family by caching decoded fixed-length micro-ops in a trace cache.
These legacy CISC ISAs were not designed with parallel fetch and decode
in mind.
Two simple designs of instruction decoding popularly used in
processors with variable-length instruction sets are shown in Figures. 6.4 and 6.5.
Figure. 6.4 uses an additional instruction decoder in the instruction pipeline.
The first decoder stage determines instruction lengths and steers the
instructions to the second stage, where the actual instruction decoding is
performed. The design methodology used in decoding x86 variable-length
instructions uses pre-decoding, as shown in Figure. 6.5, to mark instruction
lengths in the code cache. This reduces the number of decode stages in the
pipeline, but the need to hold the resolved instruction information requires a
larger cache.
6.4.2 Instruction Fetch and Cache Access
Although embedded processors traditionally had simple single-issue
pipelines, current designs have deeper pipelines or superscalar issue to
meet higher performance requirements. A new heads-and-tails (HAT)
format [58] allows variable-length instructions to be held in the cache yet
remain easily indexable for parallel fetch and decode. This format can be
used for HIE-MIPS also, to take advantage of the high code density of
hybrid-length instructions while enabling deeply pipelined or superscalar
processors. The HAT format packs multiple variable-length instructions into
fixed-length bundles. Each instruction is split into a fixed-length head
portion and a variable-length tail portion, as shown in Figure. 6.6. In the
MIPS-HAT scheme, the head size was 10 bits. For HIE-MIPS, six bits are
sufficient for the head. The fixed-length heads are packed together in
program order at the start of the bundle, while the variable-length tails are
packed together in reverse program order at the end (i.e., the first tail is at
the end of the bundle). Bundles contain varying numbers of instructions, so
each bundle begins with a small fixed-length field holding the number of the
last instruction in the bundle, i.e. a bundle holding N instructions has N-1 in
this field. The remainder of the bundle is used to hold instructions. When
packing instructions into bundles, there can be internal fragmentation if the
next instruction does not fit into the remaining space in a bundle, in which
case the space is left empty and a new bundle is started [58]. The program
counter (PC) in a HAT scheme is split into a bundle number held in the high
bits and an instruction offset held in the low bits. During sequential
execution, the PC should be incremented as usual, but after fetching the
last instruction in a bundle (as given by the instruction count stored in the
bundle), it should skip to the next bundle by incrementing the bundle
number and resetting the instruction offset to zero. A PC value points
directly to the head portion of an instruction and, because they are fixed-
length, multiple sequential instruction heads can be fetched and decoded in
parallel. The tails are still variable-length, however, and so the heads must
contain enough information to locate the correct tail.
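The PC sequencing just described can be sketched as follows (the 4-bit offset-field width is an illustrative assumption; the HAT scheme does not mandate a particular split):

```python
# HAT-style program-counter sequencing: the PC splits into a bundle
# number (high bits) and an instruction offset (low bits); after fetching
# the last instruction of a bundle, the PC skips to offset 0 of the next
# bundle.  The 4-bit offset width below is only an illustrative choice.
OFFSET_BITS = 4                        # assume up to 16 instructions/bundle

def next_pc(pc, count_field):
    """count_field holds N-1 for a bundle of N instructions."""
    bundle = pc >> OFFSET_BITS
    offset = pc & ((1 << OFFSET_BITS) - 1)
    if offset == count_field:          # just fetched the last instruction
        return (bundle + 1) << OFFSET_BITS
    return pc + 1                      # sequential within the bundle

pc = next_pc(0x20, count_field=2)      # bundle 2, offset 0; bundle holds 3
print(hex(pc))                         # 0x21: still inside bundle 2
print(hex(next_pc(0x22, 2)))           # 0x30: wraps to bundle 3, offset 0
```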
The heads-and-tails (HAT) format supports parallel fetch and
decode of compact variable-length instruction sets directly from cache. The
HAT format helps an implementation deliver multiple, variable- sized,
randomly-accessible instruction units to the CPU in a single cycle or
alternatively enables a deeply-pipelined fetch of such units. The HAT
format is used both in main memory and cache, although additional
information might be added to the cached version to improve performance.
Figure. 6.4: Two-stage Instruction Decoding (code cache, length decoder,
input buffer, instruction steering logic, and parallel decoders)
Figure. 6.5: Predecoding and Marking Instruction Lengths
Figure. 6.6: Heads and Tails Format
A cache line could contain one or more bundles. Similar to a
conventional variable-length scheme, the tail size information in the head of
one instruction must be decoded to ascertain the location of the start of the
tail of the next instruction. But in the HAT format the length information for
each instruction is held at a fixed spacing in the head instruction stream,
independent of the length of the whole instruction. This makes the critical
path to determine tail alignment for multiple parallel instructions much
shorter than in a conventional variable-length scheme, where the location
of the length information in the next instruction depends on the length of the
current instruction. The tails in a HAT scheme are delayed relative to the
heads, but the head and tail fetches can be pipelined independently. The
authors of the HAT scheme [58] experimented with MIPS codes and showed
that the MIPS-HAT format can provide a compression ratio of 75.5%
(Percentage reduction of 24.5%) and a dynamic fetch ratio reduction of
75.0% while supporting deeply pipelined or superscalar execution. They
developed a simple MIPS instruction compression scheme by re-encoding
the MIPS ISA into a variable-length format, and mapping the resulting
variable-length instructions into the HAT format. They evaluated the use of
both 128-bit and 256-bit bundles for MIPS-HAT. In their scheme, each
instruction can be one of six sizes ranging from 15 to 40 bits. On the other
hand, HIE-MIPS supports instructions of 12 sizes ranging from 8 to 32
bits. Hence HIE-MIPS should offer better results in the HAT format. Figure.
6.7 shows HIE2-MIPS instructions recoded into the HAT scheme.
The HAT scheme has a number of advantages over conventional
variable-length schemes. Fetch and decode of multiple variable-length
instructions can be pipelined or parallelized. Unlike conventional variable-
length formats, it is impossible to jump into the middle of an instruction. The
variable alignment muxes needed are smaller than in a conventional
variable-length scheme, because they only have to align bits from the tail
and not from the entire instruction length. The fixed-length heads are
handled using a much simpler and faster mux. The HAT format guarantees
that no variable-length instruction straddles a cache line or page boundary,
simplifying instruction fetch and handling of page faults.
The HAT scheme operates all the length decoders in parallel, and
then sums their outputs to determine tail alignments, as shown in
Figure. 6.8. The performance impact of the additional latency for the tails
can be partly hidden if more latency-critical instruction information is
located in the head portions.
Figure. 6.7: HIE2-MIPS Instruction Types in HAT Scheme (the figure lists
the head and tail encodings of the HIE2-MIPS instruction types A to I,
covering the 8-bit to 32-bit instruction formats)

Figure. 6.8: Variable-length decoding in a HAT Scheme

6.5 HIE-MIPS vs. microMIPS/Thumb2
For several years, MIPS16 and ARM's Thumb were used for
embedded applications, as discussed in Chapter 2, for code size reduction.
These two instruction sets are not independent ISAs and incur a severe
performance penalty due to several restrictions. For instance, MIPS16 and
Thumb supported only a reduced number of registers. Subsequently, both
ARM and MIPS introduced Thumb2 and microMIPS with two instruction
length options: 16 bit and 32 bits. These perform internal recoding of
instructions to the original ARM or MIPS processor versions and yet do not
provide full facilities to all instructions. For instance, some 16-bit microMIPS
instructions can access only eight GPRs. In total, the microMIPS ISA adds
54 new instructions [59]. The instruction decoder performs two operations
in sequence: first, it translates the microMIPS instruction into a MIPS32
instruction; then the MIPS32 instruction itself is decoded, thus
incorporating dual decoders. On the other hand, HIE-MIPS will have an
independent ISA.
reduction of 30% compared to MIPS32. The HIE-RMA-MIPS version
achieves over 44% code size reduction. The Thumb2 has about 24K gates
whereas the microMIPS has approximately 33K gates. The microMIPS
offers 98% of the performance of MIPS32. The HIE-RMA-MIPS needs a
single decoder and takes less decode time than microMIPS.
6.6 DISCUSSION AND CONCLUSION
Present-day smartphones are highly advanced, with quad-core and
octa-core processors. These smartphones have flash memory of 32 GB or
more. Most of them support the entire set of embedded applications covered
by MiMedia. Hence the sizes of all these can be summed to estimate the code
memory requirement. The total size exceeds one crore (10 million) bytes. The
average HIE+RMA code reduction percentage of 26% translates to a saving
of about 2.6 megabytes of code memory, without considering the OS and
other system software that is also part of the embedded core memory.
The aim of this research is to enhance the 30-year-old RISC
architecture with features that make RISC processors more relevant to
battery-operated embedded systems based on SoCs. Hybrid Instruction
Encoding with hybrid offset and immediate fields has been shown to
improve the code density considerably. The work undertaken generated a
primitive but reasonably functional set of tools for static simulation of
MIPS32-based HIE and RMA architecture. The developed tool chain
helped in studying the behaviour of embedded codes and also in estimating
the code size reduction for three different variations. This work has thus
provided strong evidence that the HIE-RMA can indeed serve as an
embedded processor architecture for applications that do not demand
extremely high performance. Actually, the HIE-RMA should increase
performance, since the size of the code cache is also reduced. However, the
increase in instruction decoding time increases the execution time of a single
instruction. This will not make much impact on superscalar processors.
However, the performance of this architecture appears to trail
traditional CISC and RISC processors in terms of instruction fetch and
execution efficiency. Viewed in totality in the context of multicore and
superscalar architecture, the individual processor performance is not a
concern but achieving effective parallelism and minimising memory size are
more relevant for embedded SoCs. Configurable processor technology is
fast emerging particularly in the design of higher performing consumer
products. It allows customization of the core processor, which can have
both performance and power impact on the SoC embedded design. The
Tensilica Xtensa LX2 Series is one such configurable processor [60].
The following sections provide a summary of the research work,
outline the limitations of the described work, and offer suggestions for
future work.
6.6.1 Summary of Contributions
The main contributions of this work are summarized below.
Behaviour Analysis of Embedded Applications
The object codes of the embedded applications are profiled by the
custom-built software tool, MIDACC. Applications from two representative
sets of embedded benchmarks, MiBench and MediaBench, are cross-
compiled for the MIPS32 processor for this purpose. Apart from measuring
the static instruction frequencies, this tool measures the total amount of
underutilization of the offset and immediate fields in the object code.
Design of Hybrid Instruction Encoding for RISC Processors
Two versions of Hybrid Instruction Encoding (HIE) are designed for
supporting multiple instruction sizes and hybrid lengths for the offset and
the immediate fields. For each of the 66 integer instructions of MIPS32, an
equivalent HIE instruction has been designed.
Design of Register Memory Architecture (RMA) for RISC
Processor
This part of the research work involves the design of 12 RMA ALU
instructions for the MIPS processor. Appropriate formats are chosen for the
RMA instructions. The traditional RISC pipeline sequence is rearranged to
suit both LSA and RMA instructions. Identification of the required datapath
changes is also carried out.
Hybrid RISC Processor
For simulating a hybrid processor incorporating both HIE and RMA,
the embedded object codes of MIPS32 are recoded to the HIE-RMA
processor using the custom built code converter and the code size
reduction is measured.
Developing Static Simulator for HIE / RMA
To estimate the efficiency of the HIE and RMA for RISC processor, a
code converter tool, MIPS Instruction Distribution Analyser cum Code
Converter (MIDACC), has been developed. The MIDACC converts the
object codes from MIPS ISA to HIE/RMA-ISA. This tool also measures the
code size savings for embedded applications in MiBench and MediaBench
benchmark suites.
6.6.2 Limitations of Described Research Work
The uniqueness of the proposed architecture greatly limited reuse of
existing infrastructure and tools. Thus, much of the research period was
used to develop tools and perform static simulation. The lack of a suitable
compiler to attempt dynamic simulation was a major hurdle. Since the
custom-built tool MIDACC presently recognises only MIPS32 object codes,
no comparison with ARM-like processors could be made. Popular processor
simulators such as SimpleScalar and ArchC need extensive modification to
try HIE-RMA for RISC processors.
6.6.3 Areas for Future Work
There are some areas of broader study that are left open for future
research.
6.6.3.1 MicroMIPS and Thumb2 versions with HIE-RMA
MIPS32 is a pure general-purpose RISC processor, whereas
microMIPS and Thumb2 are embedded processor versions. These offer
large scope for redesign with HIE-RMA.
6.6.3.2 Reconfigurable HIE-RMA version
Reconfigurability would give embedded developers scope for
optimising application-dependent features.
6.6.3.3 Dynamic simulation
A SimpleScalar-like simulator can be modified and dynamic
simulation carried out to accurately estimate performance and power
consumption.
6.6.3.4 FPGA Processor design
A HIE-RMA processor prototype can be developed based on
microMIPS or Thumb2.
6.6.3.5 Compiler tool chain
A tool chain targeting the HIE-RMA processor is also required, so
that the FPGA prototype can be put to use.
6.6.3.6 HIE-RMA-HAT Processor
A new ISA applying the HAT scheme to the HIE-RMA processor
core, suited to multicore SoCs, can be developed.
REFERENCES
[1] Joseph Byrne and Tom R. Halfhill, “A Guide to Embedded
Processors”, Linley Group, Seventh Edition, CA, 2012.
[2] Xie Y., Wolf W. and Lekatsas H., “Code Compression for
Embedded VLIW Processors Using Variable-to-Fixed Coding”, IEEE
Trans. VLSI Systems, Vol. 14, pp. 525-536, 2006.
[3] Hennessy J. L and Patterson D. A, “Computer Architecture: A
Quantitative Approach”, Morgan Kaufmann, Fifth Edition, San
Francisco, CA, 2012.
[4] Santhosh Chede and Kishore Kulat, “Design Overview of Processor
Based Implantable Pacemaker”, Journal of Computers, Vol. 3,
pp. 49-57, 2008.
[5] Frank Vahid and Tony Givargis, “Embedded System Design: A
Unified Hardware / Software Introduction”, Third Edition, John Wiley
& Sons, U.K, 2002.
[6] Noergaard T., “Embedded Systems Architecture: A Comprehensive
Guide for Engineers and Programmers”, Elsevier, 2005.
[7] Fisher J. A, Faraboschi P. and Young C., “Embedded Computing: A
VLIW Approach to Architecture, Compilers, and Tools”, Morgan
Kaufmann, San Francisco, CA, 2005.
[8] Raj Kamal, “Embedded Systems: Architecture, Programming and
Design”, Mc Graw-Hill, Second Edition, 2008.
[9] Govindarajalu B., “Computer Architecture and Organization: Design
Principles and Applications”, Mc Graw-Hill, Second Edition, 2010.
[10] William Stallings, “Computer Organization and Architecture:
Designing for Performance”, Pearson Education Inc, Eighth edition,
2010.
[11] Cragon H.G, “Computer Architecture and Implementation”,
Cambridge University Press, UK, 2000.
[12] Bhandarkar D. and Clark D. W, “Performance from Architecture:
Comparing a RISC and a CISC with similar Hardware Organization”,
ACM SIGARCH Computer Architecture News, Vol. 19, pp. 310-319,
1991.
[13] Patterson D. A and Sequin C.H, “A VLSI RISC”, Computer,
Vol. 15, pp. 8-21, 1982.
[14] Moore G.E, “Cramming more components onto integrated circuits”,
Electronics, Vol. 38, pp. 114-117, 1965.
[15] Patterson D.A and Ditzel D.R, “The case for the reduced instruction
set computer”, SIGARCH Comp. Arch. News, Vol. 8, 1980.
[16] Colwell R.P, Hitchcock C.Y, Jensen E.D, Sprunt H.M.B and
Kollar C.P, “Instruction sets and beyond: computers, complexity, and
controversy”, Computer, Vol. 18, pp. 8-19, 1985.
[17] Flynn M.J, Mitchell C.L and Mulder J.M, “And now a case for more
complex instruction sets”, Computer, Vol. 20, 1987.
[18] Rajesh kumar T.S, “On-Chip Memory Architecture Exploration of
Embedded System on Chip”, PhD thesis, Supercomputer Education
and Research Centre, Indian Institute of Science, Bangalore, 2008.
[19] Balasa F., Catthoor F. and De Man H., “Background memory area
estimation for multidimensional signal processing systems”, IEEE
Trans. VLSI system, Vol. 3, pp. 157-172, 1995.
[20] International Technology Roadmap for semiconductors,
SEMATECH, 3101, Industrial Terrace Suite, 106 Austin TX
78758, 2001.
[21] Steve Furber, “ARM System-on-Chip Architecture”, Pearson
Education Limited, Second Edition, 2000.
[22] Nam Sung Kim, Todd Austin, David Blaauw, Trevor Mudge,
Krisztian Flautner, Jie S. Hu, Mary Jane Irwin, Mahmut Kandemir,
and Vijaykrishnan Narayanan, “Leakage current: Moore's law meets
static power”, Computer, Vol. 36, pp. 68-75, 2003.
[23] Smith A.J, “Cache memories”, ACM Computing Surveys, Vol. 14,
pp. 473-530, 1982.
[24] Grishman Ralph, “Assembly Language Programming for the Control
Data 6000 Series”, Algorithmics Press, Second Edition, p.12, 1974.
[25] Jack J. Dongarra, “Numerical Linear Algebra on High-Performance
Computers”, p. 6, 1987
[26] Emily Blem, Jaikrishnan Menon, and Karthikeyan Sankaralingam,
“Power Struggles: Revisiting the RISC vs. CISC Debate on
Contemporary ARM and x86 Architectures”, 19th IEEE
International Symposium on High Performance Computer
Architecture (HPCA), 2013.
[27] Patterson D. A and Hennessy J. L, “Computer Organization &
Design: The Hardware / Software Interface”, Morgan Kaufmann
Publishers, Fourth Edition, San Francisco, CA, 2009.
[28] Heikkinen J., Takala J. and Corporaal H., “Dictionary-based Program
Compression on Customizable Processor Architectures”,
Microprocessors and Microsystems, Vol. 33, pp. 139-153, 2009.
[29] Huffman D. A, “A method for the construction of minimum-
redundancy codes”, in Proc.IRE, pp. 1098-1101, 1952.
[30] Wolfe A. and Chanin A., “Executing compressed programs on an
embedded RISC architecture”, Int. Symp. Microarch, pp. 81–91,
1992.
[31] Kozuch M. and Wolfe A., “Compression of embedded system
programs”, IEEE International Conference Computer Design,
Cambridge, MA, pp. 270-277, 1994.
[32] Lekatsas H. and Wolf W., “Code Compression for embedded
systems”, 35th Conference on Design Automation, San Francisco,
CA, pp. 516-521, 1998.
[33] Lefurgy C., Bird P., Chen I-C. and Mudge T., “Improving code
density using compression techniques”, 30th Annual International
Symposium on Microarchitecture, Research Triangle Park, NC,
pp. 194-203, 1997.
[34] Araujo G., Centoducatte P., Cortes M. and Pannain R., “Code
compression based on operand factorization”, 31st Annual
ACM/IEEE International Symposium on Microarchitecture, Dallas,
TX, USA, pp. 194-201, 1998.
[35] Xie Y. and Wolf W., “Profile-driven code compression”, Des. Autom.
Test Eur, pp. 462-467, 1992.
[36] Bell T., Cleary J., and Witten I., “Text Compression”, Prentice Hall,
1990.
[37] Lin K.J and Wu C.W, “A Low-Power CAM Design for LZ Data
Compression”, IEEE Trans. Computers, Vol. 49, pp. 1139-1145,
2000.
[38] Benini L., Menichelli F. and Olivieri M., “A Class of Code
Compression Schemes for Reducing Power Consumption in
Embedded Microprocessor Systems”, IEEE Trans.Computers, Vol.
53 , pp. 467-482, 2004.
[39] Lin C.H, Xie Y. and Wolf W., “Code Compression for VLIW
Embedded Systems Using a Self - Generating Table”, IEEE Trans.
VLSI Systems, Vol. 15, pp. 1160-1171, 2004.
[40] Kemp T.M, Montoye R.K, Harper J.D, Palmer J.D, and Auerbach
D.J, “A decompression core for PowerPC”, IBM J. Res. Develop.,
Vol. 42, pp. 807-812, 1998.
[41] Bird P. and Mudge T., “An instruction stream compression
technique”, Elect. Eng. Comp. Sci. Dept., Univ. Michigan, Ann Arbor,
MI, Tech. Rep. CSE-TR-319-96, 1996.
[42] Guido Araujo, Paulo Centoducatte, Rodolfo Azevedo, and Ricardo
Pannain, “Expression-Tree-Based Algorithms for code compression
on Embedded RISC Architectures”, IEEE Trans. VLSI Systems,
Vol. 8, pp. 530-533, 2004.
[43] Weiss A.R, “The Standardization of Embedded Benchmarking:
Pitfalls and Opportunities”, Int'l Conf. on Computer Design
(ICCD'99), Austin, TX, pp. 492-498, 1999.
[44] Cooper K. D and McIntosh N., “Enhanced code compression for
embedded RISC processors”, SIGPLAN Conf. Program. Lang. Des.
Implement, pp. 139-149, 1999.
[45] Debray S. and Evans W., “Profile-guided code compression”, Conf.
Program. Lang. Des. Implement, pp. 95-105, 2002.
[46] Debray S. and Evans W., “Cold code decompression at runtime”,
Commun. ACM, Vol. 46, pp. 54-60, 2003.
[47] Ozturk O., Saputra H., Kandemir M., and Kolcu I., “Access pattern-
based code compression for memory-constrained embedded
systems”, Des. Autom. Test Eur. Conf. Expo, pp. 882-887, 2005.
[48] Shogan S. and Childers B. R, “Compact binaries with code
compression in a software dynamic translator”, Des. Autom. Test
Eur. Conf. Expo, pp. 1052-1057, 2004.
[49] Sloss A.N, Symes D. and Wright C., “ARM System Developer's
Guide: Designing and Optimizing System Software”, Morgan
Kaufmann Publishers, San Francisco, CA, 2004.
[50] Gerard P.M and Charles P.M, “The RISC Processor DMN-6: A
Unified Data-Control Flow Architecture”, ACM SIGARCH Computer
Architecture News, Vol. 24, 1996.
[51] Lee C., Potkonjak M. and Mangione-Smith W.H, “MediaBench: A
Tool for Evaluating and Synthesizing Multimedia and
Communications Systems”, Proc. Int'l Symp. Microarchitectures,
pp. 330-335, 1997.
[52] Guthaus M. R, Ringenberg J. S, Ernst D., Austin T. M., Mudge T.
and Brown R. B, “MiBench: A free, Commercially Representative
Embedded Benchmark Suite”, In Proceedings of the 4th Annual
Workshop on Workload Characterization, pp. 3-14, 2001.
[53] EDN Embedded Microprocessor Benchmark Consortium, CA, 2013.
[54] Sima D., Fountain T. and Kacsuk P., “Advanced Computer
Architectures: A design space approach”, Pearson Education, 1997.
[55] Circello J., “The superscalar architecture of the MC68060”, IEEE
Micro, Vol. 15, pp. 10–21, 1995.
[56] AMD Athlon Processor x86 Code Optimization, chapter Appendix A:
AMD Athlon Processor Microarchitecture. AMD Inc., 220071-0
edition, 2000.
[57] Hinton G., “The microarchitecture of the Pentium 4 processor”, Intel
Technology Journal, Q1 2001.
[58] Heidi Pan and Krste Asanović, “Heads and Tails: A Variable-Length
Instruction Format Supporting Parallel Fetch and Decode”,
CASES'01, Atlanta, Georgia, USA, 2001.
[59] “microMIPS Instruction Set Architecture”, MIPS Technologies, Inc.,
2009
[60] Greg Osborn, “Embedded Microcontrollers and Processor Design”,
Pearson Education Limited, 2012.
LIST OF PUBLICATIONS
[1] Govindarajalu B. and Mehata K.M, “Code Size Reduction in
Embedded Systems with Redesigned ISA for RISC Processors”,
International Journal of Computer Applications, Vol. 64, No. 12,
pp. 38-45, 2013.
[2] Govindarajalu B. and Mehata K.M, “A case for hybrid instruction
encoding for reducing code size in embedded system-on-chips
based on RISC processor cores”, J. Comput. Sci., Vol. 10,
pp. 411-422, 2014.
[3] Govindarajalu B., Mehata K.M and Ramakrishnan R., “Enhanced
hybrid instruction encoding for portable embedded Systems”,
International Journal of Embedded Systems, InderScience
Publishers, Communicated.
APPENDIX 1
MIDACC ARCHITECTURE
A1.1 INTRODUCTION
Figure. A1.1 shows the functional block diagram of MIDACC
indicating different functional modules other than MIDACC extender.
Figure. A1.1: Functional block diagram of MIDACC
A1.2 MIDACC INTERNALS
MIDACC’s code analyser is named MIDA and the code converter,
MICC. The MIDACC extender is an independent software tool for
estimating the scope for compound and composite instructions for HIE-
MIPS.
A1.2.1 MIDA Internals
Given a MIPS32 code, MIDA profiles the code and produces the
following statistics.
A1.2.1.1 Instruction class distribution
The following are the steps involved in finding the instruction class
distribution:
1) Scan the instruction
2) Identify the instruction and its type using table A3.1 and
maintain the count for each instruction type
3) Increment the instruction address by 4 and repeat steps (1) and
(2) for all the instructions.
4) Add the counts of instructions belonging to the same type to get
the total count for each instruction type.
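The class-distribution scan can be sketched in Python. The opcode-to-class map below is a tiny illustrative stand-in for table A3.1, which covers all 66 instructions; MIDACC itself is a Windows/.NET tool, so this only sketches the logic:

```python
from collections import Counter

# Illustrative stand-in for table A3.1: primary opcode (bits 31-26) -> class.
OPCODE_CLASS = {
    0x23: "LOAD",    # lw
    0x2B: "STORE",   # sw
    0x09: "ALU",     # addiu
    0x04: "BRANCH",  # beq
    0x02: "JUMP",    # j
}

def class_distribution(words):
    """Tally instruction classes in a list of 32-bit MIPS32 instruction words."""
    tally = Counter()
    for word in words:               # addresses advance by 4 per instruction
        op = (word >> 26) & 0x3F     # extract the primary opcode field
        tally[OPCODE_CLASS.get(op, "OTHER")] += 1
    return tally

# lw $t0,0($s0); addiu $t0,$t0,1; sw $t0,0($s0)
print(class_distribution([0x8E080000, 0x25080001, 0xAE080000]))
```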
A1.2.1.2 Instruction distribution
The following are the steps involved in instruction distribution
identification:
1) Scan the instruction
2) Using table A3.1 identify the instruction name. Maintain a count
for each instruction.
3) Increment the instruction address by 4 and repeat steps (1) and
(2) for all the instructions.
A1.2.1.3 MIPS Code redundant 0’s Distribution
The following steps are used to find the MIPS code redundant 0’s
distribution:
1) Scan the instruction
2) Identify the instruction as per A1.2.1.2. Maintain a count for
each instruction. Identify the RZ value of the instruction using
table 4.4.
3) Increment the instruction address by 4 and repeat steps (1) and
(2) until reaching the end of the program.
4) Multiply the count of each instruction by corresponding RZ
value and add the values obtained for all instructions to get the
TRZ value.
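The TRZ computation of steps (1)-(4) can be sketched as follows. The per-instruction RZ values shown are derived from the sample report in Appendix 2 (e.g. 1986 addu instructions contribute 9930 redundant-zero bits, i.e. 5 bits each); the dictionary is only a small excerpt of table 4.4:

```python
# Redundant-zero bits per static instance, derived from the SUSAN report
# (small excerpt of table 4.4; the full table covers every instruction).
RZ_BITS = {"addu": 5, "subu": 5, "nop": 26, "jr": 16, "lui": 5}

def total_redundant_zeros(instruction_counts):
    """TRZ in bytes: sum of count * per-instruction RZ bits, then bits -> bytes."""
    bits = sum(RZ_BITS.get(name, 0) * count
               for name, count in instruction_counts.items())
    return bits // 8

print(total_redundant_zeros({"addu": 1986, "nop": 608, "jr": 47}))
```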
A1.2.1.4 Branch instruction distribution
The following steps are used to find the branch instruction
distribution:
1) Scan the instruction
2) Identify the type of each instruction using table A3.1. If
instruction type is branch, then go to step 3. Otherwise go to
step 4.
3) Identify the instruction and maintain a count for each branch
instruction.
4) Increment the instruction address by 4 and repeat steps (1) and
(2) for all the instructions.
A1.2.1.5 WASTIO Calculation
The algorithm in Figure. A1.2 is used to compute the WASTIO
percentage
Input : Object code of program
Output : WASTIO Percentage
Algorithm
for each instruction in the program
if instruction format == I || instruction format == O then
Read the value of the immediate/offset field of the instruction
if all zeros in both LSB and MSB then
a=a+1
else if all zeros in LSB then
b=b+1
else if all zeros in MSB then
c=c+1
else
do nothing
fi
fi
end
WASTIO=2a+b+c
WASTIO Percentage = (WASTIO/Object code size) * 100
Figure. A1.2: Algorithm for WASTIO calculation
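The WASTIO algorithm translates directly into Python. The I/O-format test below is an illustrative stand-in covering only a handful of opcodes, not the full format table used by MIDACC:

```python
def has_immediate_or_offset(word):
    """Illustrative stand-in: treat addiu/lw/sw opcodes as I/O format."""
    return (word >> 26) & 0x3F in (0x09, 0x23, 0x2B)

def wastio_percentage(words):
    """WASTIO: bytes wasted in 16-bit immediate/offset fields whose
    MSB byte and/or LSB byte is all zeros, as a % of object code size."""
    a = b = c = 0                    # both bytes zero / LSB zero / MSB zero
    for word in words:
        if not has_immediate_or_offset(word):
            continue
        field = word & 0xFFFF        # 16-bit immediate/offset field
        msb, lsb = field >> 8, field & 0xFF
        if msb == 0 and lsb == 0:
            a += 1
        elif lsb == 0:
            b += 1
        elif msb == 0:
            c += 1
    wasted = 2 * a + b + c                      # wasted bytes
    return 100.0 * wasted / (4 * len(words))    # 4 bytes per instruction

print(wastio_percentage([0x8E080000, 0x25080001, 0x25080100, 0x00000000]))
```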
A1.2.1.6 Population of FTFI
The algorithm in Figure. A1.3 is used to compute the population of
FTFI.
Input : Object code of program
Output : Population of FTFI
Algorithm
addu_count=0
addiu_count=0
lw_count=0
sw_count=0
for each instruction in the program
Read the value OP and OPX field of the instruction
if OP==000000 && OPX==100001 then
addu_count=addu_count+1
else if OP==001001 then
addiu_count=addiu_count+1
else
do nothing
fi
Check the hex value of MSB of the instruction
if value==0x8C || value==0x8D || value==0x8E ||
value==0x8F then
lw_count=lw_count+1
else if value==0xAC || value==0xAD || value==0xAE ||
value==0xAF then
sw_count=sw_count+1
else
do nothing
fi
end
FTFI=addu_count+addiu_count+lw_count+sw_count
Figure. A1.3: Algorithm for Population of FTFI
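The FTFI count uses the same OP/OPX and MSB-byte tests as the algorithm above; a Python sketch:

```python
def ftfi_population(words):
    """Count the frequently used top-four instructions: addu, addiu, lw, sw."""
    addu = addiu = lw = sw = 0
    for word in words:
        op = (word >> 26) & 0x3F       # OP field: bits 31-26
        opx = word & 0x3F              # OPX (function) field: bits 5-0
        if op == 0b000000 and opx == 0b100001:
            addu += 1
        elif op == 0b001001:
            addiu += 1
        msb = (word >> 24) & 0xFF      # most significant byte
        if msb in (0x8C, 0x8D, 0x8E, 0x8F):
            lw += 1
        elif msb in (0xAC, 0xAD, 0xAE, 0xAF):
            sw += 1
    return addu + addiu + lw + sw

# lw; addiu; sw; addu $t0,$t0,$t1
print(ftfi_population([0x8E080000, 0x25080001, 0xAE080000, 0x01094021]))
```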
A1.2.1.7 Registers usage behaviour
The following are the steps involved in finding the registers usage
behaviour:
1) Scan the instruction
2) Identify the instruction type and format using Table A3.1
3) If instruction type is ALU and format is R, then go to step 4. If
the instruction type is ALU and format is I, then go to step 5.
Otherwise go to step 6.
4) For R-format instructions, check the MSB of rs,rt and rd fields
for each combination and maintain a count of each instruction
for each pattern.
5) For I-format instructions, check the MSB of rs and rt fields for
each combination and maintain a count of each instruction for
each pattern
6) Increment the instruction address by 4 and repeat steps (1) and
(2) for all the instructions.
A1.2.1.8 Shift length usage
The algorithm in Figure. A1.4 is used to compute the shift length
usage.
Input : Object code of program
Output : Percentage of shift amount between 16-31
Algorithm
for each instruction in the program
Read the value of OP and OPX field
if OP==000000 && OPX==000000 then
Read the value of MSB of sa field
if MSB==0 then
sll_count_zero=sll_count_zero+1
else
sll_count_one=sll_count_one+1
fi
else if OP==000000 && OPX==000011 then
Read the value of MSB of sa field
if MSB==0 then
sra_count_zero=sra_count_zero+1
else
sra_count_one=sra_count_one+1
fi
else if OP==000000 && OPX==000010 then
Read the value of MSB of sa field
if MSB==0 then
srl_count_zero=srl_count_zero+1
else
srl_count_one=srl_count_one+1
fi
else
do nothing
fi
end
Total usage of shift amount = sll_count_zero + sra_count_zero +
srl_count_zero + sll_count_one + sra_count_one + srl_count_one
Shift amount between 16 and 31 = sll_count_one + sra_count_one
+ srl_count_one
Percentage Shift amount between 16 and 31
=100* ((Shift amount between 16 and 31) / (Total usage of shift
amount))
Figure. A1.4: Algorithm for Shift Length usage computation
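A Python sketch of the shift-length scan: the percentage is the share of sll/srl/sra instructions whose 5-bit sa field has its MSB set, i.e. a shift amount between 16 and 31:

```python
def shift_amount_16_31_percentage(words):
    """Percentage of sll/srl/sra instructions with shift amount 16-31."""
    low = high = 0
    for word in words:
        op = (word >> 26) & 0x3F
        opx = word & 0x3F
        if op == 0 and opx in (0b000000, 0b000010, 0b000011):  # sll, srl, sra
            sa_msb = (word >> 10) & 1    # sa field is bits 10-6; MSB is bit 10
            if sa_msb:
                high += 1
            else:
                low += 1
    total = low + high
    return 100.0 * high / total if total else 0.0

# sll $t0,$t1,16 and sll $t0,$t1,2
print(shift_amount_16_31_percentage([0x00094400, 0x00094080]))
```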
A1.2.1.9 Immediate field usage pattern
The following are the steps involved in finding the immediate field
usage pattern:
1) Scan the instruction
2) Identify the instruction type and format using Table A3.1
3) If instruction type is ALU and format is I, then go to step 4.
Otherwise go to step 5.
4) Check the immediate fields for all combinations and maintain a
count of each instruction for each combination.
5) Increment the instruction address by 4 and repeat steps (1) and
(2) for all the instructions.
A1.2.1.10 Offset field usage pattern
The following are the steps involved in finding the offset field usage
pattern:
1) Scan the instruction
2) Identify the instruction format using Table A3.1
3) If the instruction format belongs to offset, then go to step 4.
Otherwise go to step 5.
4) Check the offset fields for all combinations and maintain a
count of each instruction for each combination.
5) Increment the instruction address by 4 and repeat steps (1) and
(2) for all the instructions.
A1.2.2 MICC Internals
A1.2.2.1 HIE1 code conversion
Figure. A1.5 shows the algorithm for HIE1 code conversion.
Input : MIPS32 instruction
Output : HIE1 instruction
Algorithm
for all the instructions
Retain opcode of all instruction as such
Identify HIE1 instruction type using table A4.1
if(HIE1 type==H || HIE1 type==I)
Retain instruction as such
Go to the next instruction
else if(HIE1 type==A)
Delete all other existing fields
Insert 2-bit iid field with value as shown in table A4.1
else if(HIE1 type==B)
Insert 1-bit iid field with value as shown in table A4.1
Remove 1-bit from rt field
Retain rd field as such
else if(HIE1 type==C)
Remove 1-bit from rd/rs field
Retain function bits as such
else if(HIE1 type==D)
Remove 1-bit each from rs, rt and rd field
Retain function bits as such
else if(HIE1 type==E)
Remove 1-bit each from rt, rd and sa field
Retain function bits as such
else if(HIE1 type==F)
Remove 1-bit each from rs and rt field
Delete 6 unused 0’s
Retain function bits as such
else if(HIE1 type==G)
Insert 2-bit hl field using hl HIE1 MIPS encoding in table
4.5
fi
end
Figure. A1.5 : HIE1 code conversion algorithm
A1.2.2.2 HIE2 code conversion
Figure. A1.6 shows the algorithm for HIE2 code conversion.
Input : MIPS32 instruction
Output : HIE2 instruction
Algorithm
for all the instructions
Identify HIE2 instruction type using table A5.1
if(HIE2 type==H || HIE2 type==I)
Retain instruction as such
Go to the next instruction
fi
if(HIE2 type==G1 || HIE2 type==G2 || HIE2 type==G3)
insert it field with it=0
if(HIE2 type==G1)
insert 1-bit hl field using hl encoding in table A5.2
else if(HIE2 type==G2)
insert 2-bit hl field using hl encoding in table A5.3
else
insert 2-bit hl field using hl encoding in table A5.4
fi
else
insert it field with it=1
fi
Replace 6-bit MIPS32 opcode with 5-bit HIE2 opcode using table
A5.1
if(HIE2 type==A || HIE2 type==E || HIE2 type==G3)
insert 2-bit iid field with value as shown in table A5.1
fi
if(HIE2 type==C || HIE2 type==D1 || HIE2 type==D2)
insert 3-bit iid field with value as shown in table A5.1
fi
end
Figure. A1.6 : HIE2 code conversion algorithm
A1.2.2.3 RMA Code Conversion
The Figure. A1.7 depicts the overview of RMA code conversion
process for the sequence “load instruction followed by ALU type
instruction”. The Figure. A1.8 depicts the overview of RMA code conversion
process for the sequence “ALU type instruction followed by Store
instruction”.
Figure. A1.7: Overview of RMA code conversion for load sequence
[Flowchart summary: the opcode of the current instruction is tested; its
MSB is converted from hex format to decimal format and shifted right
by 2, and the test MSB == 23 identifies a load instruction. The opcode
of the next instruction is then tested; if it is ALU type and the load
qualifies, a new RMA instruction is created, all subsequent instruction
addresses are decremented by 4, and the RMA table and Step1-Jump
address table are created. Otherwise the instruction address is
incremented by 4 and the scan continues.]
Figure. A1.8: Overview of RMA code conversion for store sequence
[Flowchart summary: the opcode of the current instruction is tested to
check whether it is ALU type. The opcode of the next instruction is then
tested; its MSB is converted from hex format to decimal format and
shifted right by 2, and the test MSB == 43 identifies a store instruction.
If the store qualifies, a new RMA instruction is created, all subsequent
instruction addresses are decremented by 4, and the RMA table and
Step1-Jump address table are created. Otherwise the instruction address
is incremented by 4 and the scan continues.]
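Both flowcharts detect an adjacent instruction pair — a load followed by an ALU instruction, or an ALU instruction followed by a store — and fuse it into one RMA instruction, saving 4 bytes per pair. A simplified Python sketch of the pair detection only (the qualifying-load/store checks and the RMA table and Step1-Jump address table construction are omitted):

```python
def count_rma_pairs(words):
    """Count adjacent (lw, ALU) and (ALU, sw) pairs that could fuse
    into one RMA instruction, saving 4 bytes each (simplified)."""
    def kind(word):
        op = (word >> 26) & 0x3F
        if op == 0x23:
            return "load"              # lw
        if op == 0x2B:
            return "store"             # sw
        if op == 0x00:
            return "alu"               # SPECIAL: R-format ALU group
        return "other"

    pairs = 0
    i = 0
    while i + 1 < len(words):
        k0, k1 = kind(words[i]), kind(words[i + 1])
        if (k0, k1) in {("load", "alu"), ("alu", "store")}:
            pairs += 1
            i += 2                     # the fused pair consumes both slots
        else:
            i += 1
    return pairs

# lw; addu; sw -> the (lw, addu) pair fuses first
print(count_rma_pairs([0x8E080000, 0x01094021, 0xAE080000]))
```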
A1.2.2.4 RMA+HIE1 code conversion
Figure. A1.9 depicts the overview of the RMA+HIE1 code conversion
process: the RMA code conversion is performed first, followed by the
HIE1 code conversion.
Figure. A1.9: RMA+HIE1 code conversion
A1.2.2.5 RMA+HIE2 code conversion
Figure. A1.10 depicts the overview of the RMA+HIE2 code conversion
process: the RMA code conversion is performed first, followed by the
HIE2 code conversion.
Figure. A1.10: RMA+HIE2 code conversion
A1.3 MIDACC EXTENDER
The MIDACC Extender estimates the scope for the following three
requirements: use of compound instructions for D1 and E type instructions
in HIE2 code, conversion of add and addu instructions having the same
register for both source operands into two-address instructions, and use of
two composite instructions: loadmultiple and storemultiple.
APPENDIX 2
MIDACC USER GUIDE
A2.1 INTRODUCTION
This user guide briefly describes the features of MIDACC, a
custom-built code analyser cum code converter tool suite, and explains how
it can be used. MIDACC’s code analyser is named MIDA and the code
converter, MICC. MIDACC is a Windows-based application designed to run
on Microsoft Windows XP and above with .NET Framework 3.5 and above.
Use of the MIDACC extender is covered on the MIDACC website.
A2.2 INSTALLING MIDACC
Visit the following url
https://midacc.wordpress.com/
and download the zip file, MIDACC Suite.zip, to your system/PC. Unzip the
file and you should see the folder on your system/PC as shown in
Figure. A2.1.
Figure. A2.1: Snapshot of MIDACC Suite installation folder
Right-click the “MIDACC Suite Windows Installer” file and select
“Install”. You will see the welcome screen as shown in Figure. A2.2.
Figure. A2.2: Snapshot of MIDACC Suite welcome screen
Click “Next” to proceed to the next screen as shown in Figure. A2.3.
Select the installation directory where MIDACC has to be installed. The
default directory is “C:\ProgramFiles\MIDACC Suite”. Choose the user for
whom MIDACC suite has to be installed.
Figure. A2.3: Snapshot of MIDACC installation screen
Click “Next” to proceed to the installation process shown in
Figure. A2.4.
Figure. A2.4: Snapshot of MIDACC installation process
Click “Next” to start the installation. You will see a screen as shown in
Figure. A2.5 showing the installation status.
Figure. A2.5: Snapshot of MIDACC installation status
Figure. A2.6: Snapshot of MIDACC installation completion
You have successfully installed the MIDACC suite after seeing a screen as
shown in Figure. A2.6. Click “Close” to complete the installation. When
properly installed, MIDACC can be launched by clicking the “MIDACC suite”
icon on the PC desktop or selecting it from the “Start Menu” as shown in
Figure. A2.7.
Figure. A2.7: Snapshot of MIDACC icon in desktop and start menu
When launched, you will see two tabs across the top of the MIDACC
suite as shown in Figure. A2.8.
Figure. A2.8: Snapshot of MIDACC Suite tool
A2.3 INPUT FORMAT REQUIRED BY MIDACC
The object code is obtained from the C program by a cross-compilation
process (for details, refer to the cross-compilation procedure in section A2.6)
using a cross-compiler such as the Sourcery CodeBench tool.
Figure. A2.9: Snapshot of assembly code of SUSAN
Figure. A2.9 shows the snapshot of the object code of the SUSAN
program obtained by the cross-compilation process. The object code
obtained is then manually processed to get the input format accepted by the
MIDACC suite as shown in Figure. A2.10.
Figure. A2.10: Snapshot of input format accepted by MIDACC
A2.4 USING MIDACC
A2.4.1 MIDA Tab
When the program is launched, select the “MIDA” tab on the top of the
MIDACC suite. Figure. A2.11 shows the snapshot of the MIDA tab. It profiles
the object code and provides reports such as the static instruction
frequencies and the total amount of underutilization of the offset and
immediate fields in the object code.
Figure. A2.11: Snapshot of MIDA Tab
Click on the “Browse” button to select the location of the input object
code file. Click on the desired file and click on the “Open” button. To start the
code analysis process, click on the “Perform Code Analysis” button. Once
the process is completed you will receive a message as shown in
Figure. A2.12 and a message in the status bar notifying the location of the
reports generated by the code analysis process.
Figure. A2.12: Snapshot of code analysis process using MIDA Tab
The various reports generated by the code analysis process are
A. Instruction class distribution
B. Instruction distribution
C. MIPS code redundant 0’s distribution
D. Branch Instruction distribution
E. WASTIO calculation
F. Population of frequently used top four instructions (FTFI)
G. Register usage behaviour
H. Shift length usage
I. Immediate field usage pattern
J. Offset field usage pattern
A2.4.2 MICC Tab
The MICC tab converts the object code from MIPS ISA to HIE/RMA ISA and
measures the code size savings for the application. Figure. A2.13 shows the
snapshot of MICC Tab of MIDACC suite. The MICC tab has five code
conversion functions:
HIE1 Conversion
HIE2 Conversion
RMA Conversion
RMA + HIE1 Conversion
RMA + HIE2 Conversion
Figure. A2.13: Snapshot of MICC Tab
Click on the “Browse” button to select the location of the input object
code file. Click on the desired file and click on the “Open” button. To start the
code conversion process click on any one of the five buttons as shown in the
above figure. Once the process is completed you will receive a message as
shown in the following figures and a message in the status bar notifying the
location of the reports generated by the respective code conversion process.
A2.4.2.1 HIE1 code conversion
Figure. A2.14 shows the snapshot of the HIE1 code conversion
process. With HIE1 code conversion the reports generated are:
A. HIE1 Code Redundant 0’s distribution
B. Code size summary
C. Percentage Code Reduction (PCR)
Figure. A2.14: Snapshot of HIE1 Code conversion process
A2.4.2.2 HIE2 code conversion
Figure. A2.15 shows the snapshot of the HIE2 code conversion
process. With HIE2 code conversion the reports generated are:
A. HIE2 Code Redundant 0’s distribution
B. Code size summary
C. Percentage Code Reduction (PCR)
Figure. A2.15: Snapshot of HIE2 Code conversion process
A2.4.2.3 RMA Code conversion
Figure. A2.16 shows the snapshot of the RMA code conversion
process. With RMA code conversion the reports generated are:
A. RMA scope analysis
B. Code size summary
C. Percentage Code Reduction (PCR)
Figure. A2.16: Snapshot of RMA Code conversion process
A2.4.2.4 RMA+HIE1 code conversion
Figure. A2.17 shows the snapshot of the RMA+HIE1 code conversion
process. With RMA+HIE1 code conversion the reports generated are:
A. RMA scope analysis
B. Code size summary
C. Percentage Code Reduction (PCR)
Figure. A2.17: Snapshot of RMA+HIE1 Code conversion process
A2.4.2.5 RMA+HIE2 code conversion
Figure. A2.18 shows the snapshot of the RMA+HIE2 code conversion
process. With RMA+HIE2 code conversion the reports generated are:
A. RMA scope analysis
B. Code size summary
C. Percentage Code Reduction (PCR)
Figure. A2.18: Snapshot of RMA+HIE2 Code conversion process
A2.5 SAMPLE OUTPUT OBTAINED USING MIDACC
The following section gives the sample output generated using the
MIDACC suite for the SUSAN program in the MiBench benchmark suite.
A2.5.1 Code Analysis Report
A. Instruction class distribution: Program: susan
--------------------------------------------------------------------------------
Instruction type Count Percentage
--------------------------------------------------------------------------------
LOAD 4924 38.62
STORE 1495 11.725
ALU 4097 32.133
INTERRUPT 608 4.769
COMPARE 276 2.165
CONMANIP 138 1.082
BRANCH 361 2.831
JUMP 279 2.188
DATA MOVE 139 1.09
REF 0 0
--------------------------------------------------------------------------------
TOTAL 12317
--------------------------------------------------------------------------------
B. Instruction Distribution: Program: susan
----------------------------------------------------------------------------------------------------
Instruction Count % Cumulative Cumulative Type
Count %
----------------------------------------------------------------------------------------------------
LW 3883 30.455 3883 30.455 Load, O
ADDU 1986 15.576 5869 46.031 ALU, R
SW 1310 10.275 7179 56.306 Store, O
ADDIU 1129 8.855 8308 65.161 ALU, I
LBU 972 7.624 9280 72.785 Load, O
NOP 608 4.769 9888 77.554 Interrupt
SUBU 458 3.592 10346 81.146 ALU, R
SLL 423 3.318 10769 84.464 ALU, R
SLT 200 1.569 10969 86.033 Compare, R
BEQ 171 1.341 11140 87.374 Branch, O
SB 166 1.302 11306 88.676 Store, O
BNE 154 1.208 11460 89.884 Branch, O
LUI 138 1.082 11598 90.966 CONMANIP, I
J 115 0.902 11713 91.868 Jump, T
JAL 112 0.878 11825 92.746 Jump, T
MTCZ 105 0.824 11930 93.57 Data move, R - O
LWCZ 65 0.51 11995 94.08 Load, O
SLTIU 52 0.408 12047 94.488 Compare, I
JR 47 0.369 12094 94.857 Jump, R
ANDI 40 0.314 12134 95.171 ALU, I
BCZT 23 0.18 12157 95.351 Branch, O
SLTI 22 0.173 12179 95.524 Compare, I
SWCZ 19 0.149 12198 95.673 Store, O
MFCZ 17 0.133 12215 95.806 DM, R - O
OR 16 0.125 12231 95.931 ALU, R
ORI 16 0.125 12247 96.056 ALU, I
AND 11 0.086 12258 96.142 ALU, R
MFHI 10 0.078 12268 96.22 DM, R
BLEZ 9 0.071 12277 96.291 Branch, O
XOR 8 0.063 12285 96.354 ALU, R
DIV 7 0.055 12292 96.409 ALU, R
MFLO 7 0.055 12299 96.464 DM, R
JALR 5 0.039 12304 96.503 Jump, R
BGEZ 4 0.031 12308 96.534 Branch, O
LB 4 0.031 12312 96.565 Load, O
MULT 3 0.024 12315 96.589 ALU, R
SLTU 2 0.016 12317 96.605 Compare, R
ADD 0 0 12317 96.605 ALU, R
ADDI 0 0 12317 96.605 ALU, I
DIVU 0 0 12317 96.605 ALU, R
MULTU 0 0 12317 96.605 ALU, R
NOR 0 0 12317 96.605 ALU, R
SLLV 0 0 12317 96.605 ALU, R
SRA 0 0 12317 96.605 ALU, R
SRAV 0 0 12317 96.605 ALU, R
SRL 0 0 12317 96.605 ALU, R
SRLV 0 0 12317 96.605 ALU, R
SUB 0 0 12317 96.605 ALU, R
XORI 0 0 12317 96.605 ALU, I
BCZF 0 0 12317 96.605 Branch, O
BGEZAL 0 0 12317 96.605 Branch, O
BGTZ 0 0 12317 96.605 Branch, O
BLTZAL 0 0 12317 96.605 Branch, O
BLTZ 0 0 12317 96.605 Branch, O
LH 0 0 12317 96.605 Load, O
LHU 0 0 12317 96.605 Load, O
LWL 0 0 12317 96.605 Load, O
LWR 0 0 12317 96.605 Load, O
SH 0 0 12317 96.605 Store, O
SWL 0 0 12317 96.605 Store, O
SWR 0 0 12317 96.605 Store, O
MTHI 0 0 12317 96.605 DM, R
MTLO 0 0 12317 96.605 DM, R
SYSCALL 0 0 12317 96.605 Interrupt, R - O
BREAK 0 0 12317 96.605 Interrupt, R - O
REF 0 0 12317 96.605
----------------------------------------------------------------------------------------------------
Total 12317
----------------------------------------------------------------------------------------------------
C. MIPS Code Redundant 0's Distribution: Program: susan
------------------------------------------------------------------------------------------------------
Instruction Count RZ(bits)
------------------------------------------------------------------------------------------------------
ADDU 1986 9930
NOP 608 15808
SUBU 458 2290
SLT 200 1000
LUI 138 690
MTCZ 105 1155
JR 47 752
BCZT 23 92
MFCZ 17 187
OR 16 80
AND 11 55
MFHI 10 150
BLEZ 9 45
XOR 8 40
DIV 7 70
MFLO 7 105
JALR 5 50
BGEZ 4 16
MULT 3 30
SLTU 2 10
ADD 0 0
DIVU 0 0
MULTU 0 0
NOR 0 0
SLLV 0 0
SRAV 0 0
SRLV 0 0
SUB 0 0
BCZF 0 0
BGTZ 0 0
BLTZ 0 0
MTHI 0 0
MTLO 0 0
SYSCALL 0 0
REF 0 0
------------------------------------------------------------------------------------------------------
Total : 32555 bits
TRZ : 4069 bytes
Percentage of TRZ : 7.979
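The TRZ summary above follows from the redundant-zero bit total and the MIPS32 code size (51000 bytes for susan, as reported in the code sizes summary). A minimal sketch of the arithmetic, assuming Python (not part of the thesis tools):

```python
def trz_summary(rz_bits, code_size_bytes):
    """Convert a redundant-zero bit total into bytes and into a
    percentage of the MIPS32 code size."""
    trz_bytes = rz_bits / 8
    trz_pct = round(trz_bytes / code_size_bytes * 100, 3)
    return trz_bytes, trz_pct

# susan: 32555 redundant-zero bits against a 51000-byte MIPS32 image
print(trz_summary(32555, 51000))  # (4069.375, 7.979)
```

The report truncates 4069.375 bytes to 4069.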
------------------------------------------------------------------------------------------------------
D. Branch Instruction Distribution: Program: susan
------------------------------------------------------------------------------------------------------
Instruction Count Percentage Cumulative Count Cumulative Percentage Type
------------------------------------------------------------------------------------------------------
BEQ 171 1.341 171 1.341
BNE 154 1.208 325 2.549
BCZT 23 0.18 348 2.729
BLEZ 9 0.071 357 2.8
BGEZ 4 0.031 361 2.831
BCZF 0 0 361 2.831
BGEZAL 0 0 361 2.831
BGTZ 0 0 361 2.831
BLTZAL 0 0 361 2.831
BLTZ 0 0 361 2.831
------------------------------------------------------------------------------------------------------
E. WASTIO Calculation Program: susan
------------------------------------------------------------------------------------------------------
Instruction Count a b c
------------------------------------------------------------------------------------------------------
LW 3838 185 1 3652
ADDU 1986 - - -
SW 1304 20 0 1284
LBU 971 739 0 232
ADDIU 691 1 4 686
NOP 608 - - -
SUBU 458 - - -
SLL 423 - - -
SLT 200 - - -
SB 165 39 0 126
LUI 135 1 5 129
MTCZ 105 - - -
BEQ 91 0 0 91
BNE 65 0 0 65
SLTIU 52 0 0 52
JR 47 - - -
LWCZ 44 0 0 44
ANDI 40 0 0 40
BCZT 23 0 0 23
SLTI 19 0 0 19
SWCZ 19 0 0 19
MFCZ 17 - - -
OR 16 - - -
AND 11 - - -
MFHI 10 - - -
XOR 8 - - -
DIV 7 - - -
MFLO 7 - - -
BLEZ 5 0 0 5
JALR 5 - - -
LB 4 2 0 2
MULT 3 - - -
BGEZ 3 0 0 3
SLTU 2 - - -
J 1 0 1 0
JAL 1 0 1 0
ADD 0 - - -
ADDI 0 0 0 0
DIVU 0 - - -
MULTU 0 - - -
NOR 0 - - -
ORI 0 0 0 0
SLLV 0 - - -
SRA 0 - - -
SRAV 0 - - -
SRL 0 - - -
SRLV 0 - - -
SUB 0 - - -
XORI 0 0 0 0
BCZF 0 0 0 0
BGEZAL 0 0 0 0
BGTZ 0 0 0 0
BLTZAL 0 0 0 0
BLTZ 0 0 0 0
LH 0 0 0 0
LHU 0 0 0 0
LWL 0 0 0 0
LWR 0 0 0 0
SH 0 0 0 0
SWL 0 0 0 0
SWR 0 0 0 0
MTHI 0 - - -
MTLO 0 - - -
SYSCALL 0 - - -
BREAK 0 - - -
REF 0 - - -
------------------------------------------------------------------------------------------------------
WASTIO 1974 12 6472
WASTIO Percentage 3.871 0.024 12.69
------------------------------------------------------------------------------------------------------
Total WASTIO Percentage = 16.585
F. Population of frequently used top four instructions (FTFI) Program: susan
------------------------------------------------------------------------------------------------------
Instruction Count
------------------------------------------------------------------------------------------------------
ADDU 1986
ADDIU 1129
LW 3883
SW 1310
------------------------------------------------------------------------------------------------------
Sum of FTFI 8308
Percentage of FTFI 67.451
------------------------------------------------------------------------------------------------------
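The FTFI percentage above can be reproduced from the counts in the table, using the total instruction count of 12317 from the instruction distribution. A minimal sketch, assuming Python:

```python
# Top-four instruction counts for susan, taken from the table above.
top4 = {"LW": 3883, "ADDU": 1986, "SW": 1310, "ADDIU": 1129}
TOTAL_COUNT = 12317  # total count from the instruction distribution table

ftfi = sum(top4.values())
ftfi_pct = round(ftfi / TOTAL_COUNT * 100, 3)
print(ftfi, ftfi_pct)  # 8308 67.451
```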
G. Registers Usage behaviour Program: susan
------------------------------------------------------------------------------------------------------
1. ALU - R Type Instructions (partial)
------------------------------------------------------------------------------------------------------
rs rt rd ADD ADDU AND DIV DIVU MULT MULTU NOR OR SLL SLLV
------------------------------------------------------------------------------------------------------
0 0 0 0 1925 10 7 0 3 0 0 16 420 0
0 0 1 0 5 0 0 0 0 0 0 0 0 0
0 1 0 0 11 0 0 0 0 0 0 0 0 0
0 1 1 0 0 0 0 0 0 0 0 0 3 0
1 0 0 0 5 0 0 0 0 0 0 0 0 0
1 0 1 0 37 1 0 0 0 0 0 0 0 0
1 1 0 0 0 0 0 0 0 0 0 0 0 0
1 1 1 0 3 0 0 0 0 0 0 0 0 0
------------------------------------------------------------------------------------------------------
Total 0 1986 11 7 0 3 0 0 16 423 0
A16 0 61 1 0 0 0 0 0 0 3 0
------------------------------------------------------------------------------------------------------
R Type Registers Access Summary
------------------------------------------------------------------------------------------------------
Total Access To Registers : 3131
Access To Registers 16 - 31 : 71
Percentage Access To Registers 16 - 31 : 2.268
------------------------------------------------------------------------------------------------------
2. ALU - I Type Instructions
------------------------------------------------------------------------------------------------------
rs rt ADDI ADDIU ANDI ORI XORI SLTI SLTIU
------------------------------------------------------------------------------------------------------
0 0 0 1024 40 16 0 22 52
0 1 0 21 0 0 0 0 0
1 0 0 22 0 0 0 0 0
1 1 0 62 0 0 0 0 0
------------------------------------------------------------------------------------------------------
Total 0 1129 40 16 0 22 52
A16 0 105 0 0 0 0 0
-----------------------------------------------------------------------------------------------------
I Type Registers Access Summary
------------------------------------------------------------------------------------------------------
Total Access To Registers : 1259
Access To Registers 16 - 31 : 105
Percentage Access To Registers 16 - 31 : 8.34
-----------------------------------------------------------------------------------------------------
H. Shift Length Usage Program: susan
-----------------------------------------------------------------------------------------------------
sa SLL SRA SRL
------------------------------------------------------------------------------------------------------
0 414 0 0
1 9 0 0
-----------------------------------------------------------------------------------------------------
Total 423 0 0
A16 9 0 0
------------------------------------------------------------------------------------------------------
Shift Amount Summary
-----------------------------------------------------------------------------------------------------
Total usage of shift amount = 423
Number of cases for shift amount between 16 - 31 = 9
Percentage of shift amount between 16 - 31 = 2.128
------------------------------------------------------------------------------------------------------
I. Immediate Field Usage pattern (partial) Program: susan
------------------------------------------------------------------------------------------------------
Instruction All 0's 01X 001X 0001X 00001X
------------------------------------------------------------------------------------------------------
ADDIU 1 3 4 0 1 1
ANDI 0 0 0 0 0 0
ORI 0 11 0 0 0 0
XORI 0 0 0 0 0 0
SLTI 0 0 0 0 0 0
SLTIU 0 0 0 0 0 0
------------------------------------------------------------------------------------------------------
Total 1 14 4 0 1 1
Percentage 0.121 1.697 0.485 0 0.121 0.121
------------------------------------------------------------------------------------------------------
J. Offset field Usage pattern (partial) Program: susan
---------------------------------------------------------------------------------------------------------------
Instruction All 0's 01X 001X 0001X 00001X
---------------------------------------------------------------------------------------------------------------
BCZT 0 0 0 0 0 0
BCZF 0 0 0 0 0 0
BEQ 0 0 0 0 0 44
BGEZ 0 0 0 0 0 0
BGEZAL 0 0 0 0 0 0
BGTZ 0 0 0 0 0 0
------------------------------------------------------------------------------------------------------
Total 985 4 9 0 0 55
Percentage 14.768 0.06 0.135 0 0 0.825
A2.5.2 HIE1 Code Conversion Report
A. HIE1 Code Redundant 0's Distribution: Program: susan
------------------------------------------------------------------------------------------------------
Instruction Count RZ(bits)
------------------------------------------------------------------------------------------------------
ADDU 1986 0
NOP 608 0
SUBU 458 0
SLT 200 0
LUI 138 0
MTCZ 105 0
JR 47 0
BCZT 23 0
MFCZ 17 0
OR 16 0
AND 11 0
MFHI 10 0
BLEZ 9 0
XOR 8 0
DIV 7 28
MFLO 7 0
JALR 5 20
BGEZ 4 0
MULT 3 12
SLTU 2 0
ADD 0 0
DIVU 0 0
MULTU 0 0
NOR 0 0
SLLV 0 0
SRAV 0 0
SRLV 0 0
SUB 0 0
BCZF 0 0
BGTZ 0 0
BLTZ 0 0
MTHI 0 0
MTLO 0 0
SYSCALL 0 0
REF 0 0
------------------------------------------------------------------------------------------------------
Total : 60 bits
TRZ : 7.5 bytes
Percentage of TRZ : 0.02
------------------------------------------------------------------------------------------------------
B. Code sizes summary Program: susan
------------------------------------------------------------------------------------------------------
1. MIPS32 Code size in bytes = 51000
2. HIE1-MIPS Code size in bytes = 37229
------------------------------------------------------------------------------------------------------
C. Percentage Code Reduction (PCR) Program: susan
---------------------------------------------------------------------------------------------------------------
HIE Code Size Percentage = 72.998
HIE PCR = 27.002
------------------------------------------------------------------------------------------------------
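The Percentage Code Reduction (PCR) figures above are derived from the two code sizes: the compressed size as a percentage of the MIPS32 baseline, and its complement. A minimal sketch, assuming Python:

```python
def pcr(mips32_bytes, compressed_bytes):
    """Return (code size percentage, percentage code reduction)
    for a compressed image against the MIPS32 baseline."""
    size_pct = compressed_bytes / mips32_bytes * 100
    return round(size_pct, 3), round(100 - size_pct, 3)

# HIE1-MIPS on susan: 37229 bytes against a 51000-byte baseline
print(pcr(51000, 37229))  # (72.998, 27.002)
```

The same function reproduces the HIE2, RMA and RMA+HIE results in the following sections.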
A2.5.3 HIE2 Code Conversion Report
A. HIE2 Code Redundant 0's Distribution: Program: susan
---------------------------------------------------------------------------------------------------------------
Instruction Count RZ(bits)
---------------------------------------------------------------------------------------------------------------
ADDU 1986 0
NOP 608 0
SUBU 458 0
SLT 200 0
LUI 138 0
MTCZ 105 0
JR 47 94
BCZT 23 0
MFCZ 17 0
OR 16 0
AND 11 0
MFHI 10 20
BLEZ 9 27
XOR 8 0
DIV 7 0
MFLO 7 14
JALR 5 0
BGEZ 4 12
MULT 3 0
SLTU 2 0
ADD 0 0
DIVU 0 0
MULTU 0 0
NOR 0 0
SLLV 0 0
SRAV 0 0
SRLV 0 0
SUB 0 0
BCZF 0 0
BGTZ 0 0
BLTZ 0 0
MTHI 0 0
MTLO 0 0
SYSCALL 0 0
REF 0 0
---------------------------------------------------------------------------------------------------------------
Total : 167 bits
TRZ : 20.875 bytes
Percentage of TRZ : 0.056
------------------------------------------------------------------------------------------------------
B. Code sizes summary Program: susan
------------------------------------------------------------------------------------------------------
1. MIPS32 Code size in bytes = 51000
2. HIE2-MIPS Code size in bytes = 37214
------------------------------------------------------------------------------------------------------
C. Percentage Code Reduction (PCR) Program: susan
------------------------------------------------------------------------------------------------------
HIE Code Size Percentage = 72.969
HIE PCR = 27.031
A2.5.4 RMA Code Conversion Report
A. RMA Scope Analysis Program: susan
------------------------------------------------------------------------------------------------------
1. No. of successful RMA loads = 2272
2. No. of Unsuccessful RMA loads = 1611
3. Total loads = 3883
4. Percentage of RMA load cases = 58.511
5. No. of successful RMA stores = 43
6. No. of Unsuccessful RMA stores = 1267
7. Total stores = 1310
8. Percentage of RMA store cases = 3.282
------------------------------------------------------------------------------------------------------
B. Code sizes summary Program: susan
------------------------------------------------------------------------------------------------------
1. MIPS32 Code size in Bytes = 51000
2. RMA-MIPS Code size in bytes = 41980
------------------------------------------------------------------------------------------------------
C. Percentage Code Reduction (PCR) Program: susan
------------------------------------------------------------------------------------------------------
RMA Code Size Percentage = 82.314
RMA PCR = 17.686
------------------------------------------------------------------------------------------------------
A2.5.5 RMA+HIE1 Code Conversion Report
A. RMA Scope Analysis Program: susan
------------------------------------------------------------------------------------------------------
1. No. of successful RMA loads = 2272
2. No. of Unsuccessful RMA loads = 1611
3. Total loads = 3883
4. Percentage of RMA load cases = 58.511
5. No. of successful RMA stores = 43
6. No. of Unsuccessful RMA stores = 1267
7. Total stores = 1310
8. Percentage of RMA store cases = 3.282
------------------------------------------------------------------------------------------------------
B. Code sizes summary Program: susan
------------------------------------------------------------------------------------------------------
1. MIPS32 Code size in Bytes = 51000
2. RMA-MIPS Code size in bytes = 41980
3. (RMA+HIE1)-MIPS Code size in bytes = 32503
------------------------------------------------------------------------------------------------------
C. Percentage Code Reduction (PCR) Program: susan
------------------------------------------------------------------------------------------------------
RMA Code Size Percentage = 82.314
RMA PCR = 17.686
RMA+HIE1 Code Size Percentage = 63.731
RMA+HIE1 PCR = 36.269
------------------------------------------------------------------------------------------------------
A2.5.6 RMA+HIE2 Code Conversion Report
A. RMA Scope Analysis Program: susan
------------------------------------------------------------------------------------------------------
1. No. of successful RMA loads = 2272
2. No. of Unsuccessful RMA loads = 1611
3. Total loads = 3883
4. Percentage of RMA load cases = 58.511
5. No. of successful RMA stores = 43
6. No. of Unsuccessful RMA stores = 1267
7. Total stores = 1310
8. Percentage of RMA store cases = 3.282
------------------------------------------------------------------------------------------------------
B. Code sizes summary Program: susan
------------------------------------------------------------------------------------------------------
1. MIPS32 Code size in Bytes = 51000
2. RMA-MIPS Code size in bytes = 41980
3. (RMA+HIE2)-MIPS Code size in bytes = 32488
C. Percentage Code Reduction (PCR) Program: susan
------------------------------------------------------------------------------------------------------
RMA Code Size Percentage = 82.314
RMA PCR = 17.686
RMA+HIE2 Code Size Percentage = 63.702
RMA+HIE2 PCR = 36.298
------------------------------------------------------------------------------------------------------
A2.6 CROSS COMPILATION PROCEDURE
Cross compilation is the process of building executable binaries on one processor architecture so that they run on a different target architecture. It is required whenever binary executables must be generated from source code written in a compiled language, such as C or C++.
A2.6.1 Using Sourcery Codebench for Cross Compilation
Sourcery CodeBench is a collection of cross-compiler tools, known as a toolchain, for several processor architectures, including ARM, PowerPC, MIPS, and Intel x86. Besides the compiler, it includes tools such as a linker, object-dump utilities and a library archiver. It is built specifically for embedded systems and produces optimized object code and executables for all MIPS processors.
A2.6.1.1 Building the C program
This section describes the process of generating object code from a C program using the Sourcery CodeBench tool. MiBench and MediaBench C programs are chosen to demonstrate the cross-compilation process. Compile the input C program using the command below to obtain the object code.
$ mips-linux-gnu-gcc <input-C-file-1-name> <input-C-file-2-name> -o <object-file-name>
For example, the following command is used to compile the SHA
benchmark. The option -o is used to specify the output object file name.
$ mips-linux-gnu-gcc sha.c sha_driver.c -o sha
A2.6.1.2 Obtaining the assembly code
Objdump is a program for displaying various kinds of information about one or more object files. It is used here as a disassembler, to view an executable in assembly form. Use the following command to disassemble the object code:
$ mips-linux-gnu-objdump -D <object-file-name>
For example, the following command is used to disassemble the SHA object code.
$ mips-linux-gnu-objdump -D sha
The option -D is used to disassemble the contents of all sections of
the program.
APPENDIX 3
MIPS32 INSTRUCTION IDENTIFICATION TABLE
Table A3.1 provides information for MIPS32 instruction identification. OP denotes the 6-bit major opcode in bits 31-26, and OPX denotes the 6-bit opcode extension in bits 5-0. This table is used by MIDACC.
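The field extraction described above can be sketched as follows. This is a minimal illustration in Python, not the MIDACC tool itself:

```python
def decode_fields(word):
    """Split a 32-bit MIPS32 instruction word into its OP and OPX fields."""
    op = (word >> 26) & 0x3F   # major opcode, bits 31-26
    opx = word & 0x3F          # opcode extension (funct), bits 5-0
    return op, opx

# addu $t0, $a0, $a1 encodes as 0x00854021: OP = 000000, OPX = 100001
op, opx = decode_fields(0x00854021)
print(format(op, "06b"), format(opx, "06b"))  # 000000 100001
```

For R-type instructions (OP = 000000) the OPX field selects the operation, matching the OP/OPX columns of Table A3.1.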
Table A3.1: MIPS32 Instruction Identification Table
(Abbreviations used: R-Register; I-Immediate; O-Offset; T-target address)
Sl. no.  MIPS32 Instruction  OP  OP byte pattern (Hexa)  OPX  OPX byte pattern (Hexa)  Type, Format
1 add 000000 00, 01, 02, 03 100000 20, 60,
A0, E0
ALU, R
2 addu ,, 00, 01, 02, 03 100001 21, 61,
A1, E1
ALU, R
3 addi 001000 20, 21, 22, 23 - - ALU, I
4 addiu 001001 24, 25, 26, 27 - - ALU, I
5 and 000000 00, 01, 02, 03 100100 24, 64,
A4, E4
ALU, R
6 andi 001100 30, 31, 32, 33 - - ALU, I
7 div 000000 00, 01, 02, 03 011010 1A, 5A,
9A, DA
ALU, R
8 divu ,, 00, 01, 02, 03 011011 1B, 5B,
9B, DB
,,
9 mult ,, 00, 01, 02, 03 011000 18, 58,
98, D8
,,
10 multu 000000 00, 01, 02, 03 011001 19, 59,
99, D9
,,
11 nor ,, 00, 01, 02, 03 100111 27, 67,
A7, E7
,,
12 or ,, 00, 01, 02, 03 100101 25, 65,
A5, E5
,,
13 ori 001101 34, 35, 36, 37 - - ALU, I
14 sll 000000 00, 01, 02, 03 000000 00, 40,
80, C0
ALU, R
15 sllv ,, 00, 01, 02, 03 000100 04, 44,
84, C4
,,
16 sra ,, 00, 01, 02, 03 000011 03, 43,
83, C3
,,
17 srav ,, 00, 01, 02, 03 000111 07, 47,
87, C7
,,
18 srl ,, 00, 01, 02, 03 000010 02, 42,
82, C2
,,
19 srlv ,, 00, 01, 02, 03 000110 06, 46,
86, C6
,,
20 sub ,, 00, 01, 02, 03 100010 22, 62,
A2, E2
,,
21 subu ,, 00, 01, 02, 03 100011 23, 63,
A3, E3
,,
22 xor ,, 00, 01, 02, 03 100110 26, 66,
A6, E6
,,
23 xori 001110 38, 39, 3A,
3B
- - ALU, I
24 lui 001111 3C, 3D, 3E,
3F
- - CONMANIP, I
25 slt 000000 00, 01, 02, 03 101010 2A, 6A,
AA, EA
Compare, R
26 sltu ,, 00, 01, 02, 03 101011 2B, 6B,
AB, EB
,,
27 slti 001010 28, 29, 2A,
2B
- - Compare, I
28 sltiu 001011 2C, 2D, 2E,
2F
- - ,,
29 bczt - 41, 45, 49,
4D; next byte:
01, 03, 05,
07, 09, 0B,
0D, 0F, 11,
13, 15, 17,
19, 1B, 1D,1F
- - Branch, O
30 bczf - 41, 45, 49,
4D: next byte:
00, 02, 04,
06, 08, 0A,
0C,0E, 10,
12, 14, 16,
18, 1A, 1C,
1E
- - Branch, O
31 beq 000100 10, 11, 12, 13 - - ,,
32 bgez 000001 04, 05, 06,
07; next byte:
01, 21, 41,
61, 81, A1,
C1, E1
- - ,,
33 bgezal ,, 04, 05, 06,
07; next byte:
11, 31, 51,
71, 91, B1,
D1, F1
- - ,,
34 bgtz 000111 1C, 1D, 1E,
1F; next byte:
00, 20, 40,
60, 80, A0,
C0, E0
- - ,,
35 blez 000110 18, 19, 1A,
1B
- - ,,
36 bltzal 000001 04, 05, 06,
07; next byte:
10, 30, 50,
70, 90, B0,
D0, F0
- - ,,
37 bltz ,, 04, 05, 06,
07; next byte:
00, 20, 40,
60, 80, A0,
C0, C0, E0
- - ,,
38 bne 000101 14, 15, 16, 17 - - ,,
39 j 000010 08, 09, 0A,
0B
- - Jump, T
40 jal 000011 0C, 0D, 0E,
0F
- - Jump, T
41 jalr 000000 00, 01, 02, 03 001001 09, 49,
89, C9
Jump, R
42 jr 000000 00, 01, 02, 03 001000 08, 48,
88, C8
,,
43 lb 100000 80, 81, 82, 83 - - Load, O
44 lbu 100100 90, 91, 92, 93 - - ,,
45 lh 100001 84, 85, 86, 87 - - ,,
46 lhu 100101 94, 95, 96, 97 - - ,,
47 lw 100011 8C, 8D, 8E,
8F
- - ,,
48 lwcz - C0, C1, C2,
C3, C4, C5,
C6, C7, C8,
C9, CA, CB,
CC, CD, CE,
CF
- - ,,
49 lwl 100010 88, 89, 8A,
8B
- - ,,
50 lwr 100011 98, 99, 9A,
9B
- - ,,
51 sb 101000 A0, A1, A2,
A3
- - Store, O
52 sh 101001 A4, A5, A6,
A7
- - ,,
53 sw 101011 AC, AD, AE,
AF
- - ,,
54 swcz - E0, E1, E2,
E3, E4, E5,
E6, E7, E8,
E9, EA, EB,
EC, ED, EE,
EF
- - ,,
55 swl 101010 A8, A9, AA,
AB
- - ,,
56 swr 101110 B8, B9, BA,
BB
- - ,,
57 mfhi 000000 00, 01, 02, 03 010000 10, 50,
90, D0
Data move, R
58 mflo ,, 00, 01, 02, 03 010010 12, 52,
92, D2
,,
59 mthi ,, 00, 01, 02, 03 010001 11, 51,
91, D1
,,
60 mtlo ,, 00, 01, 02, 03 010011 13, 53,
93, D3
,,
61 mfcz - 40, 44, 48,
4C, 50, 54,
58, 5C, 60,
64, 68, 6C,
70, 74, 78,
7C; next byte:
00 to 0F, 10
to 1F
- - Data move, R
– O
62 mtcz - 40, 44, 48,
4C, 50, 54,
58, 5C, 60,
64, 68, 6C,
70, 74, 78,
7C; next byte:
80 to 8F, 90
to 9F
- - ,,
63 syscall 000000 00, 01, 02, 03 001100 0C, 4C,
8C, CC
Interrupt, R-O
64 break ,, 00, 01, 02, 03 001101 0D, 4D,
8D, CD
,,
65 nop ,, All 0’s in all
32 bits
- 0’s -
66 rfe 010000 40, 41, 42, 43 100000 20, 60,
A0, E0
Interrupt
APPENDIX 4
HIE1-MIPS INSTRUCTION MAP
The HIE-1 MIPS instruction map is given in Table A4.1. The iid field
encoding is given in Table A4.2.
Table A4.1: HIE1-MIPS instruction map
Sl. no.  MIPS32 Instruction  Type, Format  HIE1 Type  HIE1 length (bits)  HIE1 OP  HIE1 OPX
1 add ALU, R D 24 000000 100000
2 addu ALU, R D 24 ,, 100001
3 addi ALU, I G 16/24/32 001000 -
4 addiu ALU, I G ,, 001001 -
5 and ALU, R D 24 000000 100100
6 andi ALU, I G 16/24/32 001100 -
7 div ALU, R F 24 000000 011010
8 divu ,, F ,, ,, 011011
9 mult ,, F 24 ,, 011000
10 multu ,, F 24 000000 011001
11 nor ,, D ,, ,, 100111
12 or ,, D ,, ,, 100101
13 ori ALU, I G 16/24/32 001101 -
14 sll ALU, R E 24 000000 000000
15 sllv ,, D ,, ,, 000100
16 sra ,, E ,, ,, 000011
17 srav ,, D ,, ,, 000111
18 srl ,, E ,, ,, 000010
19 srlv ,, D ,, ,, 000110
20 sub ,, D ,, ,, 100010
21 subu ,, D ,, ,, 100011
22 xor ,, D ,, ,, 100110
23 xori ALU, I G 16/24/32 001110 -
24 lui CONMANIP,
I
G 16/24/32 001111 -
25 slt Compare, R D 24 000000 101010
26 sltu ,, D ,, ,, 101011
27 slti Compare, I G 16/24/32 001010 -
28 sltiu ,, G 16/24/32 001011 -
29 bczt¹ Branch, O G 16/24/32 0100xx -
30 bczf¹ ,, G 16/24/32 0100xx -
31 beq ,, G 16/24/32 000100 -
32 bgez¹ ,, G 16/24/32 000001 -
33 bgezal¹ ,, G 16/24/32 ,, -
34 bgtz¹ ,, G 16/24/32 000111 -
35 blez ,, G 16/24/32 000110 -
36 bltzal ,, G 16/24/32 000001 -
37 bltz ,, G 16/24/32 ,, -
38 bne ,, G 16/24/32 000101 -
39 j Jump, T H 32 000010 -
40 jal ,, H 32 000011 -
41 jalr Jump, R F 24 000000 001001
42 jr ,, C 16 000000 001000
43 lb Load , O G 16/24/32 100000 -
44 lbu ,, G 16/24/32 100100 -
45 lh ,, G 16/24/32 100001 -
46 lhu ,, G 16/24/32 100101 -
47 lw ,, G 16/24/32 100011 -
48 lwcz¹ ,, G 16/24/32 - -
49 lwl ,, G 16/24/32 100010 -
50 lwr ,, G 16/24/32 100011 -
51 sb Store, O G 16/24/32 101000 -
52 sh ,, G 16/24/32 101001 -
53 sw ,, G 16/24/32 101011 -
54 swcz¹ ,, G 16/24/32 - -
55 swl ,, G 16/24/32 101010 -
56 swr ,, G 16/24/32 101110 -
57 mfhi Data move, R C 16 000000 010000
58 mflo ,, C 16 ,, 010010
59 mthi ,, C 16 ,, 010001
60 mtlo ,, C 16 ,, 010011
61 mfcz¹ Data move, R
– O
B; iid = 0 16 - -
62 mtcz¹ ,, B; iid = 1 24 - -
63 syscall Interrupt, R-O A ; iid = 01 8 011100 -
64 break ,, I 32 000000 001101
65 nop - A; iid = 10 8 011100 -
66 rfe Interrupt A; iid = 00 16 011100 -
Note-1:
Coprocessor-related instructions are not fully mapped to HIE1, since the mapping is intended for static simulation purposes only. Identifying certain instructions involves multiple match conditions. For example, for the bczt instruction, the first byte may be any one of four combinations: 41, 45, 49, 4D. In addition, the second byte has 16 combinations: 01, 03, 05, 07, 09, 0B, 0D, 0F, 11, 13, 15, 17, 19, 1B, 1D, 1F. Appendix 3 gives complete information.
Table A4.2: IID Field Encoding
Group IId Encoding
A iid Instruction
00 rfe
01 syscall
10 nop
11 -
B iid Instruction
0 mfcz
1 mtcz
APPENDIX 5
HIE2-MIPS INSTRUCTION MAP
A5.1 HIE2-MIPS INSTRUCTION MAP
The HIE-2 MIPS instruction map is given in Table A5.1.
Table A5.1: HIE2-MIPS INSTRUCTION MAP
(IT bit = 0 indicates presence of hl field)
Sl. no.  MIPS32 Instruction  Type, Format  HIE Type  HIE length (bits)  IT  HIE OP  iid
1 add ALU, R D1 24 1 00101 000
2 addu ALU, R ,, 24 ,, ,, 001
3 addi ALU, I G1 24/32 0 00001 -
4 addiu ALU, I ,, ,, ,, 00010 -
5 and ALU, R D1 24 1 00101 010
6 andi ALU, I G1 24/32 0 00011 -
7 div ALU, R F 16 1 01000 -
8 divu ,, ,, ,, ,, 01001 -
9 mult ,, ,, ,, ,, 01010 -
10 multu ,, ,, ,, ,, 01011 -
11 nor ,, D1 24 ,, 00101 011
12 or ,, ,, ,, ,, ,, 100
13 ori ALU, I G1 24/32 0 00100 -
14 sll ALU, R E 24 1 00111 00
15 sllv ,, D2 ,, ,, 00110 000
16 sra ,, E ,, ,, 00111 01
17 srav ,, D2 ,, ,, 00110 001
18 srl ,, E ,, ,, 00111 10
19 srlv ,, D2 ,, ,, 00110 010
20 sub ,, D1 ,, ,, 00101 101
21 subu ,, D1 ,, 1 00101 110
22 xor ,, ,, ,, ,, ,, 111
23 xori ALU, I G1 24/32 0 00101 -
24 lui CONMA-
NIP, I
,, 24/32 ,, 00110 -
25 slt Compare, R D2 24 1 00110 011
26 sltu ,, ,, ,, ,, ,, 100
27 slti Compare, I G1 24/32 0 00111 -
28 sltiu ,, ,, 24/32 ,, 01000 -
29 bczt Branch, O G2 8/16/ 24 ,, 11011 -
30 bczf ,, G2 8/16/ 24 ,, 11100 -
31 beq ,, G1 24/32 ,, 01001 -
32 bgez ,, G3 16/24/32 ,, 11101 00
33 bgezal ,, G1 24/32 ,, 01010 -
34 bgtz ,, G3 16/24/32 ,, 11101 01
35 blez ,, G3 16/24/32 ,, ,, 10
36 bltzal ,, G1 24/32 ,, 01011 -
37 bltz ,, G3 16/24/32 0 11101 11
38 bne ,, G1 24/32 ,, 01100 -
39 j Jump, T H 32 1 01101 -
40 jal ,, H 32 ,, 01110 -
41 jalr Jump, R F 16 ,, 01100 -
42 jr ,, C 16 ,, 00100 000
43 lb Load , O G1 24/32 0 01101 -
44 lbu ,, G1 24/32 ,, 01110 -
45 lh ,, G1 24/32 ,, 01111 -
46 lhu ,, G1 24/32 ,, 10000 -
47 lw ,, ,, 24/32 ,, 10001 -
48 lwcz ,, ,, 24/32 ,, 10010 -
49 lwl ,, ,, 24/32 ,, 10011 -
50 lwr ,, ,, 24/32 ,, 10100 -
51 sb Store, O ,, 24/32 ,, 10101 -
52 sh ,, ,, 24/32 ,, 10110 -
53 sw ,, ,, 24/32 ,, 10111 -
54 swcz ,, ,, 24/32 ,, 11000 -
55 swl ,, G1 24/32 0 11001 -
56 swr ,, ,, 24/32 ,, 11010 -
57 mfhi Data move,
R
C 16 1 00100 001
58 mflo ,, ,, 16 ,, ,, 010
59 mthi ,, ,, 16 ,, ,, 011
60 mtlo ,, ,, 16 ,, ,, 100
61 mfcz Data move,
R – O
B 16 ,, 00010 -
62 mtcz ,, ,, 24 ,, 00011 -
63 syscall Interrupt, R-
O
A 8 ,, 00001 00
64 break ,, I 32 ,, 01111 -
65 nop - A 8 ,, 00001 01
66 rfe Interrupt A 16 ,, 00001 10
Note-1:
Coprocessor-related instructions are not fully mapped to HIE2, since the mapping is intended for static simulation purposes only. Identifying certain instructions involves multiple match conditions. For example, for the bczt instruction, the first byte may be any one of four combinations: 41, 45, 49, 4D. In addition, the second byte has 16 combinations: 01, 03, 05, 07, 09, 0B, 0D, 0F, 11, 13, 15, 17, 19, 1B, 1D, 1F. Appendix 3 gives complete information.
A5.2 HYBRID LENGTH FIELDS ENCODING
The hl encodings for the hybrid immediate/offset lengths are as follows. Table A5.2 and Table A5.3 give the hl encoding for G1-type and G2-type instructions respectively.
Table A5.2: hl Encoding for G1 Type Instructions
Actual contents of immediate/offset (15 bits)                       hl bit  Immediate/offset (bits)  Instruction size (bits)  Encoding of immediate/offset field
All 0's                                                             0       7                        24                       7 zeros
Eight most significant bits are 0's; remaining seven bits non-zero  0       7                        24                       7 lsbs of actual contents
Eight most significant bits non-zero                                1       15                       32                       Actual contents
Table A5.3: hl Encoding for G2 Type Instructions
Actual contents of offset                                        hl bits  Offset (bits)  Instruction size (bits)  Encoding of offset field
All 0's                                                          00       0              8                        -
Most significant byte all 0's; least significant byte non-zero   01       8              16                       8 lsbs of actual contents
Least significant byte all 0's; most significant byte non-zero   10       8              16                       8 msbs of actual contents
Both bytes non-zero                                              11       16             24                       Actual 16-bit contents
The hl Encoding for G3 type instructions is given in Table A5.4.
Table A5.4: hl Encoding for G3 Type Instructions
Actual contents of offset                                        hl bits  Length of offset in HIE2 (bits)  Instruction size (bits)  Encoding of offset field
All 0's                                                          00       0                                16                       -
Most significant byte all 0's; least significant byte non-zero   01       8                                24                       8 lsbs of actual contents
Least significant byte all 0's; most significant byte non-zero   10       8                                24                       8 msbs of actual contents
Both bytes non-zero                                              11       16                               32                       Actual 16-bit contents
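The hl selection logic for a G2-type offset (Table A5.3) can be sketched as follows. This is a minimal illustration in Python, not the thesis encoder itself:

```python
def g2_hl_encode(offset16):
    """Choose the hl bits and the encoded offset field for a
    16-bit G2-type offset, per Table A5.3."""
    msb = (offset16 >> 8) & 0xFF
    lsb = offset16 & 0xFF
    if msb == 0 and lsb == 0:
        return 0b00, None        # offset field omitted; 8-bit instruction
    if msb == 0:
        return 0b01, lsb         # keep the 8 lsbs; 16-bit instruction
    if lsb == 0:
        return 0b10, msb         # keep the 8 msbs; 16-bit instruction
    return 0b11, offset16        # full 16 bits; 24-bit instruction
```

The G3 logic of Table A5.4 is identical except that every instruction size grows by 8 bits.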
TECHNICAL BIOGRAPHY
Mr. B. Govindarajalu (RRN: 1186221) was born on 3rd January 1949 in Athukkudi, Tamilnadu. He did his schooling at Board High School, Vaithiswarankoil, and secured 77% in the Higher Secondary Examination. He received a gold medal for securing the first rank in Tamil in the Pre-University Course of the University of Madras in 1967. He received his B.E. degree in Electronics and Communications Engineering from the National Institute of Technology, Trichy, University of Madras, in 1972, and his M.Tech. degree in Computer Science and Engineering from the Indian Institute of Technology, Bombay, in 1979. He has over 42 years of working experience, including thirty years of industrial experience. He has been employed with IIT Bombay; ORG Systems, Baroda; Infotech Limited, Chennai; Manipal Engineering College; Rajalakshmi Engineering College; Sree Ramanujar Engineering College; Dhanalakshmi College of Engineering; Sri Venkateswara College of Engineering; and Microcode, Chennai, of which he is the founder CEO. He has authored two books:
1. IBM PC AND CLONES: Hardware, Troubleshooting and
Maintenance
2. Computer Architecture and Organization: Design Principles
and Applications
He is currently pursuing his Ph.D. degree in Embedded Systems and RISC Processors in the Department of Computer Science and Engineering of B.S. Abdur Rahman University. His areas of interest include Computer Architecture, Embedded Systems and Computer Networking. He has published two papers in journals and authored twelve articles in a computer magazine. His e-mail ID is [email protected] and his contact number is 9884025129.