SCALABLE LOW ENERGY REGISTER FILE ARCHITECTURES FOR VLIW PROCESSORS
NEERAJ GOEL
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
INDIAN INSTITUTE OF TECHNOLOGY DELHI
AUGUST 2010
© Indian Institute of Technology Delhi (IITD), New Delhi, 2010. All rights reserved.
SCALABLE LOW ENERGY REGISTER FILE ARCHITECTURES FOR VLIW PROCESSORS
by Neeraj Goel
Department of Computer Science and Engineering
Submitted
in fulfillment of the requirements of the degree of
Doctor of Philosophy
to the
Indian Institute of Technology Delhi
August 2010
Certificate
This is to certify that the thesis titled “SCALABLE LOW ENERGY REGISTER FILE ARCHI-
TECTURES FOR VLIW PROCESSORS” being submitted by Neeraj Goel to the Indian Institute
of Technology Delhi, for the award of the degree of Doctor of Philosophy, is a record of bona fide
research work carried out by him under our supervision. In our opinion, the thesis has reached the
standards fulfilling the requirements of the regulations relating to the degree.
The results contained in this thesis have not been submitted to any other university or institute for
the award of any degree or diploma.
Anshul Kumar
Professor
Department of Computer Science and Engineering
Indian Institute of Technology Delhi, New Delhi 110 016
Preeti Ranjan Panda
Professor
Department of Computer Science and Engineering
Indian Institute of Technology Delhi, New Delhi 110 016
Acknowledgments
I would like to take this opportunity to thank all those who helped me in making my PhD dissertation
a success. First, I would like to thank my supervisors, Prof. Anshul Kumar and Dr. Preeti Ranjan
Panda; without their suggestions, technical guidance and constructive feedback, this thesis would
not have been in this shape. Moreover, they gave me a friendly environment and the freedom to work
in my own way, and gave time when it was required, which made my PhD experience a unique one.
I express my gratitude to Prof. M. Balakrishnan, who encouraged me to pursue a PhD and helped me
throughout my stay at IIT Delhi by providing day-to-day suggestions, feedback, and encouragement.
I would also like to thank Dr. Kolin Paul and Prof. Ranjan Bose, who gave useful feedback in my
SRC presentations.
I am grateful to my seniors, Anup Gangwar, Basant Dwivedi and Satyakiran Munaga, for their
encouragement and guidance during my PhD. I would like to express my gratitude to Sonali Chouhan
for her extended support and many informal discussions. Discussions with my PhD colleagues,
Aryabartta Sahu, Anant Vishnoi, Nagaraju Pothineni, Lava Bhargawa, BVN Silpa, G Krishniah and
Vikram Goyal, helped me at various points.
During my PhD, I got the opportunity to be closely involved in various B. Tech. and M. Tech.
projects. These projects helped in increasing the breadth of my understanding. Especially, I would
like to thank Manoj Gupta, Rakesh Nalluri, Devdutt, Ramakrishna, Monika Gupta, and Kiran
Chandramohan for working with me and sharing their thoughts.
I would like to thank the members of the lab staff, Vandana Ahulwalia (Philips VLSI lab) and
Somdutt Sharma (DHD lab), who made their support available in a number of ways.
I am indebted to my father, mother, and sisters for their endless love, immense patience and
moral support. The support of my wife, Deepika, during my thesis writing stage was very crucial.
Last but not least, I owe my deepest gratitude to the Almighty, who makes everything happen,
who brings ideas to one's mind, creates the environment to cherish those ideas, and motivates one
to implement them.
August 2010 Neeraj Goel
Abstract
Multiported register files (RFs) consume a significant fraction of the energy in VLIW processors.
Due to their large number of ports, they do not scale well with an increase in the number of function
units (FUs). We observe that the bandwidth provided by the RF is also not fully utilized by the
processor. Moreover, in most applications, many variables are short-lived, i.e., they are produced
and consumed within a short duration.
Based on these observations, we propose a two level register file architecture, in which the first
level consists of local buffers associated with each function unit (FU) and the second level is a
monolithic RF. We explore different architectural options for the local buffer based architecture.
As most accesses will be served by the first level, we propose to reduce the number of ports of the
second level RF by sharing them among FUs. However, port sharing among FUs may lead to access
conflicts and thus reduced performance. Further, port sharing may offset some of the energy savings
brought in by port reduction. To address these issues, our solution includes a carefully designed
RF-FU interconnection network which permits port sharing with minimum conflicts and energy
overheads. To minimize the performance loss due to conflicts and to maximize energy savings by
increasing accesses to the local buffers, we propose a novel scheduling and binding algorithm.
To estimate the effect of the number of ports on performance and energy, we developed analytical
models. With the help of our analytical models and a number of experiments, we established that the
proposed architecture leads to as much as 74% register file energy savings with not more than 5%
loss in performance for a 4 issue width processor. Experiments on different issue width processors
reveal that the proposed architecture is scalable in both performance and energy.
Contents
Acknowledgments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ix
Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xi
1 Introduction 1
1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Previously Proposed RF Architectures . . . . . . . . . . . . . . . . . . . . . . . . 4
1.2.1 Single Level and Monolithic RF Architecture . . . . . . . . . . . . . . . . 5
1.2.2 Single Level Multibanked RF Architecture . . . . . . . . . . . . . . . . . 5
1.2.3 Two Level RF Architectures . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.3 Proposed Solutions: Local Buffers Based RF Architecture . . . . . . . . . . . . . 7
1.4 Thesis Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2 Proposed RF Architecture 11
2.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.2 Local Buffer Based Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.2.1 RISO Operand Buffers . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.2.2 SIRO Result Buffers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.2.3 RIRO Buffers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
2.2.4 Qualitative Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
2.3 SIRO Buffers and Conventional VLIW Architecture . . . . . . . . . . . . . . . . . 24
2.3.1 SIRO Buffers and RF Bypass . . . . . . . . . . . . . . . . . . . . . . . . 24
2.3.2 Advantages of SIRO Buffers . . . . . . . . . . . . . . . . . . . . . . . . . 26
2.4 Reduced Port Second Level RF . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
2.4.1 Processor Design with Shared Port RF . . . . . . . . . . . . . . . . . . . . 28
2.5 Issues with the Proposed Architectures . . . . . . . . . . . . . . . . . . . . . . . . 32
2.6 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
2.7 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
3 Code Generation for Proposed Architecture 39
3.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
3.2 Scheduling-binding Problem and Methodology . . . . . . . . . . . . . . . . . . . 40
3.3 Proposed Scheduling and Binding Algorithm . . . . . . . . . . . . . . . . . . . . 42
3.3.1 Scheduling Priority Function . . . . . . . . . . . . . . . . . . . . . . . . . 42
3.3.2 RF Port Aware Scheduling and Binding . . . . . . . . . . . . . . . . . . . 44
3.3.3 Iterative Schedule Improvement . . . . . . . . . . . . . . . . . . . . . . . 53
3.4 Additional Compiler Support . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
3.4.1 Identifying Global and Local Reads and Writes . . . . . . . . . . . . . . . 55
3.4.2 Code Generation: Register Renaming . . . . . . . . . . . . . . . . . . . . 56
3.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
4 Performance and Energy Models 59
4.1 Model for Fixed Issue-width Processor . . . . . . . . . . . . . . . . . . . . . . . . 59
4.1.1 Performance Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
4.1.2 RF Energy Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
4.2 Modeling of a Generic Processor . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
4.2.1 Performance Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
4.2.2 RF Energy Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
4.3 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
5 Model Validation and Evaluation of the Proposed Architecture 69
5.1 Implementation Framework . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
5.1.1 Base Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
5.1.2 Benchmarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
5.2 Model Validation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
5.2.1 Performance Model for Fixed Issue Width Processor . . . . . . . . . . . . 73
5.2.2 Performance Model for Generic Processor . . . . . . . . . . . . . . . . . . 73
5.3 Architecture Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
5.3.1 Area . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
5.3.2 Number of SIRO Reads . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
5.3.3 Performance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
5.3.4 Direct Interconnect Evaluation . . . . . . . . . . . . . . . . . . . . . . . . 78
5.3.5 Energy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
5.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
6 Varying Issue Width and Scalability 89
6.1 RF and Processor Scalability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
6.1.1 Related Work in Processor Scalability . . . . . . . . . . . . . . . . . . . . 90
6.2 Experimental Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
6.3 Performance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
6.4 Energy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
6.5 Clustered VLIW and Scaling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
6.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
7 Conclusions and Future Work 101
7.1 Contributions and Major Results . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
7.2 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
References 105
List of Figures
1.1 Architecture of a VLIW processor. . . . . . . . . . . . . . . . . . . . . . . . . . . 3
2.1 Example of short life time of variables. . . . . . . . . . . . . . . . . . . . . . . . . 12
2.2 Percentage of global value accesses in various applications. . . . . . . . . . . . . 13
2.3 Cumulative number of reads for different number of cycles after write. . . . . . . . 14
2.4 RF read port usage. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.5 Base local buffer model. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.6 RISO buffer based VLIW Architecture. . . . . . . . . . . . . . . . . . . . . . . . 17
2.7 An example of instructions for RISO based architecture. . . . . . . . . . . . . . . 19
2.8 Detailed RISO Architecture. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.9 SIRO buffer based VLIW Architecture. . . . . . . . . . . . . . . . . . . . . . . . 21
2.10 Detailed SIRO Architecture. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.11 Example instructions for SIRO based architecture. . . . . . . . . . . . . . . . . . . 22
2.12 SIRO buffers and RF bypass. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
2.13 Bypass control for one operand of a functional unit. . . . . . . . . . . . . . . . . . 27
2.14 RF-FU interconnection topologies for shared ported register file. . . . . . . . . . . 29
2.15 Direct interconnection. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
2.16 Example direct interconnects and corresponding interconnection matrices. . . . . . 32
3.1 An abstract view of reservation table. . . . . . . . . . . . . . . . . . . . . . . . . . 44
3.2 Example data-flow graph. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
3.3 Operations binding example. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
3.4 Schedule for the 4 issue slot VLIW processor with 4 read port and 4 write port RF. 51
3.5 Example binding conflict graph. . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
3.6 Example binding conflict graph after binding of OP5 and OP6. . . . . . . . . . . . 53
3.7 Schedule for the 4 issue slot VLIW processor with 4 read port and 3 write port RF. 54
3.8 Example: Register renaming. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
4.1 Basic block diagram of the ILP model. . . . . . . . . . . . . . . . . . . . . . . . . 60
5.1 Experiment framework . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
5.2 Model validation against simulation results. . . . . . . . . . . . . . . . . . . . . . 74
5.3 Model validation for different issue width processors and different read write port
configurations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
5.4 SIRO buffer reads for different issue processors. . . . . . . . . . . . . . . . . . . . 77
5.5 Direct interconnection RF architecture exploration. . . . . . . . . . . . . . . . . . 79
5.6 Different direct RF configurations . . . . . . . . . . . . . . . . . . . . . . . . . . 80
5.7 Performance evaluation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
5.8 Effectiveness of RPA scheduling algorithm with respect to a naive algorithm (Set II
benchmarks) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
5.9 Normalized average RF energy for the direct and the complete interconnect topologies. 85
5.10 Normalized RF energy for different benchmarks. . . . . . . . . . . . . . . . . . . 86
6.1 Performance for different issue width processors. . . . . . . . . . . . . . . . . . . 94
6.2 Normalized cycle-delay product for different issue width processors . . . . . . . . 95
6.3 Total RF energy for different issue width processors . . . . . . . . . . . . . . . . . 98
6.4 Normalized performance for clustered VLIW processors . . . . . . . . . . . . . . 99
List of Tables
3.1 Type of operations that can be executed on each function unit . . . . . . . . . . . . 48
5.1 Function unit positions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
5.2 Benchmark characteristics. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
5.3 Comparison of processor core area values with/out SIRO buffer information . . . . 77
5.4 Interconnect matrices for different direct RF configurations . . . . . . . . . . . . . 79
6.1 High ILP Benchmark details . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
6.2 Medium ILP Benchmark details . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
1 Introduction
1.1 Motivation
In recent years, embedded systems have seen a remarkable change: there is a drift from control
dominated applications to computation intensive applications. For example, embedded systems
such as smart phones, music players, and DVD players execute a number of computation intensive
applications. These systems require high performance processors to meet the computation demand.
The processors also need to consume less energy, as most of these systems are battery driven.
Traditional micro-controllers and RISC based processors consume very little energy but often do
not meet the performance requirements. Therefore, they are not preferred for high performance
systems. At the other extreme, processors based on superscalar architectures offer high performance
but also consume a lot of energy. In superscalar processors, instruction level parallelism is determined
by hardware and multiple instructions are executed concurrently. Finding parallel instructions in
hardware leads to complex logic and high energy consumption. Such processors are therefore more
suitable for systems where the energy constraint is not very stringent.
In between the above two extremes lies the choice of very long instruction word (VLIW)
processors. In VLIW processors, instruction level parallelism is determined at compile time. Similar
to superscalar processors, multiple operations are executed concurrently, but without complex
hardware. Therefore, VLIW processors meet the requirements of both high performance and low
energy.
Various kinds of application specific processors, such as DSP processors, are also used for high
end embedded applications. These also often exhibit VLIW-style instruction level parallelism.
There are many examples of commercial processors with VLIW architectures, such as ST Microelectronics's
Lx [Faraboschi et al., 2000], Intel's Itanium [McNairy and Soltis, 2003], TI's 320C6x
[Seshan, 1998], NXP's Trimedia [van Eijndhoven et al., 1999] and Analog Devices' TigerSharc
[Fridman and Greenfield, 2000].
In recent years, a trend towards multi-core architectures has been observed. Multi-core architectures
also give the benefits of low energy and higher performance. Thread level parallelism
available in applications is exploited in multi-core architectures to enhance performance. The VLIW
approach is orthogonal to the multi-core approach, as each core can be a VLIW processor, and
therefore, both instruction level and thread level parallelism can be exploited to achieve higher
performance. Intel's Itanium 3, Fujitsu's FR1000 [Shiota et al., 2005], SiliconHive's Avispa [sil],
and Tilera's TILE64 [til] are a few examples of multi-core architectures with a VLIW processor as
the core. There is also evidence of a VLIW processor being used as one of the cores [Stolberg
et al., 2005] in a multi-core design. Therefore, the study of VLIW architectures is important for high
end embedded applications.
Figure 1.1 shows a simplified architectural view of a typical VLIW processor with issue width
N. The issue width of a processor is defined as the maximum number of operations that can be
executed in parallel. An instruction containing N operations is read from the instruction memory in
the fetch stage. In the decode stage, the opcode is decoded and the operands are read from the
register file (RF). In the execute stage, the operations are executed. If an operation is a memory
operation, it reads data from or writes data to the data memory in the memory stage. The results are
written back to the RF in the write back stage. All these stages are pipelined so that a new instruction
can be fetched every cycle. To avoid data hazards, results produced by the FUs are provided at the
inputs of the FUs via bypass paths. In different commercial VLIW processors, the basic architecture
remains the same, though there may be extra hardware or a different number of pipeline stages to
improve the performance or energy of the processor.
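The benefit of overlapping these stages can be sketched with a tiny back-of-the-envelope model (a hypothetical illustration, not the simulator used in this thesis): in a hazard-free pipeline, a stream of instruction bundles completes in the pipeline fill time plus one cycle per bundle.

```python
# Hedged sketch: a hazard-free model of the 5-stage VLIW datapath described
# above. The stage names and the no-stall assumption are ours.

STAGES = ["fetch", "decode/read", "execute", "memory", "writeback"]

def total_cycles(num_instructions: int, num_stages: int = len(STAGES)) -> int:
    """Fill the pipeline once (num_stages cycles for the first instruction),
    then retire one instruction per cycle for the rest."""
    return num_stages + num_instructions - 1

# 100 bundles finish in 104 cycles instead of 500 when the stages overlap.
print(total_cycles(100))  # -> 104
```

With five stages the fill cost is amortized quickly, which is why a new instruction can usefully be fetched every cycle.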
For higher performance when we increase the issue width of a processor, the number of ports of
instruction memory remains the same (only width of memory bus increases), ports of data-memory
Figure 1.1: Architecture of a VLIW processor.
increases only if memory function units are added, but the number of RF ports always increases. In
other words, the RF is always affected when there is a change in the issue width. If one function
unit (FU) requires two read ports and one write port, then for N FUs, 2N read and N write RF ports
are usually present
with 1-to-1 connections between the FU ports and the RF ports. It has been observed by Zyuban
and Kogge [1998] that RF power increases super-linearly (N^2 to N^3) with the number of ports. Rixner
et al. [2000] have shown that the area and access time of the RF increase on the order of N^3 and N^{3/2},
respectively. Experimental data also suggest that in VLIW processors a multiported RF consumes a
significant fraction of the total processor energy [van de Waerdt et al., 2005; Lambrechts et al., 2005].
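These trends can be made concrete with a small back-of-the-envelope sketch. The exponents follow the figures quoted above; the port count per FU (2 read + 1 write) and the normalization to a 1-issue RF are our own illustrative choices, not measured values.

```python
# Hedged sketch: relative RF cost versus issue width, assuming a conventional
# 2-read + 1-write port budget per function unit and the super-linear
# exponents cited above (power ~ p^2..p^3, area ~ p^3, access time ~ p^1.5).

def rf_ports(issue_width: int) -> int:
    """A conventional N-issue VLIW RF has 2N read + N write = 3N ports."""
    return 3 * issue_width

def relative_cost(issue_width: int, exponent: float) -> float:
    """Cost of an N-issue RF relative to a 1-issue RF for a given exponent."""
    return (rf_ports(issue_width) / rf_ports(1)) ** exponent

# Going from 1-issue to 4-issue quadruples the port count: area (p^3) grows
# 64x while access time (p^1.5) grows 8x.
print(relative_cost(4, 3.0), relative_cost(4, 1.5))  # -> 64.0 8.0
```

Even under these rough assumptions, the port count, not the register count, dominates the cost of scaling the RF.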
As the number of FUs increases in a VLIW processor, the port requirement of the RF increases.
With the increase in the number of ports, the area, power, and access time of the RF increase super-
linearly. This makes the RF the least scalable component in high issue width VLIW processors.
Therefore, there is a need to design an RF that scales better as the number of FUs in a VLIW
processor increases.
Problem Statement
The objective of this research is to design an RF architecture that is low energy consuming as
well as scalable in terms of performance for VLIW processors. In VLIW processors, the order of
execution, as well as the set of operations that execute in parallel, is determined by the compiler.
Therefore, designing the RF architecture of a VLIW processor also involves associated compiler
development. The compiler algorithms are necessary for the correct functioning of the modified
hardware as well as for enhancing the performance of the processor. In this thesis we focus on the
energy as well as the scalability aspects of the RF for VLIW processors. Scalability of the RF is
determined in terms of area, execution cycles, execution time, and energy.
A number of RF architectures have been proposed in the past for different processors and with
different motivations. Before going ahead, we first look at these architectures.
1.2 Previously Proposed RF Architectures
We classify the previously proposed RF architectures into three categories: first, single level
monolithic RF; second, single level multibanked RF; and third, multi-level RF with a monolithic RF
at each level.
1.2.1 Single Level and Monolithic RF Architecture
Monolithic or centralized RF architecture is the RF architecture used in traditional design of pro-
cessors. Various techniques have been proposed to optimize the monolithic RF. Sangireddy [2007]
suggests to reduce the number of RF ports such that issue logic selects the instructions based on its
number of operands. Instructions with two operands are issued to specific slots and instruction with
lower number of operand requirement are issued to other slots. Park et al. [2002] further reduce
RF ports by reducing the operand requirement by not reading those values which are available in
bypass paths.
Packing more variables into a single register has been suggested as a way to increase
the effective number of registers in a given register file. Ergin et al. [2004] suggest using a single
register to store more than one value of smaller bit-width. They also suggest bit-width aware
register allocation in hardware. Kondo and Nakamura [2005] suggest using different banks for the
lower significant bits and the upper significant bits. If one subword has all zero bits, that subword is
released at the writeback stage and can be used by other operands. Gonzalez et al. [2004a] allocate
the same RF space to two variables if their values are identical. In another approach, bit-width
awareness is used to reduce the width of a few RF ports [Aggarwal and Franklin, 2003]. The overall
number of ports remains unchanged, but the width reduction of some ports saves RF energy.
1.2.2 Single Level Multibanked RF Architecture
In this class of architectures there are multiple register files, each with fewer registers and
ports. If each RF bank is connected to all FUs, it is termed a banked RF in the literature. If each RF
bank is connected to a subset of FUs, it is called a clustered architecture. Both of these architectures
are discussed next.
Banked RF Architecture
The banked RF mimics the behavior of a single RF with a large number of ports. However, there can
be conflicts in accessing an RF bank; e.g., if an RF bank has a single read port, then only one FU can
read the bank in a cycle. Conflicts in accessing banks lead to performance penalties. In superscalar
processors, the conflict management is done in hardware, while for VLIW processors, the conflicts
are managed at compile time.
To avoid the conflicts, Balasubramonian et al. [2001] suggest reading partially, i.e., if one operand
is available from a bank, that operand is read and latched until the other operand is read from the other
RF bank. Tseng and Asanovic [2003] suggest dividing the RF ports into left ports and right ports.
Left ports and right ports are connected to the left and right ports of the FUs, respectively. This reduces
the size of the crossbar connecting the FU ports and the RF ports. In their case, port arbitration is done in
a separate pipeline stage. To reduce the conflicts, the authors also suggest using values from the bypass
network. Pericas et al. [2004] also suggest resolving the conflicts with an arbiter. Ayala et al. [2004]
suggest an approach in which register allocation by the compiler partially controls the register renaming
to reduce the conflicts. Conflict management is still done in the hardware. In all these techniques
[Tseng and Asanovic, 2003; Pericas et al., 2004; Ayala et al., 2004], in case of a port conflict, the
instruction either waits for the port or is killed and reissued.
In RISC and VLIW processors, no register renaming is performed in hardware, so the compiler
can allocate registers of different banks to different operands and no port conflict is encountered
in the hardware. In [Llosa et al., 1994, 1995] a compiler based technique of RF banking
for VLIW processors is presented. They present a two bank model, one bank with a large number of
ports and the other with fewer ports. The registers in the two banks may be mutually exclusive or
may be present in both banks, with consistency managed by the compiler. In a similar effort,
Nalluri et al. [2007] suggest register file banking for a RISC architecture, where the compiler assigns
the most accessed registers to the smallest bank and the rest to the other banks.
Clustered Architectures
In clustered architectures, each FU port cannot access all the RF banks. Because of the limited
connectivity, an RF-to-RF interconnection network is required. Clustered RF architectures have been used
in both superscalar and VLIW processors. Superscalar processors manage the inter-RF
communications using hardware mechanisms [Palacharla et al., 1997; Yeager, 1996; Farkas et al.,
1997], while in VLIW architectures, the compiler inserts explicit instructions to copy an operand from
one RF to another when required [Capitanio et al., 1992; Seshan, 1998; Faraboschi et al., 2000;
Gangwar, 2005].
The Alpha 21264 has replicated registers in the register file of each cluster [Kesseler, 1999]. A write is
broadcast, while a read is done from the local register file. This reduces the read ports of each RF and
also allows a single port to be used for both reads and writes. Along similar lines, Gonzalez et al. [2004b]
suggest using three register files instead of one, which are used in different stages of the processor
pipeline, with a write broadcast to all the RFs. Due to the different usage patterns, the number of ports
and the size of each RF are smaller than those of the centralized RF.
1.2.3 Two Level RF Architectures
A two level RF, also known as an RF cache or a hierarchical RF, reduces the number of registers in
the RF that is in direct contact with the FUs, while the RF ports remain the same. The other RF bank
may have more registers and a longer access time.
For superscalar architectures, several variations of the two level RF have been explored. Cruz et al.
[2000] discuss a two level RF architecture in which operands are read from the first level, while
results are written to both levels. To bring values from level two to level one, the authors suggest
prefetching and caching techniques. Balasubramonian et al. [2001] and Sangireddy [2004] suggest
that only the level one RF be visible to the reorder buffer and rename table (of a superscalar architecture),
and propose hardware that copies values to the level two RF when they are needed only in case of branch
mis-prediction. Reinman [2005] also introduces a similar idea of having two register files, an operand
register file and a speculative register file; the latter is used only in case of branch mis-prediction. In a
VLIW processor, an explicit move operation is required to copy a register from one RF level to the
other [Zalamea et al., 2000].
1.3 Proposed Solutions: Local Buffers Based RF Architecture
We propose an architecture with two levels of register files, where the first level is partitioned into
a number of banks, called local buffers, that are associated with each FU or issue slot, and the second
level is a monolithic RF. With the two level organization, we reduce the number of accesses to the level
two RF. Local buffers at the first level localize the FU-RF interconnection, avoiding the disadvantage of
the banking solution. With the reduced number of accesses to the second level RF, we propose a reduced
port architecture for the second level RF.
The proposed architecture has the advantages of most previously proposed architectures. The local
buffers, being connected to each FU, avoid RF-FU interconnects for the level one RF. The first level RF
is scalable as it is distributed. Energy is reduced due to fewer accesses to the second level RF. Port
reduction in the second level RF further reduces the energy consumed and leads to a scalable RF.
With the two level RF and partitioning at level one, there are several interconnection possibilities.
We explore these possibilities to arrive at a low energy solution. To further reduce the RF
energy, we reduce the number of ports of the second level RF. For port reduction, RF ports are shared
among FU ports. We study RF-FU interconnects and propose a direct interconnect approach that
consumes the least energy while remaining effective in performance.
In VLIW architectures, the order of execution is planned by the compiler, so the compiler
has an important role in all architectural optimizations. We study the required compiler support
and propose a scheduling and binding algorithm that improves the performance and effectiveness of
the proposed RF architecture. Our scheduling and binding algorithm (1) increases the utilization of
local buffers, (2) minimizes the performance loss due to port conflicts caused by the reduced port RF,
and (3) efficiently binds operations to FUs in the presence of the given RF-FU interconnects.
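To give a flavor of what port-aware scheduling involves, the sketch below is a generic priority-based list scheduler that caps the operations issued per cycle by both the issue width and an RF write-port budget. It is only an illustration of the idea, not the RPA algorithm of Chapter 3; the example DAG, the unit latencies and the name-order priority are our own assumptions.

```python
# Hedged sketch: list scheduling with an RF write-port limit. A minimal
# stand-in showing how a port budget lengthens the schedule; assumes
# unit-latency operations and alphabetical priority.

def list_schedule(deps, issue_width, write_ports):
    """deps maps each op to the set of ops it depends on.
    Returns a dict mapping each op to its issue cycle."""
    remaining = dict(deps)
    scheduled = {}
    cycle = 0
    while remaining:
        # An op is ready once all its predecessors finished in earlier cycles.
        ready = sorted(op for op, preds in remaining.items()
                       if all(scheduled.get(p, cycle) < cycle for p in preds))
        # Both issue slots and RF write ports bound the ops issued this cycle.
        for op in ready[:min(issue_width, write_ports)]:
            scheduled[op] = cycle
            del remaining[op]
        cycle += 1
    return scheduled

dag = {"a": set(), "b": set(), "c": {"a"}, "d": {"a", "b"}}
# With 2 write ports this DAG finishes in 2 cycles; with only 1 write port
# the same DAG needs 4 cycles even though 2 issue slots are available.
print(list_schedule(dag, issue_width=2, write_ports=2))
print(list_schedule(dag, issue_width=2, write_ports=1))
```

The gap between the two schedules is exactly the kind of port-conflict penalty that a smarter scheduler, such as the one proposed in this thesis, tries to minimize.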
We also propose a theoretical model of the performance of the reduced port RF architecture.
The model takes the characterized application and the processor architecture as input and estimates
execution cycles and energy. The issue width, the number of RF read ports, and the number of RF
write ports describe the processor. The model is validated against simulations of various benchmarks
over various issue width processors. The model also establishes the scalability of the proposed
RF architectures.
The scheduling and binding algorithm is implemented using the Trimaran compiler framework
[Chakrapani et al., 2005]. Various benchmarks from Mediabench and MiBench are used for the
experiments. Through these experiments we show the RF energy reduction and the scalability of the
proposed architecture. The main contributions of this thesis are the following:
1) Proposed, analyzed and explored a new RF architecture with local buffer at first level and reduced
port RF at second level.
2) Proposed scheduling and binding algorithms for the proposed architecture that optimizes energy
and performance of applications.
3) Developed theoretical models for performance and energy estimations of the proposed architec-
ture.
4) Demonstrated scalability of the proposed architecture.
1.4 Thesis Outline
The rest of the thesis is organized as follows. Chapter 2 discusses the proposed RF architectures in detail; both the local buffer based RF architecture and the reduced port RF architecture are covered. Compiler aspects of the proposed RF architecture and the scheduling and binding algorithms are discussed in Chapter 3. Performance and energy models are discussed in Chapter 4. In Chapter 5 we validate our model and evaluate the proposed architecture with experiments on a fixed issue width processor. In Chapter 6 we vary the issue width of the processor and show the scalability of the proposed architecture. Conclusions and future work are discussed in Chapter 7.
2 Proposed RF Architecture
A VLIW processor with N issue slots usually has a register file with 2N read ports and N write ports
to support the following:
• Each of the N concurrent operations can simultaneously read two operands and write one result.
• A value produced by any FU can be read by any FU.
• A value written in any cycle can be read after any amount of delay.
However, the real situation is not as demanding, and the design can be simplified to reduce the chip area and/or power consumption. In this chapter we examine the typical demands imposed on VLIW RFs and present an energy-saving architecture.
2.1 Motivation
It has been observed that most values stored in the RF have a short lifetime confined to a basic block. A 'use' of a value is defined as a local read if the value is defined in the same basic block; otherwise it is called a non-local read. Similarly, if all uses of a definition are in the same basic block then the definition is termed a local write, else it is a non-local write. For example, Fig. 2.1 shows a high level statement and the corresponding assembly level code. In this code, the variables Aaddr, Baddr, Caddr, A, B, and C are produced (defined), consumed (used), and not used again; thus they have a short lifetime in the RF. Such variables need not be stored in the RF if local, temporary storage is available.
A high level code example:
C[x] = A[y] + B[z]
Assembly level code:
1. Aaddr <- Abase + y
2. Baddr <- Bbase + z
3. Caddr <- Cbase + x
4. A <- load Aaddr
5. B <- load Baddr
6. C <- A + B
7. store C Caddr
Figure 2.1: Example of short lifetime of variables.
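The local/non-local classification defined above can be sketched in a few lines. This is an illustrative model, not the thesis's Trimaran-based tooling: the program is represented as a straight-line list of (block_id, defs, uses) operations, a simplification that ignores control-flow joins.

```python
# Hypothetical sketch of the local/non-local read and write classification.
# A read is local if its register was last defined in the same basic block;
# a write is local if every use of that definition is in the same block.
def classify_accesses(ops):
    last_def = {}      # reg -> (block_id, def_key) of most recent write
    write_local = {}   # def_key -> True while all its uses are local
    stats = {"local_reads": 0, "nonlocal_reads": 0}
    for i, (blk, defs, uses) in enumerate(ops):
        for r in uses:
            if r in last_def and last_def[r][0] == blk:
                stats["local_reads"] += 1      # defined in the same block
            else:
                stats["nonlocal_reads"] += 1   # crosses a block boundary
                if r in last_def:
                    write_local[last_def[r][1]] = False
        for r in defs:
            key = (i, r)
            last_def[r] = (blk, key)
            write_local[key] = True            # local until proven otherwise
    stats["local_writes"] = sum(write_local.values())
    stats["nonlocal_writes"] = len(write_local) - stats["local_writes"]
    return stats
```

On the code of Fig. 2.1, every definition and use falls in one block, so all accesses are classified local; a use in a later block turns both the read and the producing write non-local.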
To generalize this observation, we performed an experiment with a set of Mediabench and MiBench benchmarks. We used Trimaran [Chakrapani et al., 2005] and its simulator to obtain information about local reads and writes. The liveness information present in the compiler is used to mark a read/write as local or global. From the experiments we observe that on average only 44% of reads and 26% of writes are global (Fig. 2.2); the remaining reads and writes are local.
Local reads/writes allow the compiler to (a) schedule reads and writes such that an operand is available directly from the output of an FU, and (b) compute the number of cycles between a write and its reads and order them accordingly. Point (a) leads to chaining of operations in the schedule. Gangwar [2005] observes that most applications contain a large number of long chains of operations that can be mapped to a set of FUs (called a cluster). To exploit this locality we suggest physical local buffers associated with each FU. Associating buffers with each FU distributes the storage and leads to a scalable design. Point (b) helps in designing these buffers such that input or output can be serial, which simplifies the buffer design.
To find the approximate size of these buffers, we performed another experiment: we calculated the number of operand reads from the RF within n cycles of the write, for different values of n. To find these
[Figure 2.2 appears here: a bar chart of the percentage of global reads and global writes (0–100%) for each benchmark (basicmath, dijkstra, bitcount, blowfish, FFT, patricia, qsort, sha, g721encode, g721decode, gsmdecode, gsmencode, unepic, rawcaudio, rawdaudio, pegwitdec, pegwitenc) and their average.]
Figure 2.2: Percentage of global value accesses in various applications.
values, the benchmarks were compiled and simulated in the Trimaran infrastructure for different issue widths. The results are shown in Fig. 2.3 as the cumulative number of reads, expressed as a fraction of total reads. The figure shows that beyond two cycles, the increase in the number of reads diminishes. In other words, if each result value is kept in a local buffer for two cycles, most local reads will be served from local buffers. The number of cycles for which a result value remains in the buffer represents the depth of the local buffer. The experiment also suggests that the depth of the buffer can be made small (for example, 2) without any significant loss of performance. We redefine local reads as those which are read from local buffers. A write is redefined as a local write if all its uses are read from local buffers. All other reads and writes are non-local reads and non-local writes, respectively.
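The buffer-depth experiment described above can be approximated from an access trace. A minimal sketch, assuming a trace of (cycle, kind, register) events rather than the actual Trimaran simulator output:

```python
# Sketch of the buffer-depth experiment: for each n, compute the fraction
# of reads that occur within n cycles of the write that produced the value.
from collections import Counter

def reads_within_n(trace, max_n=20):
    write_cycle = {}   # reg -> cycle of most recent write
    gaps = Counter()   # read-after-write distance histogram
    total_reads = 0
    for cycle, kind, reg in trace:
        if kind == "write":
            write_cycle[reg] = cycle
        elif kind == "read" and reg in write_cycle:
            gaps[cycle - write_cycle[reg]] += 1
            total_reads += 1
    # cumulative fraction of reads captured by a local buffer of depth n
    cum, frac = 0, []
    for n in range(max_n + 1):
        cum += gaps[n]
        frac.append(cum / total_reads)
    return frac
```

The returned list corresponds to one curve of Fig. 2.3; the knee at n = 2 is what motivates the small buffer depth.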
The values which are not read from local buffers (non-local values) are read from the second level register file. The second level register file is a monolithic register file with 2N read ports and N write ports, as there is a 1-to-1 connection between the FU ports and the RF ports. However, the average RF port usage per cycle is less than 3N, because some operations, such as mov, have only one
[Figure 2.3 appears here: cumulative number of reads (as a fraction of total reads, 0–1) versus the cycle after write, n (0–20), for 16, 8, and 4 issue processors.]
Figure 2.3: Cumulative number of reads for different numbers of cycles after write.
operand to read, while in some other operations, immediate operands do not require a register access, and memory store operations do not write back to the RF. The fact that the average instructions per cycle (IPC) for an application is less than the peak IPC also contributes to lower average RF port usage per cycle. Further, with the local buffer based architecture, the RF read and write traffic is expected to drop significantly. Figure 2.4 shows the effect of these factors on average RF port usage for a set of high ILP benchmarks, executed using the Trimaran compiler framework. The figure shows the difference between the peak read port requirement and the port requirement due to average parallelism (given by 2 × average IPC). Average port usage is even lower because certain operations require fewer ports, and it reduces further when operands are available in local buffers. For example, for a 6 issue processor, on average only 3 read ports are used even though the maximum port usage is 12 read ports; 12 read ports in this case clearly waste the available RF bandwidth. Therefore, the ports of the second level RF can be reduced.
In summary, the motivations for the local buffer based architecture are the following:
• Local reads and writes are captured in local buffers.
[Figure 2.4 appears here: RF reads per cycle (0–24) versus the number of FUs, N (2–10), comparing maximum RF reads (2N), 2 × average IPC, average RF reads, and average non-local reads.]
Figure 2.4: RF read port usage.
• The distributed nature of local buffers increases scalability.
• The small size of the buffers ensures that the clock period is not stretched.
• The traffic to the second level RF is reduced.
• The second level RF has fewer ports.
Based on these motivations, we associate one local buffer with each input or output port of an FU. In the following sections we describe the possible variants of the local buffer RF architecture in detail and discuss their architectural impacts.
2.2 Local Buffer Based Architecture
We propose one local buffer to be associated with each input or output port of an FU, as shown in Fig. 2.5. Buffers and FUs are connected by an interconnection network whose structure depends on
[Figure 2.5 appears here: (a) local buffers associated with FU output ports; (b) local buffers associated with FU input ports. In both cases, buffers connect FU outputs to FU inputs through an interconnection network.]
Figure 2.5: Base local buffer model.
the input/output behavior of the buffers. Buffers associated with FU input ports are called operand buffers, and those associated with FU output ports are called result buffers.
In an operand buffer based architecture, each FU input port is connected to a dedicated operand buffer and the FU always reads from it. The operand address in these buffers may be fixed or variable. When the operand address is fixed, all values in a buffer must be read sequentially from the same address, while with variable operand addresses, any value can be read directly. As values can be written to any address of the operand buffers, these buffers are referred to as 'random in sequential out' (RISO) and 'random in random out' (RIRO) operand buffers, respectively.
Similarly, in the case of result buffers, results are written either to fixed register locations of the buffer or to arbitrary locations, while values can be read from any location. These buffers are called 'sequential in random out' (SIRO) result buffers and 'random in random out' (RIRO) result buffers, respectively.
Among all these possibilities, there is conceptually also a 'sequential in sequential out' (SISO) operand or result buffer. Such a buffer may be considered a perfect FIFO or queue, where results are written in a predetermined order and operands are read in the same order. As the
[Figure 2.6 appears here: a register file connected to FU1–FU3, with each FU input port fed by a RISO buffer (r1–r6).]
Figure 2.6: RISO buffer based VLIW architecture.
order of production and consumption of operands is rarely identical, these buffers are of little use, though they may suit some specific application or architecture. For example, in ASIC design, FIFOs have been used for efficient synthesis [Balakrishnan and Khanna, 2000].
Next we discuss the characteristics and behavior of architectures with RISO, SIRO, and RIRO
buffers.
2.2.1 RISO Operand Buffers
In architectures with RISO buffers, one buffer is associated with each FU input port. In each cycle, the FU reads its operand from a predefined register (usually the first) of its RISO buffer. The remaining contents of the buffer are shifted by one address location, as in a shift register. FUs can write to any location in the buffer. As operands are always read from a predefined address of the RISO buffer, it is not necessary to specify operand addresses in the instruction format; only the write (result) address is needed in the instruction. The datapath with RISO buffers for a three-FU VLIW processor is shown in Fig. 2.6.
For the correct execution of instructions, FUs write their results to a particular RISO buffer at a pre-calculated address. To disambiguate, each RISO register gets a unique address, and the RISO address space is disjoint from the RF address space. For a RISO buffer of depth k, the time difference between the production of a result and its consumption by another FU must be at most k. If the time difference is more than k, the value is stored in the second level RF. The value is also stored in the second level RF if there is a branch operation between the producer and consumer instructions.
Operands that are available in the second level RF are first written to RISO buffers, as the FU reads its operands only from RISO buffers. For reading a value from the register file, a move operation (movb) is used. In this architecture, an instruction requires only result address fields. However, some writes go to RISO buffers while others go to the second level RF; therefore, we propose to use two result address fields, one for each kind of write. This instruction format imposes an additional constraint: when a result value may be used by more than one operation, the result is written to the RF and accessed from the RF. The instruction formats of the two types of operations in the RISO architecture are shown below:
OPCODE <buffer address> <RF address>
MOVB <buffer address> <RF address>
It may be noted that the requirement of movb instructions and the constraint of a single RISO buffer destination are not inherent characteristics of RISO buffers; they follow from the suggested instruction format. A different instruction format would raise other issues; for example, allowing multiple RISO buffer destinations in an instruction requires additional instruction bits, and the number of RISO buffer destinations that suffices depends on the application.
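The constraints above (depth bound, basic-block boundary, single consumer) amount to a simple compiler-side eligibility test. A hypothetical sketch — the predicate name and signature are ours, not the thesis's:

```python
# Sketch: can a value be kept in a RISO buffer of depth k, or must it be
# routed through the second level RF? Per the constraints in the text:
# the read must follow the write by at most k cycles, producer and
# consumer must be in the same basic block (no branch in between), and
# the suggested instruction format permits only one RISO destination.
def fits_in_riso(write_cycle, read_cycles, same_block, k):
    """read_cycles: scheduled cycles of all uses of the value."""
    if not same_block or len(read_cycles) != 1:
        return False   # branch in between, or multiple consumers
    return 0 < read_cycles[0] - write_cycle <= k
```

Any value failing this test is written to (and later read from) the second level RF via a movb.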
Fig. 2.7 shows example instructions for a RISO based architecture. These instructions are the result of ASAP (as soon as possible) scheduling of the assembly code given in Fig. 2.1. Operations that can be executed in parallel are separated by semicolons. Notice that instruction 0 contains movb operations that copy operands into RISO buffers. In the example code, a RISO buffer address is the concatenation of the buffer name (as shown in Fig. 2.6) and the register number in that buffer, and the opcode also reflects the FU binding. The value written to register r1_1 in instruction 0 is consumed by instruction 1; the value written to r1_1 by instruction 1 is consumed by the next instruction, and so on. Similarly, in instruction 1 the result of ADD.3 is written to r6_3, and it is read in instruction 4.
RISO Code
0: MOVB r1_1, Abase; MOVB r2_1, y; MOVB r3_1, Bbase; MOVB r4_1, z; MOVB r5_1, Cbase; MOVB r6_1, x;
1: ADD.1 r1_1, X; ADD.2 r3_1, X; ADD.3 r6_3, X;
2: LOAD.1 r1_1, X; LOAD.2 r2_1, X;
3: ADD.1 r5_1, X;
4: STORE.3;
Figure 2.7: An example of instructions for a RISO based architecture.
Hardware Implementation of RISO Buffers
There are two possible implementations of the RISO buffer. In the first, the values are shifted every cycle. This implementation is performance effective but costly in terms of energy, as every buffer location is written in every cycle.
In the other implementation, the contents of the buffer locations are not shifted every cycle. Instead, it uses a modulo-t counter based address generator, where t is the depth of the RISO buffer. Figure 2.8 shows a detailed view of a RISO buffer of depth 3. The modulo address counter controls Mux2, while Mux1 selects the input from the various FUs and the RF. The number of inputs to Mux1 is N + 1, where N is the number of FUs, and the number of inputs to Mux2 equals the depth of the RISO buffer. Note that Mux2 is absent in the shift register based implementation; Mux2 therefore represents the trade-off between the performance oriented and the energy efficient implementation.
The architecture with RISO buffers provides a path from FU outputs to FU inputs. In other words, a full bypass network is implicitly present in the RISO architecture.
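The modulo-counter implementation can be modeled behaviorally. A sketch under our own timing assumption (an offset of 0 means the value is read in the current cycle; the compiler is assumed to compute offsets within the buffer depth), with hypothetical names:

```python
# Behavioral sketch of the modulo-counter RISO implementation: values are
# written at compiler-computed offsets relative to the head pointer, and
# the FU always reads the head slot; the head advances each cycle instead
# of shifting the data, saving the per-cycle write energy of a shifter.
class RisoBuffer:
    def __init__(self, depth):
        self.depth = depth
        self.slots = [None] * depth
        self.head = 0                  # modulo-depth address counter

    def write(self, offset, value):
        # offset = cycles from now until the value is read (0 = this cycle);
        # the compiler guarantees 0 <= offset < depth
        self.slots[(self.head + offset) % self.depth] = value

    def read_and_advance(self):
        value = self.slots[self.head]  # FU reads the predefined (head) slot
        self.head = (self.head + 1) % self.depth
        return value
```

The shift-register variant would instead move every slot each cycle; here only the head pointer (the role of Mux2's select) changes.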
[Figure 2.8 appears here: a RISO buffer whose write ports from FU1–FU3 and the register file are selected by Mux1, with address lines and Mux2 selecting the register to be read.]
Figure 2.8: Detailed RISO architecture.
2.2.2 SIRO Result Buffers
In the SIRO result buffer based architecture, a buffer is associated with each FU output port. The output of an FU is written to a predefined address of the associated buffer. Any FU can read from any SIRO buffer, and values can be read from any location in the buffer.
As with the RISO buffer, there are two possible implementations of the SIRO buffer: one in which values are shifted in each cycle, and another with a modulo-k counter for address generation, for a SIRO buffer of depth k. For the SIRO architecture we prefer the first arrangement, as the pipeline registers of the architecture can be reused. If there are pipeline stages between the execute stage and the writeback stage, their registers can be treated as SIRO registers. If the depth of the SIRO buffer exceeds the number of available pipeline registers, extra registers can be added, acting as a shift register. The architectural view and the detailed view of the architecture with SIRO buffers are shown in Figs. 2.9 and 2.10, respectively.
As all FUs can read the values of each SIRO buffer, the structure forms a complete bypass network with a bypass depth of k + 1. The SIRO buffer implementation also suggests a fast implementation of a bypass network with bypass depth greater than one. In the implementation suggested in Fig. 2.10,
[Figure 2.9 appears here: a register file connected to FU1–FU3, with each FU output port feeding a SIRO buffer (s1–s3).]
Figure 2.9: SIRO buffer based VLIW architecture.
[Figure 2.10 appears here: the detailed SIRO datapath, with Mux1 selecting an FU operand from among the SIRO registers and the register file.]
Figure 2.10: Detailed SIRO architecture.
the number of inputs to Mux1 increases by one with respect to a traditional bypass design with bypass depth one.
As in the RISO buffer architecture, here too the compiler generates operand addresses. A unique address is assigned to every element of the SIRO buffers, and the compiler, while scheduling, decides whether operands are read from SIRO buffers or from the RF.
An FU can read operands from any SIRO buffer or from the register file; in other words, each SIRO buffer provides operands to all FUs. Therefore, the number of read ports on a SIRO buffer is 2N, where N is the number of function units. It may appear that a SIRO architecture needs no destination address field in the instruction, but the field is required for values to be written to the register file; all non-local writes must be written to the RF. The FU reads operands from either the register file or a SIRO buffer. The RF and the SIRO buffers have disjoint address spaces; thus, separate fields for the RF address and the SIRO address are not required in the instruction. Consequently, the instruction format of the architecture with SIRO buffers remains the same as that of conventional architectures.
SIRO Code
0: ADD.1 s1_1, Abase, y; ADD.2 s2_1, Bbase, z; ADD.3 s3_1, Cbase, x;
1: LOAD.1 s1_1, s1_1; LOAD.2 s2_1, s2_1;
2: ADD.1 C, s1_1, s2_1;
3: STORE.1 s1_1, s3_3;
Figure 2.11: Example instructions for a SIRO based architecture.
The ASAP schedule corresponding to the assembly code of Fig. 2.1 for a SIRO architecture is given in Fig. 2.11. We may observe that the instruction format remains the same as for conventional VLIW architectures. In this example, all results are written to the first register of the corresponding SIRO buffer, because there is no need to write them into the second level register file.
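The shift-based SIRO behavior described above can be modeled in a few lines. A sketch with hypothetical names, under the convention that slot i holds the result produced i cycles ago:

```python
# Behavioral sketch of a SIRO result buffer: the FU result enters at a
# fixed head position each cycle and older results shift down (sequential
# in), while any slot can be read by any FU (random out) — exactly the
# role played by the reused pipeline registers in the text.
class SiroBuffer:
    def __init__(self, depth):
        self.slots = [None] * depth

    def tick(self, new_result):
        # sequential in: shift everything, write at the fixed head position
        self.slots = [new_result] + self.slots[:-1]

    def read(self, i):
        # random out: the operand address selects the result of i cycles ago
        return self.slots[i]
```

A value survives `depth` ticks before falling out of the buffer, which is why a value needed later than that must go to the second level RF.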
2.2.3 RIRO Buffers
RIRO buffers associated with the read and write ports of an FU are called RIRO operand buffers and RIRO result buffers, respectively. Due to random access for both reads and writes, these buffers act as small register files. For both reads and writes to a RIRO buffer, the compiler must provide the appropriate address; RIRO addresses cannot be generated in hardware as in RISO or SIRO buffers.
In architectures with RIRO operand buffers, a RIRO buffer is associated with each input port of an FU. The FU can read any register of its RIRO buffer, and any FU can write to any location in any RIRO buffer. Structurally, an architecture with RIRO operand buffers is similar to an architecture with RISO buffers (Fig. 2.8); the only difference is that the control of Mux2 is generated from the RIRO address instead of a modulo counter. Unlike the RISO buffer, an operand address is required for a RIRO operand buffer. Therefore, the instruction format contains two source operand addresses pointing to RIRO buffers and two result addresses, one pointing to a RIRO buffer and the other to the RF. If both result addresses are not needed simultaneously, the overhead is only an extra bit.
In architectures with RIRO result buffers, one RIRO buffer is associated with each output port of the FU. Structurally, the architecture is similar to the architecture with SIRO buffers; no shifting of data is required, and therefore pipeline registers cannot be reused in the design of RIRO result buffers. Each FU writes to its RIRO result buffer and, if required, to the RF as well. The input operands of FUs are read either from RIRO result buffers or from the RF. The instruction encoding includes two destination addresses, one for the register file and one for the RIRO buffer.
Because of the random access principle, RIRO buffers do not limit the temporal locality of the definition and use of values. Therefore, conceptually, an architecture with RIRO buffers need not have a global register file. Without a global register file, however, the required size of each RIRO buffer would be large, and since the results produced by one FU may be used by more than one FU, a communication mechanism between RIRO buffers becomes essential. Access time would increase due to both these factors, and the large access time in turn implies the absence of an implicit RF bypass, so an explicit bypass network would be required. Such a RIRO architecture without a second level RF is a special case of the clustered VLIW architecture [Capitanio et al., 1992]. For example, in the case of RIRO operand buffers, only one FU can read from a given RIRO buffer while all others can write to it, which is similar to a clustered VLIW architecture with write-across interconnects [Gangwar et al., 2007]. Similarly, RIRO result buffers are equivalent to a read-across clustered VLIW architecture.
If we relax certain interconnection constraints, the RIRO architecture resembles several previously proposed architectures. For example, if all FUs can read from and write to any RIRO buffer, it becomes a banked RF architecture with full interconnect [Balasubramonian et al., 2001]. If a RIRO buffer is shared among multiple FUs, it is similar to a clustered architecture.
2.2.4 Qualitative Analysis
All three architectures described above fulfill our goal of capturing temporally local variables and avoiding RF accesses in such cases. The architectures also physically distribute the local buffers, which eases physical routing.
RIRO buffers are the most attractive alternative from a performance point of view. They have the advantages of RISO/SIRO buffers and, because of random access, they use their capacity most efficiently. Additionally, the use of a value in a RIRO buffer is not limited to the boundaries of a basic block. However, compiler development for a RIRO based architecture is extremely complex, as it involves simultaneous register allocation, register file partitioning, operation scheduling, and FU binding. As this is a first effort in the direction of local buffer based architectures, we chose a subset of the available design space.
RISO and SIRO based architectures are equivalent on most grounds. There are some restrictions in RISO based architectures which lead to lower performance and lower energy. For example, in RISO each second level RF read requires a movb operation; therefore, an additional 30–40% of the original instruction count would be required. Also, results required by multiple FUs need to be routed through the second level RF. However, as indicated before, these restrictions follow from design choices, not from fundamental properties of RISO buffers. We prefer the SIRO architecture for further exploration because the SIRO buffer has the additional advantage of being closest to a conventional VLIW architecture with a full bypass network.
In the next section we discuss the similarities between the SIRO based architecture and the conventional VLIW architecture.
2.3 SIRO Buffers and Conventional VLIW Architecture
2.3.1 SIRO Buffers and RF Bypass
In a conventional pipelined processor, RF bypass paths are provided to avoid data hazards. Operands are read from the RF as well as from the pipeline registers; if a value is available from the pipeline registers, the RF value is discarded. In other words, the RF is bypassed when reading operands that are available in the pipeline paths. A value is available from the RF only after it has been written to the RF. The number of stages from which bypass paths are provided is known as the bypass depth, d.
The SIRO based architecture is similar to this classical VLIW processor design. In the SIRO buffer architecture too, the results of FUs are stored in SIRO buffers and shifted each cycle. If each SIRO register is replaced by a pipeline register, and the depth of the SIRO buffer equals the bypass depth, the SIRO based architecture has the same datapath as a conventional VLIW architecture.
The SIRO buffer depth cannot be less than the bypass depth, as that would lead to data hazards. If the SIRO buffer depth exceeds the bypass depth, additional registers are required to implement the SIRO buffer; the number of such registers is the difference between the buffer depth and the bypass depth. An example architecture with a bypass depth of 2 is shown in Fig. 2.12(a): the FU's input multiplexer gets inputs from the RF, the execute stage, and the writeback stage. If a SIRO buffer of depth 4 is to be implemented in this processor, two additional registers (4 − 2) are required, as shown in Fig. 2.12(b). As in Fig. 2.10, Mux2 multiplexes the outputs of the three registers of the SIRO buffer.
2.3.2 Advantages of SIRO Buffers
The SIRO based architecture has a number of advantages over conventional VLIW architectures: first, redundant RF accesses are avoided; second, bypass control becomes simpler; third, multi-stage bypass gives performance advantages.
[Figure 2.12 appears here: (a) an architecture with bypass depth 2, showing the decode, execute, and writeback stages and the register file; (b) the same pipeline extended with a SIRO buffer of depth 4, with Mux1 and Mux2 at the FU input.]
Figure 2.12: SIRO buffers and RF bypass.
RF Access Avoidance
In conventional VLIW processors, values are always read from the RF and are discarded if the operands are available in the bypass network. This happens because the information that a value exists in the bypass network is not available before the RF read: the RF read and the bypass control computation are done in parallel. In the SIRO based architecture, the address decoding of SIRO registers makes it easy to avoid the RF read.
Each register of a SIRO buffer is uniquely addressed; equivalently, all pipeline registers from which a bypass value can be read are addressed. The address space of the SIRO buffers is mutually exclusive with the RF address space. Therefore, when the operand address is a SIRO address, the RF is not read. Similarly, when the result address is a SIRO address, the result is not written to the RF; if the result address points to the RF, both the RF and the SIRO buffer are written.
[Figure 2.13 appears here: (a) the original bypass control, in which N m:1 operand-selection muxes (32 bits wide) are driven by m × N address comparisons; (b) the modified bypass control, in which N address decoders drive the muxes and generate the RF inhibit bit.]
Figure 2.13: Bypass control for one operand of a functional unit.
Bypass Control Simplifications
In conventional bypass control circuits, for each FU input, depth × N address comparisons are required to determine whether the operand is available in any of the bypass paths of an N issue processor. Therefore, for an N issue processor, the number of comparators required in the bypass control circuit is 2 × depth × N². As the issue width increases, the complexity of the bypass circuit also increases. In general, if the number of possible bypass sources for the ith operand is m, its bypass control is as shown in Fig. 2.13(a).
In the SIRO buffer based architecture, each pipeline register from which a bypass value may come is addressed. The bypass control unit in this case consists of a decoding circuit that expands the address given for each operand, which makes the bypass control simpler. Instead of 2 × depth × N² comparators, only 2N decoders are required, as shown in Fig. 2.13(b).
To inhibit the register read/write, a signal is generated when an operand is read from the bypass. This signal is generated by a combinational circuit based on the address space used by the SIRO registers.
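The saving can be checked with quick arithmetic. A sketch of the two counts as given in the text (the function name is ours):

```python
# Comparator count for conventional bypass control versus decoder count
# for the SIRO scheme: each of the 2N operands either compares its address
# against all depth*N bypass sources, or simply decodes its address once.
def bypass_control_cost(n_issue, depth):
    comparators = 2 * depth * n_issue ** 2   # conventional bypass control
    decoders = 2 * n_issue                   # SIRO address decoding
    return comparators, decoders

# e.g. an 8-issue machine with bypass depth 2:
# 2 * 2 * 64 = 256 comparators versus 16 decoders
```

The quadratic versus linear growth in issue width is the scalability argument made above.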
Multi-stage Bypass
We showed that the SIRO buffer depth can be increased with minimal impact on the clock cycle delay. The implementation of the SIRO buffer also suggests a way to realize multi-stage bypass control, which may be helpful in deep processor pipelines.
2.4 Reduced Port Second Level RF
In architectures with local buffers, non-local reads and writes are served by the second level monolithic RF. If each FU is directly connected to the second level RF, 2N read and N write ports are required for N parallel operations. As suggested at the beginning of this chapter, an RF with such a large bandwidth is not required, and we propose to reduce the number of RF ports by sharing them among FUs.
Reduction of ports may lead to conflicts (we call these port conflicts), causing an increase in execution time. In a shared port RF architecture, RF-FU connections are no longer 1-to-1. Depending on the structure of the RF-FU interconnection network, there may be additional conflicts (we call these path conflicts) due to a lack of interconnection paths, even when ports are available. In this section we present an approach that keeps both kinds of conflicts within acceptable levels while retaining the energy savings.
The RF-FU interconnection problem is independent of the presence of SIRO buffers, as the FU reads and writes the second level RF directly; the SIRO buffers merely reduce the traffic to the second level RF.
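The port-conflict condition that the compiler must guard against can be stated as a simple predicate. A hypothetical sketch (names and representation are ours):

```python
# Sketch of the scheduler-side port-conflict check: with a shared-port RF,
# the operations issued in one cycle must not demand more RF reads or
# writes than the reduced port counts allow.
def has_port_conflict(cycle_ops, read_ports, write_ports):
    """cycle_ops: list of (rf_reads, rf_writes) per operation; accesses
    served by local buffers are assumed already excluded from the counts."""
    reads = sum(r for r, _ in cycle_ops)
    writes = sum(w for _, w in cycle_ops)
    return reads > read_ports or writes > write_ports
```

When the predicate holds, the scheduler must delay one of the operations, which is the performance cost of port reduction.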
2.4.1 Processor Design with Shared Port RF
With the shared port register file, only the register read and write stages of the processor pipeline are modified, while the rest of the processor design remains unaltered. In conventional VLIW architectures, FU and RF ports have one-to-one connectivity. In commercial architectures like VelociTI [Seshan, 1998] and Lx [Faraboschi et al., 2000], the RF ports are shared by a number of function units in the same issue slot; the function units in a particular issue slot form a composite unit that can be viewed as a single FU at the architectural level.
In the proposed shared port RF architecture, one RF port may be connected to more than one FU port. Therefore, an interconnection network is required to map RF ports to FU ports and operand addresses to RF address ports. The mapping logic and the RF read/write access energy depend on the RF-FU interconnection network.
RF-FU Interconnection Network
[Figure 2.14 appears here: (a) complete interconnection, with every RF port connected to every FU; (b) direct interconnection, with each FU port wired to exactly one RF port; (c) partial interconnection, an intermediate topology.]
Figure 2.14: RF-FU interconnection topologies for a shared port register file (bypass paths and SIRO buffers are not shown, to simplify the figure).
We classify the different interconnection topologies into three classes: complete, partial, and direct. Complete interconnection networks form one extreme, in which every FU port is connected to every RF port (Fig. 2.14(a)). At the other extreme are direct interconnection topologies, in which each FU port is connected to only one RF port (Fig. 2.14(b)). Between these two extremes lie numerous partial interconnection topologies (Fig. 2.14(c)). Direct interconnection networks require the least multiplexing and offer the least RF access energy and access delay. On the other hand, complete interconnection eliminates path conflicts by providing all possible interconnection paths, thus offering the best performance.
In the direct interconnection, each FU port is connected to a single RF port; therefore, the FU bindings determine the RF port mapping. In the case of complete interconnection, the port mapping information may be generated by the compiler and passed to the hardware, but this requires additional bits in the instruction. The additional bits would increase the code size and require significant modification of the fetch stage of the processor. The alternative is to do the port mapping in hardware. In that case, the compiler only ensures that the number of RF reads and writes in a cycle is less than or equal to the number of available RF ports.
An example of a direct interconnect is shown in Fig. 2.15. Multiple address inputs are multiplexed in accordance with the interconnection topology. The outputs of the RF need no extra multiplexing. The multiplexing at the input end adds to the delay, but this is more than compensated for by the decrease in the RF access time due to port reduction.
[Figure: direct interconnection example — four FUs with SIRO buffers; operand address inputs op1_addr1 ... op4_addr2 are multiplexed onto the four RF address ports.]
Figure 2.15: Direct interconnection.
In multi-issue processors, different issue slots are usually homogeneous for the most frequent operations, while less frequent operations are available on fewer function units. For example, in the Lx architecture, integer operations can be performed by all FUs, while memory operations can be performed by only one in four FUs. Given this homogeneity of FUs, a careful selection of the interconnection network makes more FUs available as binding options to the scheduler. The path conflicts can therefore be kept fairly low, resulting in a performance close to that of complete interconnection, while retaining the advantages of simple hardware and multiplexer-less connections.
Choosing Direct Interconnection Matrix
A direct interconnection can be defined by a matrix P. P^1_{ij} (respectively P^2_{ij}) is one if there is a path from the ith RF read port to the first (respectively second) input of FU_j. P^w_{ij} is one if there is a path from the ith write port of the RF to the output of FU_j.
There are M^N possible direct interconnection matrices, where N and M are the number of FU ports and RF ports, respectively. This count also includes matrices that use fewer than M RF ports. In an RF, all ports are symmetrical, so any two ports are interchangeable. This property of the RF reduces the number of candidate interconnection matrices significantly: the number of unique interconnection matrices for a given M and N is an order of magnitude less than the total number. Further, the choice can be
narrowed down using the following guidelines:
(i) The two read ports of each FU are connected to different RF ports. This is a necessary condition for a valid interconnection network.
(ii) Each RF port should be connected to an approximately equal number of FU ports, so that the resulting interconnect is balanced. An imbalanced interconnection network leads to more path conflicts.
(iii) We have observed that, on average, the left port of an FU is used four times more than the right port. Therefore, we suggest that an RF port be shared between the left read port of one FU and the right read port of another FU for more balanced sharing. In other words, left ports and right ports should separately be distributed among RF ports, as uniformly as possible.
The above guidelines assume that the most frequent operations can be executed by all FUs of the processor. It turns out that the number of matrices satisfying the guidelines is very small. The total number of possible matrices is M^N, and the number of those which use all M RF ports is given by inclusion-exclusion as ∑_{k=0}^{M} (−1)^k C(M,k) (M−k)^N = M! · S(N, M). The number of unique interconnection matrices, taking into account the symmetry of the RF ports, is given by the Stirling number of the second kind: S(N, M) counts the number of ways to partition a set of N elements into M nonempty subsets [Baghdadi et al., 2000; D'Antona and Munarini, 2000]. The number of interconnection matrices that follow the first, second, and third guidelines can be found by enumerating the partitions counted by S(N, M) and applying the corresponding constraints.
For example, for 8 FU ports and 4 RF ports, 65536 interconnection matrices are possible, of which 40824 use all 4 RF ports and only 1701 (S(8,4)) are unique. The numbers of matrices that follow the first guideline, the first two guidelines, and all three guidelines are 652, 60, and 9, respectively. Any interconnect that follows the given guidelines suffers a smaller performance impact due to path conflicts. A few examples of such interconnects are given in Fig. 2.16. In these examples, 4 RF read ports are shared between 8 FU input ports. Each RF port is connected to two FU input ports; one of them is a left input port and the other a right input port, to ensure homogeneity. The figure also shows the corresponding interconnection matrices. Each row of an interconnection matrix indicates the RF port to which an FU port is connected.
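These counts can be checked with a small brute-force enumeration. The sketch below is illustrative only; the port-numbering convention (FU port 2f is the left port and 2f+1 the right port of FU f) is an assumption of this sketch, not something fixed by the thesis. It canonicalizes matrices under the RF-port symmetry and then applies the three guidelines:

```python
from itertools import product

N_FU_PORTS, N_RF_PORTS = 8, 4   # 4 FUs, each with a left and a right read port

def canonical(assign):
    # Relabel RF ports in order of first appearance so that matrices that
    # differ only by a permutation of the symmetric RF ports coincide.
    relabel, out = {}, []
    for p in assign:
        relabel.setdefault(p, len(relabel))
        out.append(relabel[p])
    return tuple(out)

# Assumed convention: FU port 2f is the left port, 2f+1 the right port of FU f.
unique = set()
for assign in product(range(N_RF_PORTS), repeat=N_FU_PORTS):
    if len(set(assign)) == N_RF_PORTS:            # matrices using all RF ports
        unique.add(canonical(assign))

def g1(a):   # (i) the two read ports of each FU go to different RF ports
    return all(a[2 * f] != a[2 * f + 1] for f in range(4))

def g2(a):   # (ii) balanced: each RF port serves exactly 8/4 = 2 FU ports
    return all(a.count(p) == 2 for p in range(N_RF_PORTS))

def g3(a):   # (iii) each RF port serves one left and one right FU port
    return all(sum(1 for i, p in enumerate(a) if p == rf and i % 2 == 0) == 1
               for rf in range(N_RF_PORTS))

print(len(unique))                                     # S(8,4) = 1701
print(sum(g1(a) for a in unique))                      # 652
print(sum(g1(a) and g2(a) for a in unique))            # 60
print(sum(g1(a) and g2(a) and g3(a) for a in unique))  # 9
```

The enumeration reproduces the counts quoted above: 1701 unique matrices, of which 652, 60, and 9 satisfy the first, the first two, and all three guidelines, respectively.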
2.5 Issues with the Proposed Architectures
The proposed architectures are based on the fact that values available in the SIRO buffers are used directly by the FUs requiring them, bypassing their read from and write to the RF. Also, the instruction points to registers that are temporary in nature, i.e., shifted to the next register in each cycle. Due to these properties of our architecture, a few issues arise, which are discussed below.
[Figure: three example direct interconnects (a), (b), (c), each connecting register file ports 1-4 to FU1-FU4, shown with their interconnection matrices.]
Figure 2.16: Example direct interconnects and corresponding interconnection matrices.
Performance Impact
The SIRO register address space is disjoint from the RF address space. The number of bits allocated to an operand address in an instruction is fixed. Sharing the given address space may therefore reduce the number of available general purpose registers and thus increase the register pressure, which may lead to more spill code and hence a performance loss. If SIRO buffer aware register allocation is used, this performance loss is minimized. Yan and Zhang [2008] suggest such a register allocation approach, which treats the operands available in bypass paths as virtual registers. They show that using virtual registers decreases the register pressure. Therefore, the relative performance would increase if local buffer aware register allocation is used.
Predicated Operations
In VLIW processors, predication is used to increase the basic block size and thereby increase parallelism. Predicated operations are conditionally executed, and the conditions are determined at run time. Due to this non-determinism at compile time, our RISO/SIRO based approach cannot be used with predicated operations. For example, suppose x is a predicated operation and operation y reads its result from a SIRO buffer in the next cycle. If, due to the conditional execution, x is not executed, y will not get the correct operand from the SIRO buffer and the execution will be incorrect. Note, however, that predication is not generally used in commercial processors. For example, Analog Devices' TigerSHARC, TI's 320C6x, and Sun's MAJC have no predication in their instruction sets. ST's Lx has only partial predication support.
Exceptions
If an exception is raised by an instruction whose operands are to be read from SIRO buffers, or an interrupt occurs at this instruction, the operand values will not be in the SIRO buffers after returning from the interrupt/exception unless some micro-architectural change is made in the processor design. Exceptions and interrupts are handled in VLIW processors as in any other pipelined processor, with the difference that we cannot change the addresses of operands in hardware, as all registers are determined at compile time. Ozer et al. [1998] describe how the traditional techniques of exception/interrupt handling can be extended to VLIW processors. They propose to store a copy of the register file in a future file with a reorder buffer, or in a history buffer. Whenever control returns from an exception, the original state can be retrieved from either the history buffer or the future file with the reorder buffer.
In our case, any of these methods can be adopted: the SIRO buffer registers are saved in the history buffer or future file like regular registers, so the transient values can be retrieved at any instant. Extra logic is required to provide the connections and control so that the future file or history buffer can read from (write to) the SIRO buffer registers. A quantitative analysis of the power and performance impact of exception handling is beyond the scope of this work.
Pipeline Stalls
In VLIW processors, all function unit latencies are exposed to the compiler; thus the schedule generated by the compiler is executed as-is by the hardware. Caches are an exception to this rule. For a memory access, the compiler assumes the minimum access latency (i.e., a hit). When there is a data cache miss, the whole pipeline is stalled until the data arrives. In a SIRO buffer based architecture, whenever such pipeline bubbles are introduced, the hardware has to ensure that the pipeline registers do not lose their values. For an instruction cache miss, a pipeline stall usually leads to flushing the rest of the pipeline; in our case, a signal is instead required to stall all the pipeline stages until the value arrives from the I-cache. Our approach is not affected by control hazards, as the compiler analysis is done only within basic blocks.
Downward Compatibility
Code generation has to be done for a specific instance of the proposed architecture. Any additional function unit would change the SIRO buffer registers, which would necessitate regeneration of the code. Also, in the case of direct interconnect, any change in the number of issue slots or the FU placement inside issue slots may change the interconnection matrix, and changes in the interconnection matrix require re-scheduling of the operations. Note that rescheduling is not essential for the reduced port RF architecture with complete interconnect if the new architecture has more FUs or issue slots, though rescheduling would yield better performance.
2.6 Related Work
Local Buffer Architecture Related Work
Queue-based hardware structures have been used earlier in processor datapaths and ASICs. FIFOs are used in RTL datapath synthesis by Balakrishnan and Khanna [2000]. A shift queue based architecture is used in synthesizing ASICs for loop acceleration by Fan et al. [2005] and Schreiber et al. [2002]. In the shift queue architecture, the size, ports, and connections of the queue depend on the application loop to be synthesized. Fernandes [1998] proposes FIFO based buffers for VLIW processors, with a modulo scheduling algorithm which allocates variables with equal lifetimes and different schedule times to one FIFO buffer.
The RISO architecture is also similar to the transport triggered architecture (TTA) [Hoogerbrugge and Corporaal, 1994]. TTA is based on operand transport rather than instruction execution: the basic principle in TTA is that operations occur as by-products of operand transport, while in our architecture both the operand and the operation are explicit. In this way, the RISO based architecture lies in between TTA and VLIW. In the TTA architecture, the register file and functional units are connected by a bus, which may not be scalable, whereas we use point-to-point connections. Another similarity between the buffer based architectures and TTA is that both use a reduced port register file [Corporaal, 1999].
Bypass Related Work
Researchers have used the presence of the bypass network to save register file energy. The idea is to avoid reading from the register file those operands that can be obtained from the bypass network. The opportunities for avoiding an RF read may be detected by hardware [Park et al., 2002] or by the compiler [Sami et al., 2002; Asanovic et al., 2002]. The hardware approach requires the bypass control computation before the RF read, which may increase either the clock period or the number of pipeline stages. Moreover, RF writes cannot be avoided in the hardware approach, as the liveness information of the operands is not available in hardware.
In another direction of research, attempts have been made to reduce the complexity of the bypass network by reducing the number of bypass paths [Ahuja et al., 1995]. The important issues here are the selection of the bypass paths that can be removed, and code generation that minimizes the performance impact of the missing paths. Fan et al. [2003] and Shrivastava et al. [2005] suggest methodologies to explore the design space and select bypass paths for superscalar and VLIW architectures, respectively. Park et al. [2006] and Kudlur et al. [2004] suggest scheduling algorithms to avoid the performance impact of a partial bypass network for superscalar and VLIW architectures, respectively.
Port Sharing
In a single issue processor, RF port sharing among multiple function units is a matter of course, but shared port RF architectures for multiple issue processors have received very little attention. In VLIW processors, port sharing has typically been used at the level of issue slots, i.e., function units in a single issue slot share the same read and write ports [Seshan, 1998]. Our approach goes one step further and permits sharing ports across issue slots. Aditya et al. [1999] proposed an approach with very limited port sharing, in which an FU with a lower port requirement shares an RF port with an FU with a higher port requirement; the port reduction achieved is quite limited.
There are some instances of reduced port RFs in other multi-issue architectures. For example, in [Park et al., 2002; Kim and Mudge, 2003; Sangireddy, 2007; Sirsi and Aggarwal, 2009], the authors suggest port sharing for superscalar processors. The focus there is on port conflict management in hardware; for a VLIW processor, however, a compiler driven solution is required.
2.7 Summary
In this chapter we proposed local buffer based RF architectures for VLIW processors. The RISO, SIRO, and RIRO architectures were studied in detail. We examined their architectural implications critically and further explored the architectures with SIRO buffers. We showed the resemblance of the SIRO architecture to a traditional full bypass network and presented mechanisms for avoiding RF accesses when SIRO buffers are accessed. SIRO buffers are also a promising choice for large bypass depths. Since the SIRO architecture potentially reduces the RF traffic, we proposed a reduced port RF architecture based on it. For the reduced port RF architecture, we explored the RF-FU interconnection network and proposed the direct interconnection, which is simple and has the least hardware cost. For the direct interconnection, we gave guidelines that help in selecting the interconnection matrix.
3 Code Generation for Proposed Architecture
3.1 Motivation
VLIW processors depend on their compilers for extraction of parallelism and resource management.
In other words, the compiler for a VLIW processor must correctly resolve all timing and resource
conflicts, and optimize the code for maximum parallelism.
Correctness is the minimum requirement for a VLIW compiler. For correctness, the detailed VLIW architecture is made visible to the compiler, and the resource constraints are given as inputs to it. The compiler's operation scheduling, FU binding, and register allocation algorithms usually take care of the architectural constraints to produce correct code. For the proposed architecture, in addition to the resource constraints of a conventional VLIW architecture, the following constraints must be satisfied:
(i) Operand and result addresses in the instruction should point to the correct SIRO/RF address,
(ii) The number of RF reads and writes in an execution cycle should be less than or equal to the number of physical ports of the register file, and
(iii) An RF-FU path should be available to an FU whenever required.
The different constraints of the proposed architecture impact different modules of the compiler. The first constraint affects register allocation; the second affects operation scheduling; and the third affects FU binding. Further, operation scheduling and FU binding affect each other, so there is a phase ordering issue between them. As a workaround, we propose an algorithm which performs operation scheduling and FU binding simultaneously.
Further, several optimizations are possible to improve performance and reduce energy consumption. For example, the number of reads and writes to the SIRO buffers can be increased to reduce RF energy; the operation schedule can be optimized to reduce the impact of port conflicts and path conflicts; and register allocation can consider the SIRO register address space to reduce register pressure. In this thesis, we focus on operation scheduling and FU binding algorithms which not only meet the correctness requirement, but also perform performance and energy optimizations. Register allocation is outside the scope of this thesis; however, our register allocation solution ensures correctness.
In the rest of the chapter, Section 3.2 defines the scheduling and binding problem and Section 3.3 discusses the scheduling-binding algorithms in detail. Section 3.4 discusses the other compiler modules assisting in code generation. Section 3.5 summarizes the chapter.
3.2 Scheduling-binding Problem and Methodology
The inputs to the scheduling and binding algorithm are an application program represented as data flow graphs and some parameters representing the architecture. A data-flow graph is a directed acyclic graph G(V, E), where each element of the vertex set V = {v_0, v_1, ..., v_n} corresponds to an operation, and each edge e_ij ∈ E represents the dependency of v_j on v_i. The edge weight d_ij corresponding to e_ij is the minimum schedule time difference between v_i and v_j. Each vertex v_i is also associated with a delay x_i that the operation requires to execute. In this graph, v_n is the sink node, which has no outgoing edge and has incoming edges from all other nodes. We assume that each operation v_i may require up to two operands and may produce a result, as denoted by r^1_i, r^2_i and w_i. The values of r^1_i, r^2_i and w_i are one if v_i reads operand 1, reads operand 2, and writes a result, respectively; otherwise the values are zero.
The architecture parameters considered are the number of issue slots N, the number of RF read ports R, and the number of RF write ports W.
The scheduling problem for the reduced port RF architecture is to find an integer labelling of the operations ϕ : V → Z+ satisfying the following constraints:
(i) The schedule times satisfy the dependency constraints due to all edges in the graph:
ϕ(v_i) ≥ ϕ(v_j) + d_ji   ∀ i, j s.t. e_ji ∈ E.   (3.1)
(ii) The total number of operand reads in any schedule time must not exceed the number of RF read ports:
∑_{i: ϕ(v_i) = j} (r^1_i + r^2_i) ≤ R   ∀ j.   (3.2)
(iii) The total number of result writes in any cycle must not exceed the number of RF write ports:
∑_{i: ϕ(v_i) + x_i = j} w_i ≤ W   ∀ j.   (3.3)
(iv) The total number of operations scheduled in a cycle must not exceed the issue width of the processor:
∑_{i: ϕ(v_i) = j} 1 ≤ N   ∀ j.   (3.4)
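For concreteness, constraints (3.1)-(3.4) can be checked for a candidate schedule with a routine like the following. This is only a sketch: the dictionary-based encoding of the graph and the port-demand vectors is an assumption of this illustration, not the thesis data structures.

```python
from collections import Counter

def valid_schedule(phi, edges, r1, r2, w, x, R, W, N):
    """Check constraints (3.1)-(3.4) for a candidate schedule phi.
    edges: {(i, j): d_ij} meaning v_j depends on v_i with edge weight d_ij;
    r1/r2/w: operand-read and result-write indicators; x: operation delays."""
    # (3.1) dependency constraints
    if any(phi[j] < phi[i] + d for (i, j), d in edges.items()):
        return False
    reads, writes, issued = Counter(), Counter(), Counter()
    for v, t in phi.items():
        reads[t] += r1[v] + r2[v]          # (3.2) read-port demand per cycle
        writes[t + x[v]] += w[v]           # (3.3) write-port demand per cycle
        issued[t] += 1                     # (3.4) issue-width demand per cycle
    return (all(c <= R for c in reads.values()) and
            all(c <= W for c in writes.values()) and
            all(c <= N for c in issued.values()))
```

For example, with a single dependency of weight 2, the schedule {a: 0, b: 2} satisfies (3.1) while {a: 0, b: 1} violates it.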
To reduce the impact on cost, we reduce the demand for RF read and write ports by using the operands available in the SIRO buffers. As discussed in Chapter 2, if an operand is available in the SIRO buffers, it is not necessary to read that operand from the RF. Similarly, if all the uses of a result are read from SIRO buffers, we may avoid writing it to the RF. To decide whether an RF read or write can be avoided, the schedule information ϕ is required. In the graph G, each incoming dataflow edge of a vertex corresponds to an RF read, and each outgoing dataflow edge corresponds to an RF write. An RF read for operation v_i, associated with a dataflow edge e_ji = (v_j, v_i), can be avoided if
ϕ(v_i) − ϕ(v_j) ≤ d_ji + depth − 1,   (3.5)
where depth is the SIRO buffer depth. If an RF read is avoided, the corresponding value of r^1_i or r^2_i is set to zero. The RF write associated with v_i can be avoided (and w_i set to zero) if the write is not global and all the RF reads corresponding to dataflow edges starting from node v_i satisfy condition (3.5). Global writes are the essential writes and are determined using global liveness information (discussed in Section 3.4.1).
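The elision rules can be phrased directly in code. In the sketch below, the function and variable names are illustrative, not from the thesis toolchain; it applies condition (3.5) per edge and the all-consumers rule for writes:

```python
def siro_read_ok(phi, d, depth, src, dst):
    """Condition (3.5): the RF read for edge (src, dst) can be avoided when
    dst issues within the SIRO window of size `depth` after src's result."""
    return phi[dst] - phi[src] <= d[(src, dst)] + depth - 1

def rf_write_avoidable(phi, d, depth, src, consumers, is_global):
    """The RF write of src can be dropped only if the value is not globally
    live and every consumer can read it from the SIRO buffer."""
    return not is_global and all(
        siro_read_ok(phi, d, depth, src, c) for c in consumers)
```

With schedule {a: 0, b: 1, c: 3}, unit edge weights, and depth 2, the read a→b can come from the SIRO buffer but a→c cannot, so the write of a is avoidable only if c is not among its consumers.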
3.3 Proposed Scheduling and Binding Algorithm
We use list scheduling as the base algorithm. In standard list scheduling, the scheduler begins with
a ready list, a set of nodes ready to be scheduled, and schedules them in order of their priorities
considering resource constraints. The priority determines which operation should be scheduled
earlier when the number of ready operations is more than the resources available.
To maximize the number of reads from SIRO buffer, we consider a priority function specific to
the proposed architecture.
3.3.1 Scheduling Priority Function
In list scheduling, as soon as an operation executes, its successor operations become ready. If the number of ready operations is less than or equal to the number of available FUs, then all the ready operations are scheduled in the next cycle. If the dependencies are due to data-flow edges, at least one operand will then be available from the SIRO buffers, as condition (3.5) will be satisfied. By this argument, list scheduling provides a good framework for developing a scheduling/binding algorithm that optimizes SIRO usage.
If the number of operations available for scheduling is more than the available resources, the priority function decides which operations are scheduled in the current cycle. Scheduling the ready operations over multiple cycles may increase the difference between the production time and the consumption time of a value; the usage of SIRO buffers may therefore decrease. To maximize the number of SIRO reads, the priority function should also consider the availability of operands from the SIRO buffers.
In general, the priority function in list scheduling is performance driven. We propose two approaches to modify the current scheduling priority function. The first approach is conservative and the second is aggressive in increasing the number of reads from the SIRO buffers.
Two Step Priority Function
In this approach the primary priority function is performance driven: we use the distance from the sink node as the primary priority. A secondary priority criterion is used when ready operations have the same primary priority value. We call our secondary priority function the bypass priority. The bypass priority function returns a higher priority for operations getting more input operands from the SIRO buffers. Also, if an operation can still get its operands from the SIRO buffers in the following cycles, it has lower priority than operations that can get operands from the SIRO buffers only in the current cycle. The overall bypass priority P_by is defined as
P_by = ∑_{i ∈ inputs} P_by,i   (3.6)
where
P_by,i = t_c − t_p,i + 1  if t_c − t_p,i < depth,  and 0 otherwise,   (3.7)
and t_c and t_p,i are the current time step and the time step at which operand i is computed, respectively. This scheduling priority is conservative, as it does not affect the original scheduling function.
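Equations (3.6) and (3.7) translate into a few lines of Python. This is a sketch; the guard against operands that are not yet computed is an added assumption, since (3.7) implicitly presumes t_c ≥ t_p,i:

```python
def bypass_priority(t_c, producer_times, depth):
    """P_by per Eqs. (3.6)-(3.7): an operand contributes t_c - t_p + 1 while
    it is still inside the SIRO window (age < depth), so operands about to
    leave the window push their consumer toward earlier scheduling.
    producer_times lists t_p,i for each input operand."""
    return sum(t_c - t_p + 1 if 0 <= t_c - t_p < depth else 0
               for t_p in producer_times)
```

For depth 2 at cycle 5, operands produced in cycles 5, 4, and 3 contribute 1, 2, and 0, respectively, reflecting that the operand of age 1 is about to leave the SIRO window while the one of age 2 is already gone.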
Modification in Schedule Priority
In this approach, we increment the original schedule priority by δ if there is any possibility of getting operands from the SIRO buffers. The value of δ is chosen such that it just changes the order of scheduling. For example, if the distance from the sink node is the original priority function, then δ may be set to the minimum execution delay of any FU.
We found through experiments that this small change in priority gives a significant advantage in terms of the decrease in the number of reads and writes to the second level RF.
[Figure: reservation table layout — three sub-tables RT_FU (Issue × max schedule length), RT_R (R × max schedule length), and RT_W (W × max schedule length).]
Figure 3.1: Reservation table (Issue, R, and W are the numbers of issue slots, read ports, and write ports; the maximum schedule length is calculated for each compilation region).
3.3.2 RF Port Aware Scheduling and Binding
The RF port aware scheduling algorithm takes care of the new resource constraints. The concept of reservation tables is used to maintain the list of resources available in each cycle. In addition, our scheduling algorithm performs the operation-to-FU binding explicitly.
Algorithm 1 RPA_sched()
Input: G, M
Output: ϕ
1: t = 0
2: RT = init_reservation_table(M)
3: while all nodes are not scheduled do
4:   ready_list[] = get_ready_list(G, t)
5:   priority[] = get_priority(ready_list, G, M)
6:   bind_set = get_bind_set(ready_list, priority, RT, M, t)
7:   ϕ[bind_set] = t
8:   t = t + 1
9: end while
Algorithm 1 shows the pseudo code of the core RF port aware (RPA) scheduling. The inputs to the algorithm are an acyclic data flow graph G and the processor model M. M includes the number of issue slots, the number of RF read and write ports, the RF-FU interconnection network, and the operation-to-function-unit mapping X. First we initialize the cycle time t to zero and the reservation table (RT) to the available resources, which include the FUs, read ports, and write ports (lines 1–2). The reservation table records the availability of resources in each cycle. The structure of the reservation table is shown in Fig. 3.1. It has three parts: RT_FU, RT_R, and RT_W. RT_FU[f, t], RT_R[r, t], and RT_W[w, t] indicate the availability of FU f, read port r, and write port w, respectively, at time t. The bit corresponding to a resource is set when the resource is not available. The main loop finds and schedules operations in the current cycle (lines 3–9). For each cycle, a ready list is generated using the get_ready_list() function (line 4). priority holds the operation priority for each operation in the ready list (line 5). The operations to be scheduled in the current cycle and their mapping to function units are determined in the get_bind_set() function (Algorithm 2), which takes care of constraints (3.2), (3.3), and (3.4). The operations in bind_set are scheduled at the current cycle (line 7). Note that, for the purposes of the algorithm, we use ϕ as an array rather than a function, with the same semantics. After scheduling the operations of the current cycle, t is incremented and the loop is repeated until all nodes are scheduled.
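A self-contained, much simplified Python rendering of the Algorithm 1 loop is given below. It is a sketch only: FU binding and the conflict graph of get_bind_set() are replaced by a greedy port-accounting step, and all names and data-structure choices are illustrative assumptions, not the thesis implementation.

```python
from collections import defaultdict

def rpa_sched(preds, reads, writes, priority, n_issue, R, W, xdelay):
    """Toy RF-port-aware list scheduler in the spirit of Algorithm 1.
    preds[v] = [(u, d_uv), ...] gives dependencies with edge weights;
    reads[v]/writes[v] are RF port demands; xdelay[v] is the execution delay.
    The real get_bind_set() (Algorithm 2) is simplified to greedy checks."""
    phi = {}
    pending_writes = defaultdict(int)       # RT_W: write-port demand per cycle
    t = 0
    while len(phi) < len(preds):
        # ready list: unscheduled ops whose predecessors have completed
        ready = [v for v in preds if v not in phi and
                 all(u in phi and phi[u] + d <= t for u, d in preds[v])]
        ready.sort(key=lambda v: -priority[v])
        issued, rports = 0, 0
        for v in ready:                     # greedy stand-in for get_bind_set
            if (issued < n_issue and rports + reads[v] <= R and
                    pending_writes[t + xdelay[v]] + writes[v] <= W):
                phi[v] = t                  # schedule v in cycle t
                issued += 1
                rports += reads[v]
                pending_writes[t + xdelay[v]] += writes[v]
        t += 1
    return phi
```

With two read ports and two independent 2-read operations, only one operation is issued per cycle despite the issue width of two, illustrating how the read-port constraint (3.2) stretches the schedule.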
FU Binding and RF-FU Interconnections
In the underlying architecture we assume that an FU may perform different types of operations, and the set of operation types may differ from one FU to another (in other words, we may have heterogeneous FUs). Further, in direct interconnection, each RF port is shared by a specific set of FU ports, which implies that only one of these FUs can use the port at a time. Thus, the assignment of an FU to an operation may make other FUs unavailable due to path conflicts. FU binding in the case of direct interconnection needs to take care of both the heterogeneous FUs and the RF port sharing, while in the case of complete interconnect, FU binding needs to take care of only the heterogeneous FUs.
Let X be the set of all operation types and X_i be the set of operation types that can be executed by the ith FU. A function T : V → X defines the operation types associated with the nodes.
The binding problem can be defined as an integer labelling ψ : V → Z+ such that
(i) Each operation is bound to an FU on which it can be executed:
T(v_i) ∈ X_ψ(v_i)   ∀ i.   (3.8)
(ii) No two operations have the same schedule time as well as the same binding, i.e., an FU cannot be used by two operations at the same time:
(ϕ(v_i), ψ(v_i)) ≠ (ϕ(v_j), ψ(v_j))   ∀ i, j, i ≠ j.   (3.9)
(iii) No two FUs access the same RF port in the same cycle:
∑_{i: ϕ(v_i) = k} (r^1_i P^1_{j,ψ(v_i)} + r^2_i P^2_{j,ψ(v_i)}) ≤ 1   ∀ j, k.   (3.10)
∑_{i: ϕ(v_i) + x_i = k} w_i P^w_{j,ψ(v_i)} ≤ 1   ∀ j, k.   (3.11)
To satisfy these constraints, we need a solution that considers all possible OP-FU mappings in a cycle and then decides the optimal set of mappings, i.e., the set that binds the maximum number of operations in a cycle. We propose a conflict graph based heuristic, in which the conflicts of all possible mappings are found and binding is done on the basis of least conflict.
A node in the conflict graph is a tuple <v_i, f_l> containing an operation and an FU slot, present if T(v_i) ∈ X_l. Edges in the conflict graph represent conflicts. There is an edge between nodes <v_i1, f_l1> and <v_i2, f_l2> if
(i) both operations are mapped to the same FU, i.e., l1 = l2;
(ii) both FUs access the same read port, i.e., ∃ j such that (r^1_i1 P^1_{j,l1} or r^2_i1 P^2_{j,l1}) and (r^1_i2 P^1_{j,l2} or r^2_i2 P^2_{j,l2});
(iii) both operations write to the RF using the same port at the same time, i.e., ∃ j such that w_i1 P^w_{j,l1} = 1, w_i2 P^w_{j,l2} = 1, and x_i1 = x_i2.
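The three edge conditions can be written down directly. In the sketch below the names are illustrative and the matrices are assumed to be stored as Python lists indexed [RF port][FU index]; conflict(n1, n2, ...) returns True exactly when the conditions above put an edge between two conflict-graph nodes:

```python
def conflict(n1, n2, P1, P2, Pw, r1, r2, w, x):
    """Edge test for the binding conflict graph; a node is (op index, FU index).
    P1/P2/Pw are the direct-interconnection matrices, r1/r2/w the operand
    indicators, x the operation delays (lists indexed by op)."""
    (i1, l1), (i2, l2) = n1, n2
    if l1 == l2:                                   # (i) same FU
        return True
    uses_read = lambda i, l, j: (r1[i] and P1[j][l]) or (r2[i] and P2[j][l])
    for j in range(len(P1)):                       # (ii) shared read port
        if uses_read(i1, l1, j) and uses_read(i2, l2, j):
            return True
    if x[i1] == x[i2]:                             # (iii) same write port, time
        for j in range(len(Pw)):
            if w[i1] and Pw[j][l1] and w[i2] and Pw[j][l2]:
                return True
    return False
```

For a two-FU, two-port direct interconnect where each read port is shared between the left port of one FU and the right port of the other, two single-operand operations on different FUs do not conflict, but they do as soon as one of them also uses its right input port.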
Using the conflict graph, the binding algorithm binds the least conflicting FU to each operation.
Algorithm 2 get_bind_set()
Input: ready_list, priority, RT, M, t
Output: bind_set, ψ
1: bind_set = φ
2: confl_graph = build_conflict_graph(ready_list, RT, M)
3: for v ∈ ready_list in priority order do
4:   for f ∈ FU in increasing order of conflict do
5:     if resource_available(v, f, RT, t) = 1 then
6:       ψ[v] = f
7:       bind_set.add(v)
8:       update_res_table(v, f, RT, t)
9:       update_conflict_graph(v, confl_graph)
10:    end if
11:  end for
12: end for
The pseudo code of the binding algorithm (the get_bind_set() function) is shown in Algorithm 2. The binding conflict graph is built on the basis of the resource requirements of the operations and the FUs to which the operations can be mapped (line 2). Binding of the ready operations is done in priority order (lines 3–12). All the FUs to which an operation can be mapped are considered in increasing order of conflict (line 4); this ordering is computed from the binding conflict graph. If all the resources required (RF read and write ports) to execute v on the selected FU are available in the reservation table, ψ is set (lines 5–6), and the operation is added to bind_set (line 7). The availability of the FU, read ports, and write ports is marked in the reservation tables based on usage (line 8), i.e.,
RT_FU[f, t] = 1,
RT_R[j, t] = 1 if r^1_v P^1_{jf} or r^2_v P^2_{jf}, ∀ j,
RT_W[k, t + x_v] = 1 if w_v P^w_{kf}, ∀ k.
In the conflict graph update (line 9), the nodes related to the selected operation and the nodes that now conflict with it are removed, together with their edges, to satisfy constraints (3.9), (3.10), and (3.11).
[Figure: example DFG with operations OP1–OP14 (ADD, SUB, LOAD, CMPP, SHR, MUL, INR, ST, BR), RF reads R1–R8, and edge weights of 0 and 1.]
Figure 3.2: Example data-flow graph.
FU 1 (X1): INT, -, MEM, -
FU 2 (X2): INT, FLOAT, -, -
FU 3 (X3): INT, -, MEM, -
FU 4 (X4): INT, -, -, BRANCH
Table 3.1: Types of operations that can be executed on each function unit.
Example
We illustrate our scheduling and binding algorithm with the help of the example data flow graph (DFG) shown in Fig. 3.2. Each node in the graph represents an operation, and the different node colors/shades indicate different FU types. The edges between the nodes are data dependency or control dependency edges. For a simplified view of the DFG, memory dependency, output dependency, and anti-dependency edges are not shown. Each edge is associated with an edge weight that signifies
3.3 Proposed Scheduling and Binding Algorithm
Figure 3.3: Operations binding example.
the minimum time interval between the schedule times of the two nodes. The figure also shows the RF reads of each operation explicitly in circles (labeled R1, R2, etc.).
We schedule this graph for three cases:
(i) Architecture with reduced read ports and complete interconnection.
(ii) Architecture with reduced read ports and direct interconnection.
(iii) Architecture with reduced read and write ports, and complete interconnection.
All the above three cases use a 4 issue width processor. The types of operations that can be
performed by each issue slot are shown in Table 3.1.
Architecture with reduced read ports and complete interconnection We schedule the DFG for a four issue VLIW architecture with a 4 read and 4 write port RF and complete interconnect. With complete interconnection, the only conflict is due to heterogeneous function units; therefore, the conflict value of an FU–OP tuple is the number of other ready operations that can be mapped to that FU.
In the first cycle, six operations are available in the ready list (shown in Fig. 3.3). The figure shows the ready operations in priority order, the conflict values of the FUs for the highest priority operation ready for scheduling, and the reservation table. In the reservation table, write port resources are not shown, as they are not constrained in the current example.
The ready operations are bound in order of their priority (line 3, get_bind_set()) to the least conflicting FU. Since three of the ready operations are of MEM type, which can be mapped only on FU1 and FU3, the conflict values for FU1 and FU3 are high and those for FU2 and FU4 are low. OP1, being the first ready operation, is bound to the least conflicting FU2, and the resources required for OP1 (FU2 and two read ports) are removed from the reservation table. The conflict graph is also updated after the OP1–FU2 binding. In the same way, OP2 is bound to FU4, and OP3, being a memory operation, is mapped to FU1. None of OP4, OP5, and OP6 could be scheduled in the first cycle due to unavailability of read port resources.
In cycle 2, due to the availability of operands from the SIRO buffers, OP7, OP8, and OP9 do not require any register file read; therefore, OP7, OP8, OP9, and OP4 are scheduled in this cycle. Similarly, in the third cycle, the available operations are OP10, OP11, OP5, and OP6. Due to the availability of operands in the SIRO buffers, the number of read ports required is 4 instead of 7, and all 4 operations can be scheduled in this cycle. In the last cycle, the remaining three operations are scheduled. The resulting schedule is shown in Fig. 3.4. We observe that in the example DFG, a 50% reduction in RF read ports did not lead to performance degradation, due to the availability of operands in the SIRO buffers.
Architecture with reduced read ports and direct interconnection We consider a 4 issue width
processor with 4 read and 4 write ports. The direct interconnect is as shown in Fig. 2.16(a). Consider
the operations of cycle 3 in the scheduled graph shown in Fig. 3.4.
The conflict graph corresponding to the possible mappings is shown in Fig. 3.5. Solid edges in Fig. 3.5(a) show path conflicts and dotted edges in Fig. 3.5(b) show FU conflicts. Observing the scheduled graph in Fig. 3.4, we notice that OP10 and OP5 require only the left operand from the RF. OP11 gets both its operands from SIRO buffers, and OP6 requires
Figure 3.4: Schedule for the 4 issue slot VLIW processor with 4 read port and 4 write port RF.
both of its operands from the RF. Based on this resource requirement, the resource information given by the interconnection matrix (Fig. 2.16(a)), and the type of operations that can be performed by each FU (Table 3.1), the conflict graph is constructed. The overall binding conflict graph is formed by the superposition of Figs. 3.5(a) and 3.5(b), and the overall conflict at a node is the sum of its edges.
Using this conflict graph, operations are bound to FUs on the basis of minimum conflict. In order of priority, OP5 is considered first. For OP5, FU1 and FU3 are the least conflicting function units, and we choose FU1 as it is the first available FU. After this binding, the OP5–FU1 node, along with all conflicting mappings and the other nodes related to OP5, is pruned from the graph. The edges of the pruned nodes are also removed. The resulting binding conflict graph is shown in Fig. 3.6(a); in this graph, the edges due to path conflicts and FU conflicts are drawn together.
The next operation in priority is OP6. Fig. 3.6(a) shows that for OP6, FU2 and FU3 have equal conflict values. FU2 is selected as it is the first available FU, and the graph is pruned as was done for OP5. The resulting graph is shown in Fig. 3.6(b). Next, OP10 is bound to FU4 and OP11 to FU3
Figure 3.5: Example binding conflict graph for cycle 3 of scheduled graph in Fig. 3.4. (a) Bind graph due to RF port sharing. (b) Bind graph due to heterogeneous FUs.
without any conflict.
Architecture with reduced read and write ports, and complete interconnection In this case we schedule the subject graph for a 4 issue width processor with a 4 read port and 3 write port RF. The resulting schedule (shown in Fig. 3.7) takes 5 cycles instead of 4. The scheduler conservatively assumes that no RF write can be avoided, so it reserves resources for all RF writes. Consequently, in
Figure 3.6: Example binding conflict graph for cycle 3 of scheduled graph in Fig. 3.4 after binding of OP5 and OP6. (a) Bind graph after binding of OP5. (b) Bind graph after binding of OP6.
all the schedule cycles, a maximum of three RF writes is scheduled.
3.3.3 Iterative Schedule Improvement
The RPA scheduling described above takes care of all the read/write port and interconnection constraints. The write port resource of an operation is reserved in the current cycle, since write avoidance can only be determined in future cycles. Therefore, the resulting schedule does not benefit from the fact that writes can be avoided due to operand reads from the SIRO buffers.
We reschedule the output of the RPA_sched algorithm, since the resources freed by avoided writes are known only after scheduling. We list the operations that can be scheduled in earlier cycles due to the availability of write ports. An operation is rescheduled if all its resources are available and the other constraints are satisfied. For example, in Fig. 3.7, in the second clock step, resources are available to schedule OP4. With the vacancy created by OP4 in the third cycle, OP5 and OP6 are rescheduled to cycle 3. The new schedule is then the same as the schedule of Fig. 3.4. In this case we get the optimum schedule in a single iteration; in general, the improvement procedure can be repeated until no further improvement is seen.
Figure 3.7: Schedule for the 4 issue slot VLIW processor with 4 read port and 3 write port RF.
Algorithm 3 is the pseudo code of the Im-RPA scheduling algorithm that performs iterative schedule improvement. First, the reservation table of each resource is initialized in accordance with the input scheduled graph (line 1). All the operands for which the write can be avoided are found, and their respective write port resources are freed from the reservation tables (lines 3–5).
For each schedule cycle, moveup_list is calculated (line 7). moveup_list is the list of those operations which can be scheduled in the current cycle. All the operations in the moveup_list are checked for resource constraints. If the required resources are available, the operation is scheduled and bound, and the reservation table is updated (lines 10–14). By the end of the loop (lines 6–16), we have a new schedule. If the new schedule has a smaller schedule length, the whole process is repeated, as long as the schedule length keeps reducing.
Although this algorithm is iterative in nature, convergence is guaranteed and the maximum num-
ber of iterations is less than the initial schedule length achieved by RPA sched. This can be seen as
Algorithm 3 Im-RPA_sched()
Input: V, M
Output: V
1: RT = init_reservation_tables(V, M)
2: repeat
3:   for each v ∈ V do
4:     remove_available_write_port(RT, v)
5:   end for
6:   for t = 0 to schedule_length do
7:     moveup_list = find_ready_moveup_op(V, t)
8:     for each i ∈ moveup_list do
9:       fu_pos = get_fu_position(i, V, M, t)
10:      if fu_pos > 0 then
11:        ϕ[i] = t
12:        ψ[i] = fu_pos
13:        update_res_table(i, fu_pos, RT, t)
14:      end if
15:    end for
16:  end for
17: until schedule_length reduction = 0
follows. In the first iteration, the schedule for cycle 0 is finalized, since all possible move-up operations have been considered for scheduling in cycle 0. Similarly, after the k-th iteration, the schedule for cycles 0–k does not change. Thus the maximum number of iterations is always less than the initial schedule length of the graph.
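A toy Python sketch of one move-up pass may make the idea concrete. It is a deliberate simplification of Im-RPA, hedged as follows: only write-port availability is modeled (the constraint this pass actually relaxes), and the data layout (`cycle_of`, `preds`, `writes_in`) is invented for illustration, not taken from the thesis implementation.

```python
def move_up_pass(cycle_of, preds, writes_in, max_writes):
    """cycle_of: {op: scheduled cycle}; preds[op]: ops whose results op needs;
    writes_in: {cycle: RF writes reserved in that cycle}; max_writes: ports."""
    for op in sorted(cycle_of, key=cycle_of.get):      # scan in cycle order
        # earliest cycle allowed by data dependences
        earliest = 1 + max((cycle_of[p] for p in preds.get(op, [])), default=-1)
        for t in range(earliest, cycle_of[op]):
            if writes_in.get(t, 0) < max_writes:       # a write port is free at t
                writes_in[t] = writes_in.get(t, 0) + 1
                writes_in[cycle_of[op]] -= 1           # release port in old cycle
                cycle_of[op] = t                       # move the operation up
                break
    return cycle_of
```

In the Fig. 3.7 scenario this kind of pass is what pulls OP4 into cycle 2 once a freed write port is discovered there; repeating the pass until the schedule length stops shrinking gives the outer repeat–until loop of Algorithm 3.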
3.4 Additional Compiler Support
3.4.1 Identifying Global and Local Reads and Writes
To inhibit RF reads and writes, the compiler, during its analysis phase, identifies the operands which are read from the SIRO buffers and the writes that can be avoided. This analysis is done in two stages. In the first stage, we mark the results that must be written to the RF. In the second stage, the operands for which the RF read/write is avoided are marked. The analysis is done at the level of the compilation region, which is a basic block in the simplest case.
In the first stage we mark the global writes. If a result is read by at least one operation of any other
SIRO Code
0: ADD.1 s1_1, Abase, y;  ADD.2 s2_1, Bbase, z;  ADD.3 s3_1, Cbase, x;
1: LOAD.1 s1_1, s1_1;  LOAD.2 s2_1, s2_1;
2: ADD.1 C, s1_1, s2_1;
3: STORE.1 s1_1, s3_3;
Figure 3.8: Example: register renaming.
basic block, then it is a global write. All the operands marked as global writes must be written to the RF. We use the standard compiler liveness analysis to determine these essential writes. For each basic block, the live-out operands are available from the liveness analysis. All the operations in the basic block are iterated in reverse order; if the result of an operation is in the live-out list, it is marked as a global write. If an operand in the live-out list is already marked as a global write, earlier instances of the same operand in the basic block are not marked.
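The reverse-order marking just described can be sketched as follows. This is a minimal illustration assuming a flat per-block operation list and a precomputed live-out set; the function name and data shapes are hypothetical, not Trimaran/Elcor APIs.

```python
def mark_global_writes(block_ops, liveout):
    """block_ops: [(dest, srcs), ...] in program order; liveout: operands
    live after the block. Returns indices of ops marked 'global write'."""
    pending = set(liveout)
    marked = []
    for idx in range(len(block_ops) - 1, -1, -1):   # reverse program order
        dest, _ = block_ops[idx]
        if dest in pending:
            marked.append(idx)        # last definition of a live-out operand
            pending.discard(dest)     # earlier definitions stay unmarked
    return sorted(marked)
```

Only the final definition of each live-out operand is marked; every earlier definition of the same operand is overwritten inside the block, so its RF write is a candidate for avoidance.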
Similarly, global reads may be identified in this phase but that information is not useful for finding
essential reads. The second stage of determining RF reads and writes is done during scheduling as
explained in the previous section.
3.4.2 Code Generation: Register Renaming
For correct code generation, registers are renamed to the SIRO registers. The total number of registers given to the register allocation algorithm is the number of registers in the second level RF. After register allocation and post-pass scheduling, the registers corresponding to the operands available from the SIRO buffers are renamed to the corresponding SIRO registers. Note that no change is made to the write register address; thus the register value is still updated in the register file and can be used by other operations in subsequent cycles. However, when the result of an operation is marked as 'avoid RF write', the address of the destination register is also replaced by the corresponding SIRO register.
An example of code after register renaming is shown in Fig. 3.8 (reproduced from Fig. 2.11). In the example, the operands named with the prefix 's' are SIRO registers.
3.5 Summary
In this chapter we defined the role of the compiler for the proposed architecture and proposed a novel scheduling and binding algorithm. Our algorithm maximizes the number of SIRO reads and minimizes the performance loss due to reduced ports and direct interconnect. A binding conflict graph was introduced for FU binding, and iterative scheduling was proposed for reduced write port RF architectures.
4 Performance and Energy Models
In previous chapters we have proposed an architecture based on local buffers and reduced port RF.
Local buffers save energy by avoiding RF reads and writes without impacting the performance if the
size of the RF is unchanged. Port reduction further saves energy by reducing energy per RF access.
However, the number of execution cycles may increase in this case, leading to some performance
degradation.
In this chapter we theoretically model the performance and energy of applications on the local buffer based reduced port RF architecture. The base architecture is the HPL-PD architecture [Kathail et al., 2000]. Though HPL-PD is a completely parametrized architecture, the only parameters considered by our model are the issue width, the number of read ports, and the number of write ports. Apart from the architecture, the other input to the model is the application. First the application is characterized; then the application characteristics are used for estimating the performance and energy of the proposed architecture. The input and output of our modeling framework are shown in Fig. 4.1.
First we describe the performance and energy model for the reduced port architecture for a fixed issue-width processor; in the next section we generalize it to any issue width.
4.1 Model for Fixed Issue-width Processor
In this section we model the performance of a VLIW architecture with a shared port RF and compute the additional execution cycles required due to port reduction, keeping the issue width of the processor fixed. The model needs to account for the fact that the RF port requirement is reduced due to the availability of some values in the SIRO buffers. In a shared port RF architecture, in spite of these
Figure 4.1: Basic block diagram of the ILP model (inputs: architecture and application; outputs: performance and energy).
reductions in port usage, certain instructions may still require more ports than are available. Such instructions may require additional cycles for getting rescheduled.
4.1.1 Performance Model
For an N issue processor, the unconstrained RF has 2N read ports and N write ports. If the read and write ports of the register file in such a processor are limited to k and m respectively, the instructions which require more than k read ports or m write ports need to be rescheduled such that the port constraints are met. The number of cycles required to schedule these instructions under the read port constraint can be estimated as the total number of reads in these instructions divided by k. The number of cycles due to the write constraint can be estimated similarly.
For the above computation, we use two vectors R and W of length 2N and N, respectively. R_i (or W_i) denotes the number of cycles using i read ports (or i write ports) in the execution of an application on the architecture with unconstrained RF ports. The additional cycles due to the read port constraint (Cycle+_read) and the write port constraint (Cycle+_write) can be estimated as:
Cycle+_read = (∑_{j=k+1}^{2N} R_j ∗ j) / k − ∑_{j=k+1}^{2N} R_j.    (4.1)

Cycle+_write = (∑_{j=m+1}^{N} W_j ∗ j) / m − ∑_{j=m+1}^{N} W_j.    (4.2)
The additional cycles due to both the read port constraint and the write port constraint are estimated by adding the two:

Cycle+ = Cycle+_read + Cycle+_write.    (4.3)

The total number of cycles is

C = ∑_{j=0}^{2N} R_j + Cycle+.    (4.4)
For an application whose characteristic vectors R and W have been computed once, the model can be used to find the approximate execution cycles for a processor of the same issue width with any number of RF read and write ports. Note that the model is an approximation of the actual scheduling process and can therefore overestimate or underestimate performance. The actual performance depends on many factors, such as the data dependencies in the application, the quality of scheduling and binding, and the interconnection topology for the shared port RF.
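Equations (4.1)–(4.4) transcribe directly into a few lines of Python. This sketch follows the equations literally (fractional extra cycles, as in the model); the function names are illustrative and the vectors `R` and `W` are indexed so that `R[i]` counts cycles using i read ports.

```python
def extra_cycles(vec, ports):
    """Equations (4.1)/(4.2): cycles added when per-cycle port usages above
    `ports` are re-spread over the available ports."""
    over = range(ports + 1, len(vec))
    return sum(vec[j] * j for j in over) / ports - sum(vec[j] for j in over)

def total_cycles(R, W, k, m):
    """Equations (4.3)/(4.4): total cycles with k read and m write ports."""
    cycle_plus = extra_cycles(R, k) + extra_cycles(W, m)   # eq. (4.3)
    return sum(R) + cycle_plus                             # eq. (4.4)
```

For instance, a characterization with one 4-read cycle and a 2-read-port RF spreads those 4 reads over 2 ports, replacing one cycle by two and adding one cycle overall.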
4.1.2 RF Energy Model
Energy consumed by RF is the sum of the dynamic energy and leakage energy spent by it. Dynamic
energy depends on activity and can be calculated by counting the number of RF accesses and energy
per access. Leakage energy depends on leakage current and time of execution. Leakage current is
usually fixed for a supply voltage. Execution time is the number of cycles multiplied by cycle time.
Thus, the RF energy E_rf is

E_rf = E_read ∗ N_read + E_write ∗ N_write + P_rf_leak ∗ t ∗ C.    (4.5)
where E_read, E_write, N_read, N_write, P_rf_leak, t, and C are the energy per read access, energy per write access, number of RF reads, number of RF writes, RF leakage power, clock period, and number of cycles, respectively.
N_read and N_write are the numbers of RF reads and writes after avoiding the redundant reads and writes due to operands available in the SIRO buffers. They can be calculated from the characteristic vectors R and W:

N_read = ∑_{j=1}^{2N} R_j ∗ j.    (4.6)

N_write = ∑_{j=1}^{N} W_j ∗ j.    (4.7)
The total number of operand reads and result writes in an application is fixed. However, the number of operand reads from the SIRO buffers may change with the number of read and write ports, as the schedule gets elongated with the reduction in ports. We consider this change in N_read and N_write a second-order effect and ignore it in the computation of RF energy.
The total number of cycles, C, can be calculated from equation (4.4). The RF access energy for each read and write can be calculated using Cacti 4.0 [Tarjan et al., 2006], an analytical SRAM model that estimates area, power, energy, and access time. Being analytical, Cacti is easily integrated into our model.
The RF energy depends on the energy per access and the number of cycles. The energy per access reduces, whereas the number of execution cycles C increases, with a decrease in the number of ports. Thus, in equation (4.5), the first two terms decrease the RF energy and the third increases it as the number of ports reduces. Overall, the RF energy reduces because the first two terms dominate the third. Moreover, the scheduler attempts to minimize the increase in the number of cycles, which minimizes the effect of the third term.
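Equations (4.5)–(4.7) combine into one small function. In a real flow `e_read`, `e_write`, and `p_leak` would come from Cacti; here they are simply parameters, and the function name is illustrative.

```python
def rf_energy(R, W, e_read, e_write, p_leak, t_clk, cycles):
    """RF energy per eq. (4.5), with access counts from eqs. (4.6)/(4.7)."""
    n_read = sum(j * R[j] for j in range(1, len(R)))     # eq. (4.6)
    n_write = sum(j * W[j] for j in range(1, len(W)))    # eq. (4.7)
    # eq. (4.5): dynamic read + dynamic write + leakage over execution time
    return e_read * n_read + e_write * n_write + p_leak * t_clk * cycles
```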
4.2 Modeling of a Generic Processor
A generic processor may have any number of function units and any number of read and write ports in the RF. The performance of an application depends both on the resources present in the architecture and on the parallelism in the application. Before moving ahead, we recall the definitions of instruction level parallelism (ILP) given by Gangwar [2005]. The parallelism present in an application without any hardware or compiler constraints is the available ILP. The parallelism present for a given compiler without hardware constraints is the achievable-S ILP. The parallelism present with no compiler constraints for a given hardware is the achievable-H ILP. The achieved ILP is the parallelism obtained with a specific hardware configuration and a given compiler; it is the result of the interaction between achievable-S ILP and achievable-H ILP.
4.2.1 Performance Model
Jouppi [1989] suggested a first order approximation for calculating the achieved ILP: the achieved ILP of an application is the minimum of the achievable-H ILP and the achievable-S ILP, as written in (4.8).

Achieved ILP = MIN(Achievable-H ILP, Achievable-S ILP).    (4.8)

He further observed that, due to non-uniform parallelism, the achieved ILP is less than the achievable-S ILP even when the achievable-H ILP equals the achievable-S ILP. Noonburg and Shen [1994] gave a theoretical performance model based on Jouppi's model. Their model uses the control and data parallelism distributions of the application, and the fetch, issue, and branch parallelism of the architecture, to find the achieved ILP. We use Noonburg's model for deriving performance estimates for a generic VLIW processor.
The parallelism in an application with a given input is described by a parallelism vector x:

x = [ x_1  x_2  x_3  … ],    (4.9)

where x_i is the number of cycles having parallelism of degree i. Using the parallelism vector, the ILP can be calculated as

Achievable-S ILP = (∑_i x_i ∗ i) / (∑_i x_i).    (4.10)
Effect of Limiting the Number of FUs
When we limit the number of issue slots to k (we assume k uniform FUs in a k issue width processor), we assume:
• The cycles that have parallelism greater than or equal to k will be limited to parallelism of k, and the corresponding operations are redistributed with parallelism of degree k.
• The cycles that have parallelism less than k are not affected.
Thus the new elements of the parallelism vector are

x′_i = x_i                          if i < k,
x′_i = (∑_{j=k}^{∞} x_j ∗ j) / k    if i = k,
x′_i = 0                            if i > k.    (4.11)
The achievable ILP is calculated from the modified parallelism vector using equation (4.10). The number of additional cycles due to the issue width limitation, Cycle+_issue, is

Cycle+_issue = ∑_i x′_i − ∑_i x_i.    (4.12)

Using equations (4.10) and (4.11), we can write Cycle+_issue as:

Cycle+_issue = (∑_{j=k}^{∞} x_j ∗ j) / k − ∑_{j=k}^{∞} x_j.    (4.13)
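The clipping transformation of equations (4.11)–(4.13) can be sketched as a short function. It is an illustration, not the thesis code: the vector is a plain Python list indexed by parallelism degree, and the function name is invented.

```python
def clip_parallelism(x, k):
    """x[i] = cycles with parallelism of degree i (x[0] unused).
    Returns the clipped vector x' of eq. (4.11) and Cycle+_issue of eq. (4.13)."""
    spill = sum(x[j] * j for j in range(k, len(x)))   # ops in cycles of degree >= k
    clipped = x[:k] + [spill / k]                     # x'_k = spill / k
    extra = spill / k - sum(x[j] for j in range(k, len(x)))
    return clipped, extra
```

Note that `extra` equals the schedule-length difference `sum(clipped) - sum(x)`, which is exactly the equivalence between equations (4.12) and (4.13).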
The model based on the above assumptions closely follows the simulation results, except in a few cases. The reasons for the anomalies are as follows. First, the cycles counted in the vector are distributed over the whole application, while the calculated value of x′_k assumes them to be lumped together. This may lead to an underestimation of cycles and an overestimation of ILP. Second, it is assumed that cycles with parallelism of degree less than k are not affected. This may not always be true: the increase in schedule length caused by the issue width limitation increases the slack of operations in low-ILP cycles. The added slack may increase the parallelism of those cycles and lead to an underestimation of the ILP.
Effect of Limiting the Number of RF Ports
To account for the effect of reduced read and write ports, similar parallelism vectors are defined. We define different read vectors for local and non-local reads. All the reads from the local buffers are considered local reads, while reads from the second level RF are considered non-local reads. local_read_i and nonlocal_read_i denote the number of cycles having i simultaneous reads from the SIRO buffers and the RF, respectively. Similarly, local_write_i is the number of cycles in which i simultaneous RF writes are avoided due to the local buffers, and nonlocal_write_i is the number of cycles with i simultaneous writes to the second level RF.
Due to a port or FU constraint, the schedule may become longer, and some operands may then not be available in the local buffers. This may turn some local reads into non-local reads, and similarly local writes into non-local writes. The number of additional non-local reads (writes) is taken to be directly proportional to the total local reads (writes) and to the increase in execution time due to the port or FU constraint.
Thus:

nonlocal_read+ = α ∗ Cycle+ ∗ ∑_{i=0}^{∞} i ∗ local_read_i,    (4.14)

nonlocal_write+ = β ∗ Cycle+ ∗ ∑_{i=0}^{∞} i ∗ local_write_i.    (4.15)
The proportionality constants α and β have to be determined empirically. Cycle+ is the total number of additional cycles due to the port or FU constraints. Since we do not yet know the additional cycles due to the port constraints, we substitute the value of Cycle+_issue as a first approximation. Once we compute the additional cycles due to the port constraints (4.20), we use that value to recompute nonlocal_read+ and nonlocal_write+.
The increase in the number of execution cycles due to the read port constraint depends on (a) the non-local reads having parallelism of more than k, and (b) the additional non-local reads. The additional cycles can be calculated as:
Cycle+_read = (∑_{i=k}^{∞} i ∗ nonlocal_read_i + nonlocal_read+) / k − ∑_{i=k}^{∞} nonlocal_read_i.    (4.16)
Similarly, the additional cycles due to limited number of write ports are calculated as
Cycle+_write = (∑_{i=m}^{∞} i ∗ nonlocal_write_i + nonlocal_write+) / m − ∑_{i=m}^{∞} nonlocal_write_i.    (4.17)
Combined effect of port and FU limitation
The number of execution cycles increases when we limit the issue width of the processor to k. A similar increase may occur when we limit the read ports to 2k or the write ports to k. To account for the additional cycles due to the read port constraint alone, Cycle+_read_only, we subtract the additional cycles due to the issue width constraint from the additional cycles due to the read port constraint. If Cycle+_read is less than Cycle+_issue, the number of additional cycles due to the read port constraint alone is zero, i.e.,
Cycle+_read_only = MAX(0, Cycle+_read − Cycle+_issue).    (4.18)
Similarly, the number of additional cycles due to the write port constraint alone, Cycle+_write_only, is

Cycle+_write_only = MAX(0, Cycle+_write − Cycle+_issue).    (4.19)
Total additional cycles, Cycle+, is the sum of additional cycles due to each constraint.
Cycle′+ = Cycle+_issue + Cycle+_read_only + Cycle+_write_only.    (4.20)
Since the values of nonlocal_read+ and nonlocal_write+ depend on the additional cycles due to the port and FU constraints, the procedure for calculating Cycle+ is iterative: we substitute Cycle′+ for Cycle+ in equations (4.14) and (4.15) and recalculate Cycle′+ until Cycle′+ equals Cycle+.
The achieved ILP is calculated as

Achieved ILP = (∑_i x_i ∗ i) / (∑_i x_i + Cycle+).    (4.21)
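The fixed-point iteration around equations (4.14)–(4.20) can be sketched as follows. This is a deliberately reduced illustration: only the read side is shown (writes are symmetric), and `recompute` stands in for the whole chain of equations (4.16), (4.18), and (4.20); all names are invented for the sketch.

```python
def fixed_point_cycles(cycle_issue, local_reads, alpha, recompute):
    """recompute(nonlocal_read_plus) -> new Cycle'+ via eqs. (4.16)-(4.20);
    local_reads = sum_i i * local_read_i; alpha is the empirical constant."""
    cycle_plus = cycle_issue                              # first approximation
    while True:
        nl_read_plus = alpha * cycle_plus * local_reads   # eq. (4.14)
        new_cycle = recompute(nl_read_plus)
        if new_cycle == cycle_plus:                       # converged
            return cycle_plus
        cycle_plus = new_cycle
```

Starting from Cycle+_issue, each pass converts the current Cycle+ estimate into extra non-local reads and back into a refined Cycle′+ until the two agree.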
4.2.2 RF Energy Model
We use the energy model discussed in Sec. 4.1.2. In the generic processor model, the number of RF reads and writes also changes with the number of read and write ports. From the discussion in Sec. 4.2.1, the numbers of reads and writes for the reduced port RF architecture can be calculated as

N_read = nonlocal_read+ + ∑_i i ∗ nonlocal_read_i.    (4.22)

N_write = nonlocal_write+ + ∑_i i ∗ nonlocal_write_i.    (4.23)
In the case of only an FU constraint, the above equations can be used to calculate N_read and N_write, taking the number of read and write ports as 2N and N for an N issue processor. The number of cycles C can be calculated as

C = ∑_i x_i + Cycle+.    (4.24)
The RF energy is computed using these values and values from Cacti in equation (4.5).
4.3 Summary
In this chapter we proposed a theoretical performance model for reduced port RF architectures. The model uses architecture parameters and application characteristics to estimate the performance. The model for a fixed issue processor characterizes the application by executing it on a processor of the same issue width but without port constraints. The generic model characterizes the application by executing it on a very high issue width processor. Both models are based on the observation that the parallelism in an application is non-uniform. Further, the RF energy can be modeled by estimating the number of reads and writes to the RF.
The model's predictions can be used in various ways. They allow the performance of an architecture to be predicted without compilation and simulation; thus our model can be used in the early phases of design space exploration. The modeling also gives insight into the performance behavior of reduced port RF architectures.
5 Model Validation and Evaluation of the
Proposed Architecture
In the previous chapters we proposed an architecture and compiler algorithms for energy and performance optimization in VLIW processors. This chapter discusses the experiments performed to substantiate the claims. Section 5.1 discusses the implementation framework and experimental setup. The rest of the chapter discusses the effect of the proposed architecture and compiler techniques on area, number of avoided RF reads and writes, performance, and energy, in Sections 5.2, 5.3, 5.4, and 5.5, respectively.
5.1 Implementation Framework
The proposed algorithms for operand analysis, scheduling, and binding are implemented in the Trimaran compiler framework [Chakrapani et al., 2005], an open source compiler for instruction level parallel architectures. The front end of Trimaran is IMPACT [Chang et al., 1991], developed at the University of Illinois, Urbana-Champaign. IMPACT takes a 'C' program as input and performs various high level compiler optimizations to extract parallelism. The back end of Trimaran is Elcor, developed at HP Labs. Elcor performs architecture specific compiler optimizations, scheduling, and register allocation. The output of Elcor is read by the simulator 'Simu', which emulates applications on the HPL-PD architecture. The simulator gives the number of execution cycles and other execution statistics. The processor is described in a high level machine description language (HMDES), which is the input to both Elcor and Simu.
Figure 5.1: Experiment framework (IMPACT front end: C parsing, function inlining, profiling, region formation, classical optimization; Elcor back end: machine level optimization, DFG formation, scheduling, register allocation; the machine description feeds both Elcor and the simulator, which produces performance statistics).
We augmented the HPL-PD architecture description in HMDES with the information required by our proposed compiler algorithms. The additional information provided is the number of read and write ports, the type of RF–FU interconnection, the RF–FU interconnection matrix, and the depth of the SIRO buffers. The simulator was augmented to provide additional statistics related to RF reads and writes. To avoid the effects of a limited number of registers in the register files, such as register spilling, we simulated the scheduled code with virtual registers. Also, no memory hierarchy was considered in the simulations.
5.1.1 Base Architecture
We used a processor with a fixed issue width to understand the effect of SIRO buffers and RF port sharing, and processors of different issue widths to validate the model. The issue width of commercial VLIW processors is usually in the range of 2 to 16, with processors using a 4 issue width, or clusters of 4 issue slots, being the most common (e.g., [Faraboschi et al., 2000; Seshan, 1998]). Therefore, we also
Issue slot 1   INT   -       MEM   -
Issue slot 2   INT   FLOAT   -     -
Issue slot 3   INT   -       MEM   -
Issue slot 4   INT   -       -     BRANCH

Table 5.1: Function unit positions.
use this range of issue widths; a 4 issue VLIW processor is used as the base processor for the fixed issue width experiments. In the 4 issue width processor, there are 4 integer units, 1 floating point unit, 2 memory units, 1 branch unit, and a 64 word register file. The function units are placed in the issue slots as shown in Table 5.1. Based on the observation from Fig. 2.3, the SIRO depth is 2 for all experiments. Experiments were performed with different numbers of RF read and write ports. In the experiments we refer to an RF configuration by its numbers of read and write ports; e.g., the 4r3w configuration represents a shared port RF with 4 read ports and 3 write ports.
5.1.2 Benchmarks
For the experiments we used two sets of benchmarks. Set I is composed of Mediabench [Lee et al., 1997] and Mibench [Guthaus et al., 2001], which consist of high-end embedded applications. Set II consists of a number of kernels and applications from the embedded systems domain with high instruction level parallelism (ILP). Transformations like loop unrolling [Davidson and Jinturkar, 1995], constant folding, and tree height reduction [Mahlke et al., 1992] are used to further enhance the ILP of the Set II applications.

The two sets of benchmarks represent different workload conditions. Set I represents standard embedded applications, while Set II represents applications which either have inherently high ILP or for which the compiler uses aggressive optimizations to extract ILP. The two sets have different resource requirements, such as the RF reads, RF writes, and FUs required in a cycle. Therefore, the two sets are suitable to evaluate the shared port RF architecture. The resource requirement can be characterized by the achievable-S ILP of the application. Notice that the achievable-S ILP of an application is independent of the processor and its issue width, though it depends on the compiler. We compute the achievable-S ILP of an application by compiling and simulating the benchmark for a very wide processor, so that resources are not the constraint; presently we use a 64-issue processor for this purpose. The ILP values for each benchmark are shown in Table 5.2.

Set I benchmarks   ILP     Set II benchmarks   ILP
basicmath          1.12    mm_int32            18.29
dijkstra           1.03    dct_int2            13.4
bitcount           1.82    sobel               10.96
blowfish           2.38    convolution         17.48
FFT                1.31    hamm                28.96
patricia           1.13    colorspace          16.76
qsort              1.26    mm_int8             4.63
sha                3.03    dct_int             7.44
g721encode         1.36    susan               4.29
g721decode         1.37    viterbi             4.78
gsmdecode          1.39    rijndael            4.4
gsmencode          2.23
unepic             1.23
rawcaudio          1.44
rawdaudio          1.52
pegwitdec          1.46
pegwitenc          1.42

Table 5.2: Benchmark characteristics.
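The achievable-S ILP computation described above can be sketched as follows; the operation and cycle counts below are invented for illustration and are not taken from the thesis.

```python
def achievable_ilp(total_ops, total_cycles):
    """Achievable-S ILP: dynamic operations executed divided by cycles,
    measured on a processor wide enough (e.g., 64-issue) that resources
    are never the constraint."""
    return total_ops / total_cycles

# Invented counts: 18290 operations over 1000 cycles give an ILP of 18.29.
print(achievable_ilp(18290, 1000))  # -> 18.29
```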
5.2 Model Validation
5.2.1 Performance Model for Fixed Issue Width Processor
Our analytical model estimates the schedule length by considering the application characteristics,
such as parallelism and resource requirement. We validate the model by code generation and simu-
lation for the corresponding architectures.
We compiled and simulated the benchmark applications for different register file configurations and normalized the execution cycles with respect to the cycles of the 8r4w configuration. The normalized cycles for all the benchmarks are then averaged. Figure 5.2 shows the average normalized cycles for different RF configurations as estimated by the performance model (Section 4.1) and as obtained by code generation and simulation for the individual RF configurations.
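The normalization and averaging step can be written down as a small sketch; the cycle counts below are hypothetical.

```python
def avg_normalized_cycles(cycles, baseline):
    """Normalize each benchmark's cycle count by its 8r4w baseline cycles,
    then average the ratios over all benchmarks."""
    ratios = [cycles[b] / baseline[b] for b in cycles]
    return sum(ratios) / len(ratios)

cycles   = {"sha": 1030, "FFT": 2100}   # hypothetical cycles on a reduced-port config
baseline = {"sha": 1000, "FFT": 2000}   # hypothetical cycles on 8r4w
print(avg_normalized_cycles(cycles, baseline))  # ≈ 1.04
```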
Figure 5.2(a) shows the performance comparison for the Set I benchmark applications. The performance estimate of the model is always within 2% of the performance obtained by simulation. For the Set II benchmarks (Fig. 5.2(b)) the difference is within 12%. These results establish two points: first, the performance model closely captures the behavior of the reduced port RF architecture; second, the proposed scheduling and binding algorithm is effective in optimizing the schedule for the proposed architecture.
5.2.2 Performance Model for Generic Processor
To validate the model proposed in Section 4.2, we simulated the high ILP benchmarks for processors with issue widths varying from 2 to 16. Each issue width was paired with four different read/write port RF configurations: first with 2N read ports and N write ports, second with N read ports and N write ports, third with N read ports and 3/4N write ports, and fourth with N/2 read ports and N/2 write ports. For each configuration, we estimate the achievable ILP using the model and obtain the actual ILP value by compilation and simulation.
The results are shown in Fig. 5.3. Lines with solid points represent estimated values while lines
with empty points are values from simulations. Average root mean square (RMS) error for all these
[Figure: average normalized cycles (y-axis) against RF configurations 2r1w through 8r4w (x-axis), comparing the model estimate with simulation.]
(a) Normalized average number of cycles for Set I.
(b) Normalized average number of cycles for Set II.
Figure 5.2: Model validation against simulation results.
[Figure: achievable ILP (y-axis) against number of issue slots 2-16 (x-axis), comparing model estimates (solid points) with simulation (empty points) for the 2N,N; N,N; N,3/4N; and N/2,N/2 read/write port configurations.]
Figure 5.3: Model validation for different issue width processors and different read/write port configurations
configurations was found to be 14.2%. We observe that for highly constrained architectures, such as those with N/2 read ports and N/2 write ports, the error is usually larger. Similarly, for the 2-issue processor the estimates deviate substantially from the simulated values. If we exclude these extreme cases, the average RMS error is 7.6%, and the average of the absolute errors for mildly constrained architectures is 5.2%.
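The RMS error metric used here can be computed as below; the model and simulation values in the example are placeholders.

```python
import math

def rms_error_percent(model, simulated):
    """Root mean square of the per-configuration relative errors, in percent."""
    errs = [(m - s) / s for m, s in zip(model, simulated)]
    return 100 * math.sqrt(sum(e * e for e in errs) / len(errs))

# Placeholder ILP values for two configurations:
print(rms_error_percent([2.0, 3.9], [2.0, 4.0]))  # ≈ 1.77
```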
5.3 Architecture Evaluation
5.3.1 Area
The proposed architecture affects the area of the processor as well as the register file. The important factors are: (a) the size of the SIRO buffers, (b) the SIRO controller, (c) the change in the number of registers in the RF due to SIRO buffers, and (d) the RF area savings due to port reduction.
As we showed in Chapter 2, SIRO buffers with depth equal to the bypass depth have no impact on the processor area. The effects of reducing the RF size and port count are well known [Wilton and Jouppi, 1996], so we do not focus on them. Instead, we study the effect of the proposed architecture on the size of the controller (Section 2.3.2).
To study the area savings due to the SIRO controller, we used a parametrized RTL model of a VLIW processor core. This core contains the dispatch, decode, and execute units; the register file and caches are not modeled. We modeled both the conventional VLIW architecture and the SIRO buffer based VLIW architecture using this base model. In the first case, with no compiler information about bypass paths, all the operand address comparisons are done in hardware. As discussed in Section 2.3.2, for an N-issue processor, N^2 address comparisons are required (the depth being 1 in this case). Each comparison generates one bit indicating whether the operand is to be read from that bypass path or not. Such bits are produced for each bypass path and each operand, as shown in Fig. 2.13(a). In the second case, the SIRO register address is available in encoded form in the instruction, and a circuit is required to decode it (Fig. 2.13(b)). To obtain area values, the RTL model of the processor core was synthesized using Synopsys Design Compiler with 0.18µm UMC libraries. The area reported here is logic area only, with no routing area included.
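The growth in comparator count that motivates the SIRO encoding can be illustrated with a small sketch; the count follows the N^2-per-depth figure quoted above, ignoring the per-operand factor.

```python
def bypass_comparators(n_issue, depth=1):
    """Operand-address comparisons needed for hardware bypass detection:
    each of the N issue slots is checked against the destinations of the
    N in-flight results in each of 'depth' bypass stages (N^2 for depth 1)."""
    return n_issue * n_issue * depth

for n in (4, 8, 16):
    print(n, bypass_comparators(n))  # quadratic growth: 16, 64, 256
```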
As discussed earlier, the number of comparisons increases with the number of function units. Thus, the advantage of SIRO addresses is expected to be greater for higher issue width processors. Table 5.3 shows the core area of VLIW processors under the two approaches, i.e., the conventional processor and the processor based on SIRO buffers. The table shows that area savings of the order of 4% of the total core area result from eliminating the bypass control overheads.
5.3.2 Number of SIRO Reads
To increase the number of reads from the SIRO buffers, we proposed two variations of the prior-
ity function in the list scheduling algorithm. Effects of the number of SIRO reads will be more
evident on high ILP applications; therefore, we used Set II for the validation. We used VLIW pro-
cessors with issue width varying from 1 to 16. All the applications were compiled for these VLIW
Issue width   Conventional processor (µm²)   Processor with SIRO buffers (µm²)   % Area saved
3             3.28e5                          3.21e5                              2.03
4             3.62e5                          3.49e5                              3.43
5             6.42e5                          6.10e5                              4.9
6             7.29e5                          7.08e5                              2.89
7             8.37e5                          8.08e5                              3.46
8             9.36e5                          8.99e5                              4.0

Table 5.3: Comparison of processor core area with/without SIRO buffer information
architectures and simulated using Trimaran. Figure 5.4 shows the average ratio of reads from the
SIRO buffer paths to the total operand reads. The ratio of SIRO buffer reads is shown for all three
variations of the list scheduling algorithm.
[Figure: average normalized reads from bypass (y-axis) against issue slots 1-16 (x-axis), for base list scheduling, two step priority, and modified priority.]
Figure 5.4: SIRO buffer reads for different issue processors.
It is clear from the results that as the issue width increases, the number of SIRO buffer reads increases. List scheduling with the proposed modified priority function leads to the maximum number of SIRO buffer reads. For a single issue processor, the increase is more than three times, while for a two issue processor the increase is approximately two times. Though for higher issue width processors the difference between the number of SIRO buffer reads produced by the different algorithms is marginal, list scheduling with the modified priority function still performs better in most cases. However, for issue widths 15 and 16, base list scheduling yields 1% more reads from the SIRO buffers.
5.3.3 Performance
5.3.4 Direct Interconnect Evaluation
We evaluated different interconnection matrices to understand the gains due to the guidelines discussed in Section 2.4.1. We used the 4r4w configuration for this experiment, which has the maximum number of write ports and a reduced number of read ports. We used a sample of 24 different interconnection matrices (out of 652) and grouped them by their RF port imbalance factor. For the 4r4w RF configuration, in a completely balanced interconnect, each RF read port is connected to two FU ports: one left FU port and one right FU port. In an interconnect configuration, if an RF port is connected to more FU ports than in the balanced interconnection, the difference is called the port imbalance factor. The RF port imbalance factor is calculated as the sum of the FU port imbalance, the left port imbalance, and the right port imbalance over all RF ports. All the configurations having an RF port imbalance of 0 form one group, configurations having an RF port imbalance of 1 form a separate group, and so on. The average percentage increase in the number of cycles for each group, over all Set I benchmarks, with respect to the 8r4w configuration is shown in Fig. 5.5. It can be clearly seen that the greater the imbalance, the greater the performance penalty. In other words, if an interconnect matrix follows all three guidelines, the performance penalty is least and may come close to that of the complete interconnection.
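The RF port imbalance computation just described can be sketched as follows; the interconnect encoding (per-slot left/right read port indices, as in Table 5.4) and the balanced per-port counts are our reading of the text, not an exact reproduction of the thesis's procedure.

```python
from collections import Counter

def rf_port_imbalance(matrix, n_read_ports):
    """Sum, over all RF read ports, of the excess left and right FU-port
    connections relative to the balanced interconnect (one left and one
    right FU port per read port in the balanced 4r4w case)."""
    left  = Counter(row[0] for row in matrix)   # left-operand connections per port
    right = Counter(row[1] for row in matrix)   # right-operand connections per port
    bal = len(matrix) / n_read_ports            # balanced connections per port, per side
    return sum(max(0, left[p] - bal) + max(0, right[p] - bal)
               for p in range(n_read_ports))

# 4r4w direct interconnect from Table 5.4: each slot -> (left read, right read)
balanced = [(0, 1), (1, 2), (2, 3), (3, 0)]
print(rf_port_imbalance(balanced, 4))  # -> 0 (perfectly balanced)
```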
With this insight, the interconnection matrices for the other RF configurations were selected based on minimum port imbalance; they are shown in Table 5.4. The three entries in each row of a configuration show the two read ports and the one write port of the RF to which a particular issue slot is connected. Some of these configurations are shown in Fig. 5.6.
[Figure: average % increase in execution cycles (y-axis) against RF port imbalance 0-7 (x-axis).]
Figure 5.5: Direct interconnection RF architecture exploration.
Config:  8r4w   8r3w   8r2w   8r1w   6r4w   6r3w   6r2w   6r1w
Slot 1:  0 1 0  0 1 0  0 1 0  0 1 0  0 1 0  0 1 0  0 1 0  0 1 0
Slot 2:  2 3 1  2 3 1  2 3 1  2 3 0  2 3 1  2 3 1  2 3 1  2 3 0
Slot 3:  4 5 2  4 5 2  4 5 0  4 5 0  4 5 2  4 5 2  4 5 0  4 5 0
Slot 4:  6 7 3  6 7 2  6 7 1  6 7 0  5 4 3  5 4 2  5 4 1  5 4 0

Config:  4r4w   4r3w   4r2w   4r1w   2r4w   2r3w   2r2w   2r1w
Slot 1:  0 1 0  0 1 0  0 1 0  0 1 0  0 1 0  0 1 0  0 1 0  0 1 0
Slot 2:  1 2 1  1 2 1  1 2 1  1 2 0  1 0 1  1 0 1  1 0 1  1 0 0
Slot 3:  2 3 2  2 3 2  2 3 0  2 3 0  0 1 2  0 1 2  0 1 0  0 1 0
Slot 4:  3 0 3  3 0 2  3 0 1  3 0 0  1 0 3  1 0 2  1 0 1  1 0 0

Table 5.4: Interconnect matrices for different direct RF configurations
Evaluation of Scheduling and Binding Efficiency
To show the effectiveness of our scheduler in performing SIRO architecture specific optimizations,
we compare the proposed scheduling and binding algorithm with the results of applying only RPA
[Figure: direct RF-FU interconnection diagrams for the (a) 8r4w, (b) 6r3w, (c) 4r2w, and (d) 2r1w configurations, showing FU1-FU4 connected to the register file read and write ports.]
Figure 5.6: Different direct RF configurations
scheduling algorithm without updating the values r1_i and r2_i as per Eq. (3.5) (referred to as the naive algorithm in Fig. 5.8). Thus, the naive algorithm does everything needed to ensure the correctness of scheduling and binding, but does not use the information that values are available in the SIRO buffers. Comparing the naive algorithm with the proposed algorithm for the complete interconnection, we observe that as the number of RF read or write ports decreases, the performance of the naive algorithm deteriorates rapidly while our algorithm is able to cope.
[Figure: average normalized cycles (y-axis) for the configuration groups (2r1w, 4r2w, 6r3w, 8r4w), (2r4w, 4r4w, 6r4w, 8r4w) and (8r1w, 8r2w, 8r3w, 8r4w) (x-axis), comparing the complete interconnect with the direct interconnect.]
(a) Normalized average number of cycles for Set I benchmarks.
(b) Normalized average number of cycles for Set II benchmarks.
Figure 5.7: Performance evaluation.
[Figure: average normalized cycles (y-axis) for RF configurations 8r4w through 2r1w (x-axis), comparing the RPA algorithm (complete interconnect) with the naive algorithm.]
Figure 5.8: Effectiveness of the RPA scheduling algorithm with respect to a naive algorithm (Set II benchmarks)
The performance predicted by the model is compared with the performance of the complete interconnect architecture, since the performance model does not capture the RF-FU interconnections. However, the model may also be used for estimating the performance of direct interconnects, as the performance difference between the complete interconnection and the direct interconnection architecture is marginal. To demonstrate this, Fig. 5.7 shows the average normalized cycles for different RF configurations for the direct interconnect and the complete interconnect.
Figures 5.7(a) and 5.7(b) show the average normalized number of cycles for selected configurations with the direct interconnect and complete interconnect architectures. Both figures have three pairs of curves: the first pair is for a decreasing number of write ports, the second for a decreasing number of read ports, and the third for both read and write ports decreasing. The complete interconnect always performs better, because of the absence of path conflicts. For the Set II benchmarks, the average number of cycles over all benchmarks in the direct interconnect architecture is within 8% of that in the complete interconnect architecture, for any RF configuration. For the Set I benchmarks this figure is 2%.
The first pair of graphs in Figs. 5.7(a) and 5.7(b) shows the variation of the average normalized cycles with a decreasing number of write ports. The number of cycles increases by 60% for a single write port, but only by 9% and 20% for the three and two write port cases, respectively. This shows that performance deteriorates rapidly as the number of write ports decreases. On the other hand, reducing the read ports affects performance less than reducing the write ports. The second pair of graphs in the same figures shows the performance variation with the read ports, keeping the write ports fixed. There is almost no difference between the performance of the 6 read port and 8 read port configurations, and for the 4 read port configuration the performance loss is marginal at 5%. The increase in the number of cycles is significant only when the read ports are reduced to 2. The third pair of graphs shows the impact on the number of cycles when both read and write ports decrease in the same ratio. For these configurations the increase in the number of cycles is due to both read port and write port reduction; therefore, the overall increase in the number of cycles for the 2r1w configuration is much higher than the increases due to the individual read and write effects.
There is a marked difference between the performance observed for the Set I benchmarks (Fig. 5.7(a)) and the Set II benchmarks (Fig. 5.7(b)). The performance penalty due to port and path conflicts is much larger for Set II than for Set I because of the higher demand for read and write ports in those applications.
5.3.5 Energy
We used Cacti 4.0 [Tarjan et al., 2006] to estimate the read and write access energy of the direct interconnect RF. For the complete interconnect RF, we modeled the interconnects in Cacti. We observe that a significant fraction of the energy is consumed in multiplexing for the complete interconnection; therefore, the energy per access in the complete interconnect is always more than in the corresponding direct interconnect RF configuration.
Using Equation (4.5), the RF energy of the different configurations is calculated, normalized with respect to the energy of the standard RF, and shown in Fig. 5.9. We assumed 0.13µm technology, a temperature of 85°C [Skadron et al., 2004], a 1 GHz frequency, and a 64-bit, 64-word RF for calculating the energy. The standard RF is similar to the 8r4w configuration but does not inhibit RF reads/writes when values are available in the SIRO buffers. By avoiding these redundant reads and writes, the 8r4w configuration in our case saves 40% relative to the standard RF energy for both Set I and Set II benchmarks. In the other configurations, the total energy saving comes both from the avoidance of redundant reads and writes and from the reduced ports of the RF. There may be an increase in total energy due to the additional leakage energy of extra execution cycles. We observe that the direct interconnect topology is always more energy efficient than the complete interconnect topology, due to its lower energy per access. For both Set I and Set II benchmarks, the 2r1w configuration with direct interconnection is the most energy efficient configuration, saving 75% and 66% of energy, respectively.
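A sketch in the spirit of Eq. (4.5): total RF energy as dynamic access energy plus leakage over the execution cycles. The exact equation is in Chapter 4, and all numbers below are placeholders rather than Cacti outputs.

```python
def rf_energy(reads, writes, cycles, e_read, e_write, p_leak, t_cycle):
    """Dynamic energy of the actual RF accesses plus leakage over the run."""
    return reads * e_read + writes * e_write + p_leak * cycles * t_cycle

# Placeholder numbers: a reduced-port configuration performs fewer RF
# accesses (many reads come from SIRO buffers) at lower energy per access,
# but runs a few more cycles, adding leakage energy.
full    = rf_energy(1_000_000, 500_000, 1_000_000, 2.0e-12, 2.5e-12, 5e-3, 1e-9)
reduced = rf_energy(400_000, 150_000, 1_050_000, 1.2e-12, 1.5e-12, 3e-3, 1e-9)
print(reduced / full)  # normalized RF energy, well below 1
```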
Figure 5.10 shows the normalized RF energy of the 8r4w configuration and of the 4r4w configuration with the complete interconnection and the direct interconnection, for each benchmark. The number of reads from the SIRO buffers is similar for the direct interconnect and the complete interconnect. The energy saving in the 8r4w configuration is due to fewer reads and writes; therefore, 'rijndael', with the fewest reads/writes from the RF, saves the most energy. For some benchmarks like 'gsmdecode' the energy saving of the 4r4w configuration with respect to the 8r4w is high, while for 'gsmencode' it is low. The reduction in RF dynamic access energy is similar in the two cases; the difference comes from the leakage energy, which depends on the number of cycles.
There can be several metrics for choosing an optimal configuration. For example, if energy is the only criterion, the 2r1w configuration with direct interconnect is best. If the energy-delay product is also considered, then the 4r4w configuration with direct interconnect is the best configuration for the Set II benchmarks, and the 2r1w configuration with direct interconnect is best for Set I. If a minimum acceptable performance loss is the criterion, then other configurations may be preferred. In all cases, the shared port RF architecture is beneficial in terms of energy.
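The metric-dependent choice can be illustrated with hypothetical normalized numbers; the ordering mirrors the Set II result quoted above, but the values themselves are invented.

```python
configs = {              # config -> (normalized RF energy, normalized cycles); invented
    "8r4w": (0.60, 1.00),
    "4r4w": (0.45, 1.05),
    "2r1w": (0.30, 1.60),
}

best_energy = min(configs, key=lambda c: configs[c][0])                  # energy only
best_edp    = min(configs, key=lambda c: configs[c][0] * configs[c][1])  # energy-delay product
print(best_energy, best_edp)  # -> 2r1w 4r4w
```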
[Figure: average normalized RF energy (y-axis) against RF configurations 2r1w through 8r4w (x-axis), comparing the complete interconnect with the direct interconnect.]
(a) Normalized average RF energy for benchmarks in Set I.
(b) Normalized average RF energy for benchmarks in Set II.
Figure 5.9: Normalized average RF energy for the direct and the complete interconnect topologies.
[Figure: normalized RF energy per benchmark for the 8r4w configuration, the 4r4w configuration with complete interconnect, and the 4r4w configuration with direct interconnect.]
Figure 5.10: Normalized RF energy for different benchmarks.
5.4 Summary
In this chapter we discussed our experimental setup and implementation framework. We performed experiments with a fixed issue processor of issue width 4. Our experiments suggest that with the proposed architecture we can save up to 4% of the processor core area due to the simplified bypass control of SIRO buffers. Apart from that, the register file sees a significant decrease in size with port reduction. The experiments also reveal that using SIRO buffers we can avoid around 60% of the RF reads and 70% of the RF writes, on average.
Our study shows that the complete interconnection is less energy efficient than the direct interconnection. Though the direct interconnection imposes more compiler constraints, the performance losses are within 2% of the complete interconnection topology; compiler support is important for the architecture. Greater port reduction incurs larger performance penalties, though these configurations offer higher energy savings. The shared port architecture leads to more than 60% savings in RF energy. The number of ports in the RF can hence be selected on the basis of an energy budget or a performance budget.
6 Varying Issue Width and Scalability
In the previous chapter we established experimentally that the proposed architecture and compiler algorithms perform well and yield significant savings in RF energy. Those experiments were performed on a 4-issue processor. In this chapter, we study the performance of the proposed architecture at different issue widths.
6.1 RF and Processor Scalability
Multiple function units (FUs) are used in processors to exploit instruction level parallelism. With an increase in the number of FUs, the processor architecture should ideally scale, that is, area and power should increase linearly and the cycle time should remain constant. Studies suggest that in both superscalar and VLIW architectures, increasing the number of FUs results in poor scaling of the processor [Palacharla et al., 1996; Terechko et al., 2005].

In VLIW processors, the least scalable component is the multi-port register file [Capitanio et al., 1992]. In a classical VLIW design, if a function unit requires 2 read ports and 1 write port, then 2N read ports and N write ports are required in the register file of an N-issue processor. The area, power and access time of such a heavily ported RF scale very poorly [Rixner et al., 2000].
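A rough illustration of why the classical organisation scales poorly: each port adds roughly one wordline and one bitline per cell, so cell area grows about quadratically with the total port count. The quadratic form is a common first-order model (cf. Rixner et al. [2000]), not the thesis's own equation.

```python
def relative_rf_area(n_issue):
    """First-order RF area model: cell area ~ (total ports)^2 for the
    classical 2N-read, N-write organisation of an N-issue processor."""
    ports = 2 * n_issue + n_issue
    return ports ** 2

print(relative_rf_area(8) / relative_rf_area(4))  # -> 4.0: 2x issue width, ~4x RF area
```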
In previous studies of VLIW processors, clustering is the accepted solution to the register file scalability problem. Commercial VLIW processors with high issue widths are also clustered. For example, Trimedia [van Eijndhoven et al., 1999] is a five issue, two cluster processor; TI's TMS3206x [Seshan, 1998] is an 8 issue, two cluster processor; and ST's Lx architecture [Faraboschi et al., 2000] has a configurable number of clusters, each with an issue width of 4. Clustering involves multiple register files, with only a subset of the FUs connected to each. Accessing a data element of one cluster from another involves inter-cluster move operations, which increase the number of cycles and affect performance.
Our proposed architecture is a better alternative for RF scalability than clustered architectures, as it consumes less energy and gives higher performance. As discussed in previous chapters, the scalability of our proposed architecture comes from (a) distributed local buffers, (b) the small depth of the local buffers, and (c) the reduced port second level RF. Due to their distributed nature and small depth, the local buffers do not add any energy or delay over the conventional VLIW architecture, while the reduced port RF helps bring down the energy, area and delay cost. In addition to the architecture, the compiler's scheduling and binding algorithms also help minimize the performance loss.
In this chapter we show that processors with our proposed RF architecture are scalable in perfor-
mance and energy with respect to issue width. Finally, we compare the scalability of reduced port
architecture and clustered VLIW architecture.
6.1.1 Related Work in Processor Scalability
Apart from the register file, other unscalable components of a VLIW processor are FU-FU/FU-RF
interconnect and issue logic [Palacharla et al., 1996; Gangwar et al., 2007; Zhong et al., 2005].
For high issue widths, Palacharla et al. [1996] suggest that the FU-FU interconnection network can lie on the critical path. Gangwar et al. [2007] showed that the FU-RF interconnection and inter-cluster communication do not scale well with bus based interconnects; their study also suggests that the issue logic will be on the critical path for very high issue processors. The scalability problem in the issue logic is also identified by Zhong et al. [2005], who suggested distributed issue logic for a clustered VLIW processor.

Scalability issues have also been studied in other architectural paradigms, such as transport triggered architectures [Corporaal, 1999], stream processors [Khailany et al., 2003] and multiprocessors [Taylor et al., 2001].
Benchmark     Description                               ILP
Matrix32      Matrix multiplication, unroll factor 32   18.29
Convolution   Convolution error control codes           17.48
Hamming       Hamming error control codes               28.96
DCT2          DCT kernel with unroll factor of 2        13.4
Sobel         A 3x3 edge filter for images              10.96
Colorspace    RGB to YUV conversion for images          16.76

Table 6.1: High ILP benchmark details
6.2 Experimental Setup
Processors with different numbers of issue slots, ranging from 2 to 16, were experimented with. A register file of depth 64 is assumed for all experiments; the remaining experimental settings are the same as in Chapter 5. Three reduced port RF configurations are used for each processor: N read-N write, N read-3/4N write, and N/2 read-N/2 write port RFs. Hereafter, these configurations will be called config1, config2, and config3, respectively. They are compared with the full port register file, that is, the 2N read, N write port RF, referred to as config0. The buffer depth is fixed at 2 for all experiments.
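The four configurations can be written down as a function of issue width N:

```python
def port_configs(n):
    """(read ports, write ports) of each RF configuration at issue width n."""
    return {
        "config0": (2 * n, n),        # full-port baseline
        "config1": (n, n),
        "config2": (n, 3 * n // 4),
        "config3": (n // 2, n // 2),
    }

print(port_configs(8))  # {'config0': (16, 8), 'config1': (8, 8), 'config2': (8, 6), 'config3': (4, 4)}
```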
For this study we chose the high ILP benchmarks shown in Table 5.2. These benchmarks are further divided into high ILP and medium ILP benchmarks: those with an achievable-S ILP of more than 8 are classified as high ILP benchmarks, and those with an achievable-S ILP between 4 and 8 as medium ILP benchmarks. This classification is relative to the issue widths of the processors we experiment with: the high ILP benchmarks have an achievable-S ILP comparable to or higher than the machine parallelism offered by processors of issue width 2 to 16. Details of the benchmarks and their ILP are given in Tables 6.1 and 6.2.
Benchmark   Description                               ILP
Matrix8     Matrix multiplication (unroll factor 8)   4.63
Viterbi     Convolution error control decoder         4.78
Susan       An image processing filter                4.29
DCT         DCT kernel                                7.44
Rijndael    Encryption algorithm                      4.4

Table 6.2: Medium ILP benchmark details
6.3 Performance
The performance of a processor can be defined in various ways. For example, the number of execution cycles can be one metric, while the total time spent by the application (the number of execution cycles multiplied by the clock period) can be another. Performance in terms of execution cycles does not take the clock period into account; in other words, it assumes the clock period is constant.

The processor with a full ported RF has no port conflicts and will therefore always execute in fewer cycles than the reduced port RF configurations. We compare the two to observe the performance loss in terms of the number of cycles. Performance is defined as the inverse of the number of cycles, normalized with respect to the performance of a single issue processor, and the values are averaged over the set of benchmarks.
Figure 6.1(a) shows the performance in terms of the number of cycles for the high ILP benchmarks. We observe that config0, config1 and config2 have very similar performance. Config1 has a performance loss in the range of 1-8% with respect to config0, the maximum occurring at issue width 6 and the minimum at issue width 16. Config2 has a slightly higher performance loss; for example, it is 1.5% for the 16-issue and 14% for the 6-issue processor. Config3 suffers a greater performance loss as the number of issue slots increases, in the range of 32-53% with respect to config0.
The lower performance loss for high issue processors is attributed to the fact that the percentage of SIRO reads increases with the issue width. The percentage of SIRO reads at a particular issue slot is a function of the ILP present in the application; due to the higher ILP in the benchmark applications, the percentage of operands read from SIRO buffers is higher at higher issue widths.

For the medium ILP benchmarks, config1 and config2 have their maximum performance loss with respect to config0, 9%, for the 2-issue processor; in the remaining cases they exhibit less than 9% performance loss. Config1 always performs slightly better, but the difference is not significant. For the medium ILP benchmarks too, config3 suffers a loss of up to 30% for the 2-issue and 4-issue processors.

From these experiments we can conclude that our proposed RF architecture may increase the number of execution cycles by up to 10%, if the port reduction is not drastic, which is quite acceptable.
We now focus on performance in terms of total execution time. The total time taken by an application is the product of the number of cycles and the cycle time. The cycle time is a complex function of the various pipeline stages of the processor. As the issue width increases, the register file access time increases. Though the cycle time is not determined by a single pipeline stage, we assume that it depends on the RF access time; this assumption approximates the effect of increased RF access time on overall performance.
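The overall performance metric used below, the inverse of the cycle count times the RF access time, can be sketched with hypothetical values:

```python
def overall_performance(cycles, rf_access_time):
    """Overall performance = 1 / (execution cycles * RF access time)."""
    return 1.0 / (cycles * rf_access_time)

# Hypothetical values, normalized to a single-issue reference:
base = overall_performance(cycles=10_000, rf_access_time=1.0)  # 1-issue reference
wide = overall_performance(cycles=2_000, rf_access_time=1.8)   # wider issue, slower RF
print(wide / base)  # ≈ 2.78: fewer cycles, partly offset by slower RF access
```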
The overall performance, the inverse of the product of the number of cycles and the RF access time, is shown in Fig. 6.2(a) for the high ILP benchmarks. The RF access time increases almost linearly from the single issue to the 16-issue processor; for the 16-issue processor it is almost double that of the single issue processor in config0, while for config1 and config2 the increase is smaller. Due to the combined effect of the number of cycles and the RF access time, we observe that config0 saturates in overall performance after an issue width of 12 and starts decreasing for issue widths greater than 14. Config1 and config2 perform better than config0 for issue widths of more than 8; overall, config2 performs the best. Config3, due to its larger increase in cycles, does not perform better even when the cycle time is taken into account; only for the 16-issue processor does it perform better than config1 and config2.
For medium ILP applications, the processor with no port reduction performs the best for 2 issue
6 Varying Issue Width and Scalability
[Line plot: normalized performance (number of cycles) versus issue slots (0 to 16) for RF:2N,N (config0), RF:N,N (config1), RF:N,3/4N (config2), and RF:N/2,N/2 (config3).]
(a) Performance of high ILP benchmarks.
[Line plot: same axes and configurations.]
(b) Performance of medium ILP benchmarks.
Figure 6.1: Performance for different issue width processors.
[Line plot: normalized performance (cycles x RF access time) versus issue slots (0 to 16) for RF:2N,N (config0), RF:N,N (config1), RF:N,3/4N (config2), and RF:N/2,N/2 (config3).]
(a) Normalized cycle-delay product of high ILP benchmarks.
[Line plot: same axes and configurations.]
(b) Normalized cycle-delay product of medium ILP benchmarks.
Figure 6.2: Normalized cycle-delay product for different issue width processors
width. Config1 performs the best for processors with issue widths 4 and 6. Config2 and config3 perform similarly for the 8 issue width processor, and config3 is the best for issue widths of 8 and above. This happens because performance in terms of the number of cycles (Fig. 6.1) does not improve beyond the 8 issue processor, and the access time of the RF dominates when the issue width exceeds 8.
6.4 Energy
RF energy is computed as the read and write energy for all RF reads and writes. Leakage energy also contributes to the total RF energy; it depends on the total execution time of the application and, therefore, on the number of cycles. Three factors affect the total RF energy: first, the increase in read and write energy per access; second, the number of SIRO reads and avoided writes; and third, the number of cycles, which determines the leakage energy. Note that the total number of reads and writes (the sum of RF reads and SIRO reads) is constant for a given application.
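These factors can be folded into a simple back-of-the-envelope model. The sketch below is only illustrative: the per-access energies, leakage power, and access counts are hypothetical placeholders, not values from our experiments.

```python
def total_rf_energy(rf_reads, rf_writes, e_read, e_write,
                    cycles, leak_per_cycle):
    """Total RF energy = dynamic access energy + leakage.

    rf_reads / rf_writes: accesses that actually reach the RF
    (operands served from SIRO buffers are excluded).
    e_read / e_write: per-access energies, smaller for fewer ports.
    leak_per_cycle: leakage energy per cycle, so leakage scales with
    the execution time of the application.
    """
    dynamic = rf_reads * e_read + rf_writes * e_write
    leakage = cycles * leak_per_cycle
    return dynamic + leakage

# Hypothetical comparison: a reduced port configuration pays a few extra
# cycles (more leakage) but has cheaper and fewer RF accesses.
config0 = total_rf_energy(8e6, 4e6, e_read=1.0, e_write=1.2,
                          cycles=1.0e6, leak_per_cycle=0.05)
reduced = total_rf_energy(5e6, 3e6, e_read=0.6, e_write=0.7,
                          cycles=1.1e6, leak_per_cycle=0.05)
```

With these placeholder numbers, the reduced port configuration comes out well ahead despite the longer run time, mirroring the behavior discussed below.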
As the number of issue slots increases, the first factor increases the RF energy while the second and third factors decrease it. Figure 6.3 shows the normalized RF energy of different issue width processors for the different configurations. The RF energy of config0 is less than that of a conventional RF owing to the avoided reads and writes. Config1, config2, and config3 have reduced energy per access; moreover, the number of RF reads and writes avoided due to SIRO buffers is larger than for config0, because more SIRO reads are scheduled in the reduced port configurations to minimize the performance penalty of the reduced ports. Due to the increased SIRO reads and writes and the reduced RF access energy, the total energy at higher issue widths is much lower than the config0 energy. The leakage factor is not prominent enough to change the behavior of the total energy, because the three reduced port configurations perform similarly.
For both high ILP and medium ILP benchmarks, we do not see much difference in the behavior of the three reduced port configurations.
The above result does not contradict the finding of [Rixner et al., 2000] that the power consumption of a register file increases as N^3 with the number of issue slots. They suggest
three factors contributing to it: first, the increase in RF energy per access with more ports; second, the increase in the number of registers with increased parallelism; and third, more parallel operations. Since we assume the same number of registers in the register file for all issue widths, power increases only in the order of N in our case.
6.5 Clustered VLIW and Scaling
As discussed before, clustering is the default scaling approach in high issue width VLIW architectures. In this section we show that the reduced port architecture is, in most cases, more effective in terms of both performance and energy.
For this study we use 4 issue, 8 issue, 12 issue, and 16 issue processors with 2 clusters and 4 clusters, and compare them with our proposed reduced port configuration, that is, N read ports and 3/4N write ports. The FUs in each cluster are assumed to be uniform, and the interconnection mechanism in the clustered architecture is bus based. The total number of function units in any clustered processor is the same as in the corresponding reduced port or monolithic RF configuration. The application set is the same as that used for the other experiments in this chapter.
As the compiler frameworks for the clustered VLIW and the proposed reduced port architecture are different, we normalized the performance with respect to the number of cycles in the architecture with a monolithic RF. From Fig. 6.4, it is clear that for high ILP applications the reduced port architecture performs better than clustering, except for the 4 issue and 8 issue processors with 2 clusters. For the 2 cluster 8 issue processor, the performance difference between the reduced port and clustered architectures is marginal. With 4 clusters, only the 4 issue processor performs better than the reduced port architecture, and that too marginally.
Clustering performed better for the 4 issue width processor because of its approach of utilizing slack for inter-cluster move operations. In high ILP applications there is a large amount of slack available for scheduling, since the available resources are few, and this slack is utilized for inter-cluster move operations. Therefore, in spite of 21% of the operations being inter-cluster moves, the number of execution cycles increased by only 2%. The reduced port architecture, which banks on the availability of
[Line plot: normalized RF energy versus issue slots (0 to 16) for RF:2N,N (config0), RF:N,N (config1), RF:N,3/4N (config2), and RF:N/2,N/2 (config3).]
(a) Total RF energy of high ILP benchmarks.
[Line plot: same axes and configurations.]
(b) Total RF energy for medium ILP benchmarks.
Figure 6.3: Total RF energy for different issue width processors
operands from the SIRO buffers, gets fewer operands from the SIRO buffers when the ILP of the application is large and the available FUs are far fewer.
For the same reason, for medium ILP applications the reduced port architectures always perform better than the clustered RF architectures.
[Bar charts: normalized performance of 4-issue, 8-issue, 12-issue, and 16-issue processor configurations, comparing monolithic RF, proposed RF, 2 cluster, and 4 cluster architectures, for high ILP and medium ILP benchmarks.]
Figure 6.4: Normalized performance for clustered VLIW processors
Apart from the performance loss in clustered VLIW processors, there is an increase in the number of operations due to inter-cluster moves. We observe that inter-cluster move operations account for 15 to 35% of the total operations; in other words, there is an 18 to 50% increase in the number of operations. This increase in the number of operations has a direct impact on the total energy of a VLIW processor. Therefore, in terms of energy as well, the clustered architectures are not favourable.
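The mapping from the move fraction to the increase in operation count can be sketched as follows. The arithmetic assumes that inter-cluster moves are counted as a fraction of the inflated total operation count; under that assumption a 15% move fraction yields roughly an 18% increase and a 35% fraction roughly a 54% increase, matching the quoted range up to rounding.

```python
def operation_increase(move_fraction):
    """If inter-cluster moves make up a fraction m of the total (inflated)
    operation count, the increase over the original count is m / (1 - m)."""
    return move_fraction / (1.0 - move_fraction)

low = operation_increase(0.15)   # about 0.18, i.e. ~18% more operations
high = operation_increase(0.35)  # about 0.54, i.e. ~50% more operations
```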
6.6 Summary
From the different experiments we conclude that, on the whole, our proposed architecture performs better than the conventional RF architecture at higher issue widths. The increase in the number of cycles is also within the acceptable limit of 10%. The proposed architecture is also more scalable than the conventional architecture in terms of energy. In comparison to the clustered RF architecture, the proposed architecture is better in terms of both performance and energy at higher issue widths.
7 Conclusions and Future Work
7.1 Contributions and Major Results
In this thesis we worked towards energy reduction and scalability of multiported register files in VLIW processors. We proposed novel register file architectures together with compiler algorithms for them. The contributions of this thesis are summarized below:
• We proposed local buffer based RF architectures that explore the possibilities of reducing RF reads and writes. We studied and critically analyzed the architectural implications of RISO, SIRO, and RIRO based local buffers. We showed that with the SIRO architecture it is easy to increase the bypass depth without much increase in delay or effect on the clock period. Apart from saving energy through fewer accesses to the second level RF, the approach saves up to 4% of the processor core area due to simpler bypass control logic.
• We proposed a reduced port RF architecture for VLIW processors. We studied different RF-FU interconnects and argued in favor of direct interconnections. Though the complete interconnection topology avoids all path conflicts, its hardware complexity is higher. For the direct interconnection architecture, we showed how to choose an appropriate interconnection matrix in order to minimize path conflicts.
• We proposed scheduling and binding algorithms that (a) increase the number of reads and writes from SIRO buffers and reduce reads and writes from the second level RF, (b) avoid performance loss due to the reduced port second level RF by taking the reduced RF traffic into account, and (c) avoid performance loss due to the direct interconnect through intelligent FU binding. The results show that with our approach the number of reads and writes in the SIRO buffers increases significantly, and the performance loss due to port reduction and direct interconnect is within 5% for the Mediabench and MiBench benchmark suites.
• We proposed two theoretical models for predicting the performance and energy of the proposed architecture. One model takes the numbers of RF read and write ports as inputs and assumes a fixed processor issue width; the second additionally takes the number of issue slots as an input. Both models take application characteristics as inputs. Along with performance and energy estimates, the models give insight into the factors that affect performance and energy.
• We implemented the scheduling and binding algorithms in the Trimaran compiler framework. Our experiments show that 40% of RF energy can be saved due to SIRO buffers, and up to 70% of RF energy can be saved with the reduced port RF architecture in a 4 issue processor. Our experiments with different issue width processors show that the proposed RF architecture is scalable in terms of both performance and energy.
7.2 Future Work
In this thesis we showed the scalability of the register file in a high issue VLIW processor. This work can be extended by considering other components of VLIW processors for scalability, the most important of which is the interconnect. FU-FU interconnects form the bypass network, and in very high issue VLIW processors they can become the bottleneck. FU-FU interconnects can be localized, i.e., only physically neighboring FUs are connected to each other. This type of topology is called partial bypass and has been studied as an architectural constraint. Partial bypass combined with reduced ports may give higher scalability and can therefore be studied in this new context.
Another possible extension of the work is combining the reduced port RF architecture with the clustered RF architecture. In a reduced port clustered architecture, the ports of each RF bank would be reduced. Such an architecture would be suitable for scaling the processor beyond 16 issue slots.
The register allocation algorithm can also be integrated with the proposed scheduling and binding algorithms for further exploration. Integration of register allocation will be extremely important for exploring the "random in random out" (RIRO) based RF architecture.
The techniques proposed in this thesis can also be extended to dynamically scheduled multi-issue processors; in that case, port and path conflict management has to be done by hardware at run time, and it would be interesting to investigate the application of compiler techniques in that setting.
References
Silicon Hive. http://www.siliconhive.com.
Tilera. http://www.tilera.com.
S. Aditya, B. R. Rau, and V. Kathail. Automatic architectural synthesis of VLIW and EPIC proces-
sors. In International Symposium on System Synthesis, pages 107–113, 1999.
A. Aggarwal and M. Franklin. Energy efficient asymmetrically ported register files. In International
Conference on Computer Design, pages 2 – 7, 2003.
P. Ahuja, D. Clark, and A. Rogers. The performance impact of incomplete bypassing in processor
pipelines. In Proceedings. 28th Annual International Symposium on Microarchitecture, pages
36–45, 1995.
K. Asanovic, M. Hampton, R. Krashinsky, and E. Witchel. Power Aware Computing. Kluwer
Academic/Plenum Publishers, June 2002.
J. Ayala, M. Lopez-Vallejo, and A. Veidenbaum. A compiler-assisted banked register file architec-
ture. In IEEE Workshop on Application Specific Processors, 2004.
A. Baghdadi, N. Zergainoh, W. Cesario, T. Roudier, and A. Jerraya. Design space exploration
for hardware/software codesign of multiprocessor systems. In Proceedings. 11th International
Workshop on Rapid System Prototyping, pages 8–13, 2000.
M. Balakrishnan and H. Khanna. Allocation of FIFO structures in RTL data paths. ACM Transactions on Design Automation of Electronic Systems (TODAES), 5(3), 2000.
R. Balasubramonian, S. Dwarkadas, and D. H. Albonesi. Reducing the complexity of the register
file in dynamic superscalar processors. In Proceedings. 34th Annual International Symposium on
Microarchitecture, pages 237 – 248, 2001.
A. Capitanio, N. Dutt, and A. Nicolau. Partitioned register files for VLIWs: A preliminary analysis
of tradeoffs. In Proceedings. 25th Annual International Symposium on Microarchitecture, 1992.
L. N. Chakrapani, J. Gyllenhaal, W.-m. W. Hwu, S. A. Mahlke, K. V. Palem, and R. M. Rabbah. Trimaran: An infrastructure for research in instruction-level parallelism. In Lecture Notes in Computer Science, volume 3602, pages 32–41. Springer Berlin / Heidelberg, 2005.
P. P. Chang, S. A. Mahlke, W. Y. Chen, N. J. Warter, and W.-m. W. Hwu. IMPACT: an architectural
framework for multiple-instruction-issue processors. SIGARCH Comput. Archit. News, 19(3):
266–275, 1991. ISSN 0163-5964.
H. Corporaal. TTAs: Missing the ILP complexity wall. Journal of Systems Architecture, 36(12),
1999.
J. L. Cruz, A. Gonzalez, M. Valero, and N. P. Topham. Multiple-banked register file architectures.
In International Symposium on Computer Architecture, pages 316–325, 2000.
O. M. D’Antona and E. Munarini. A combinatorial interpretation of punctured partitions. Journal
of Combinatorial Theory Series A, 91(1-2):264 – 282, 2000.
J. Davidson and S. Jinturkar. Improving instruction-level parallelism by loop unrolling and dynamic
memory disambiguation. In Proceedings. 28th Annual International Symposium on Microarchi-
tecture, 1995.
O. Ergin, D. Balkan, K. Ghose, and D. Ponomarev. Register packing: Exploiting narrow-width operands for reducing register file pressure. In Proceedings. 37th Annual International Symposium on Microarchitecture, pages 304–315, 2004.
K. Fan, N. Clark, M. Chu, K. Manjunath, R. Ravindran, M. Smelyanskiy, and S. Mahlke. Systematic
register bypass customization for application-specific processors. In IEEE 14th International
Conference on Application-specific Systems, Architectures and Processors (ASAP), June 2003.
K. Fan, M. Kudlur, H. Park, and S. Mahlke. Cost sensitive modulo scheduling in a loop accelerator
synthesis system. In Proceedings. 38th Annual International Symposium on Microarchitecture,
2005.
P. Faraboschi, G. Brown, J. A. Fisher, G. Desoli, and F. Homewood. Lx: A technology platform for customizable VLIW embedded processing. In Proceedings of the 27th International Symposium on Computer Architecture, pages 203–213, June 2000.
K. Farkas, P. Chow, N. Jouppi, and Z. Vranesic. The multicluster architecture: reducing cycle time
through partitioning. In Proceedings. 30th Annual International Symposium on Microarchitec-
ture, pages 149–159, Dec 1997.
M. Fernandes. A clustered VLIW architecture based on queue register files. PhD Thesis, University
of Edinburgh, 1998.
J. Fridman and Z. Greenfield. The TigerSHARC DSP architecture. IEEE Micro, pages 66–76, Jan-Feb 2000.
A. Gangwar. A Methodology For Exploring Communication Architectures of Clustered VLIW Pro-
cessors. PhD thesis, Department of Computer Science, IIT Delhi, 2005.
A. Gangwar, M. Balakrishnan, and A. Kumar. Impact of intercluster communication mechanisms
on ilp in clustered VLIW architectures. ACM Transactions on Design Automation of Electronic
Systems (TODAES), 12(1), 2007.
R. Gonzalez, A. Cristal, D. Ortega, A. Veidenbaum, and M. Valero. A content aware integer register
file organization. In International Symposium on Computer Architecture, 2004a.
R. Gonzalez, A. Cristal, M. Pericas, A. Veidenbaum, and M. Valero. Scalable distributed register
file. In WCED, 2004b.
M. R. Guthaus, J. S. Ringenberg, D. Ernst, T. M. Austin, T. Mudge, and R. B. Brown. MiBench: A free, commercially representative embedded benchmark suite. In IEEE 4th Annual Workshop on Workload Characterization, Dec. 2001.
J. Hoogerbrugge and H. Corporaal. Register file port requirements of transport triggered archi-
tectures. In Proceedings. 27th Annual International Symposium on Microarchitecture, pages
191–195, 1994.
N. P. Jouppi. The nonuniform distribution of instruction-level and machine parallelism and its effect
on performance. IEEE Trans. Comput., 38(12):1645–1658, 1989.
V. Kathail, M. Schlansker, and B. R. Rau. HPL-PD architecture specification: Version 1.1. Technical
Report HPL-93-80R1, 2000.
R. E. Kessler. The Alpha 21264 microprocessor. IEEE Micro, pages 24–36, March-April 1999.
B. Khailany, W. Dally, S. Rixner, U. Kapasi, J. Owens, and B. Towles. Exploring the VLSI scala-
bility of stream processors. In International Conference on High Performance Computer Archi-
tecture, 2003.
N. S. Kim and T. Mudge. Reducing register ports using delayed write-back queues and operand
pre-fetch. In International Conference on Supercomputing, pages 172–182, 2003.
M. Kondo and H. Nakamura. A small, fast and low-power register file by bit-partitioning. In
International Symposium on High Performance Computer Architecture, 2005.
M. Kudlur, K. Fan, M. Chu, R. Ravindran, N. Clark, and S. Mahlke. Flash: Foresighted latency-
aware scheduling heuristic for processors with customized datapaths. In CGO ’04: Proceedings
of the international symposium on Code generation and optimization, pages 201 – 212, 2004.
A. Lambrechts, P. Raghavan, A. Leroy, G. Talavera, T. V. Aa, M. Jayapala, F. Catthoor, D. Verk-
est, G. Deconinck, H. Corporaal, F. Robert, and J. Carrabina. Power breakdown analysis for a
heterogeneous NoC platform running a video application. In IEEE International Conference on
Application-Specific Systems, Architecture Processors, pages 179–184, 2005.
C. Lee, M. Potkonjak, and W. H. Mangione-Smith. Mediabench: a tool for evaluating and synthesiz-
ing multimedia and communications systems. In International Symposium on Microarchitecture,
pages 330–335, 1997.
J. Llosa, M. Valero, and E. Ayguade. Non-consistent dual register files to reduce register pressure.
In International Symposium on High-Performance Computer Architecture, page 22, 1995.
J. Llosa, M. Valero, J. Fortes, and E. Ayguade. Using sacks to organize register files in VLIW
machines. In CONPAR, 1994.
S. Mahlke, W. Chen, J. Gyllenhaal, W. Hwu, P. Chang, and T. Kiyohara. Compiler code transformations for superscalar-based high-performance systems. In Proceedings of Supercomputing, pages 808–817, Nov 1992.
C. McNairy and D. Soltis. Itanium 2 processor microarchitecture. IEEE Micro, pages 44–55, 2003.
R. Nalluri, R. Garg, and P. R. Panda. Customization of register file banking architecture for low
power. In International Conference on VLSI Design and Embedded Systems, pages 239–244,
2007.
D. B. Noonburg and J. P. Shen. Theoretical modeling of superscalar processor performance. In
Proceedings. 27th Annual International Symposium on Microarchitecture, pages 52–62, 1994.
E. Ozer, S. Sathaye, K. Menezes, S. Banerjia, M. Jennings, and T. Conte. A fast interrupt handling
scheme for VLIW processors. In Proceedings of International Conference on Parallel Architec-
tures and Compilation Techniques, pages 136–141, Oct 1998.
S. Palacharla, N. Jouppi, and J. Smith. Quantifying the complexity of superscalar processors. Technical Report CS-96-1328, University of Wisconsin-Madison, November 1996.
S. Palacharla, N. P. Jouppi, and J. E. Smith. Complexity-effective superscalar processors. In Pro-
ceedings. 24th annual International Symposium on Computer Architecture, pages 206–218, 1997.
I. Park, M. D. Powell, and T. N. Vijaykumar. Reducing register ports for higher speed and lower
energy. In Proceedings. 35th Annual International Symposium on Microarchitecture, pages 171–
182, 2002.
S. Park, A. Shrivastava, N. Dutt, A. Nicolau, Y. Paek, and E. Earlie. Bypass aware instruction
scheduling for register file power reduction. In Proceedings of the conference on Language,
compilers, and tool support for embedded systems, pages 173–181, 2006.
M. Pericas, R. Gonzalez, A. Cristal, A. Veidenbaum, and M. Valero. An optimized front-end phys-
ical register file with banking and writeback filtering. In Workshop on Power Aware Computer
System, 2004.
G. Reinman. Using an operand file to save energy and to decouple commit resources. IEE Proceedings - Computers and Digital Techniques, 152(5), 2005.
S. Rixner, W. J. Dally, B. Khailany, P. R. Mattson, U. K. Kapasi, and J. D. Owens. Register or-
ganization for media processing. In International Symposium on High Performance Computer
Architecture, pages 375–386, 2000.
M. Sami, D. Sciuto, C. Silvano, V. Zaccaria, and R. Zafalon. Low-power data forwarding for VLIW embedded architectures. IEEE Transactions on VLSI Systems, 10(5):614–622, 2002.
R. Sangireddy. Register organization for enhanced on-chip parallelism. In International Conference
on Application-specific Systems, Architectures and Processors (ASAP), 2004.
R. Sangireddy. Register port complexity reduction in wide-issue processors with selective instruc-
tion execution. Microprocessors and Microsystems., 31(1):51–62, 2007.
R. Schreiber, S. Aditya, S. Mahlke, V. Kathail, B. Rau, D. Cronquist, and M. Sivaraman. PICO-NPA: High-level synthesis of nonprogrammable hardware accelerators. The Journal of VLSI Signal Processing, 31(2), 2002.
N. Seshan. High VelociTI processing. IEEE Signal Processing Magazine, 15(2):88–101, 1998.
T. Shiota, K. Kawasaki, Y. Kawabe, W. Shibamoto, A. Sato, T. Hashimoto, F. Hayakawa, S. Tago,
H. Okano, Y. Nakamura, H. Miyake, A. Suga, and H. Takahashi. A 51.2 GOPS 1.0 GB/s-DMA
single-chip multi-processor integrating quadruple 8-way VLIW processors. In IEEE Interna-
tional Solid-State Circuits Conference, volume 1, pages 194 –593, Oct. 2005.
A. Shrivastava, N. Dutt, A. Nicolau, and E. Earlie. PBExplore: A framework for compiler-in-the-
loop exploration of partial bypassing in embedded processors. In DATE ’05: Proceedings of the
conference on Design, Automation and Test in Europe, pages 1264–1269, 2005.
S. Sirsi and A. Aggarwal. Exploring the limits of port reduction in centralized register files. In 22nd
International Conference on VLSI Design and Embedded system, pages 535–540, Jan. 2009.
K. Skadron, M. R. Stan, K. Sankaranarayanan, W. Huang, S. Velusamy, and D. Tarjan. Temperature-
aware microarchitecture: Modeling and implementation. ACM Trans. Archit. Code Optim., 1(1):
94–125, 2004.
H.-J. Stolberg, M. Berekovic, S. Moch, L. Friebe, M. B. Kulaczewski, S. Flugel, H. Klußmann, A. Dehnhardt, and P. Pirsch. HiBRID-SoC: A multi-core SoC architecture for multimedia signal processing. Journal of VLSI Signal Processing, 41(1):9–20, August 2005.
D. Tarjan, S. Thoziyoor, and N. P. Jouppi. CACTI 4.0. Technical Report HPL-2006-86, HP Labs,
2006.
M. Taylor, J. Kim, J. Miller, F. Ghodrat, B. Greenwald, P. Johnson, W. Lee, A. Ma, N. Shnidman, V. Strumpen, D. Wentzlaff, M. Frank, S. Amarasinghe, and A. Agarwal. The Raw processor: a scalable 32-bit fabric for embedded and general purpose computing. In Proceedings of Hot Chips, August 2001.
A. Terechko, M. Garg, and H. Corporaal. Evaluation of speed and area of clustered VLIW processors. In International Conference on VLSI Design, 2005.
J. Tseng and K. Asanovic. Banked multiported register files for high-frequency superscalar microprocessors. In Proceedings of the 30th International Symposium on Computer Architecture, pages 62–71, June 2003.
J. W. van de Waerdt, S. Vassiliadis, S. Das, S. Mirolo, C. Yen, B. Zhong, C. Basto, J. P. van Itegem,
D. Amirtharaj, K. Kalra, P. Rodriguez, and H. van Antwerpen. The TM3270 media-processor. In
International Conference on Microarchitecture, pages 331–342, 2005.
J. T. J. van Eijndhoven, F. W. Sijstermans, K. A. Vissers, E. Pol, M. I. A. Tromp, P. Struik, R. Bloks,
P. van der Wolf, A. Pimentel, and H. Vranken. TriMedia CPU64 architecture. In International
Conference on Computer Design, pages 586–592, 1999.
S. J. E. Wilton and N. P. Jouppi. CACTI: an enhanced cache access and cycle time model. IEEE
Journal of Solid State Circuits, 31:677–688, 1996.
J. Yan and W. Zhang. Exploiting virtual registers to reduce pressure on real registers. ACM Trans.
Archit. Code Optim., 4(4):1–18, 2008.
K. Yeager. The Mips R10000 superscalar microprocessor. Micro, IEEE, 16(2):28–41, Apr 1996.
J. Zalamea, J. Llosa, E. Ayguade, and M. Valero. Two-level hierarchical register file organization for
VLIW processors. In Proceedings. 33rd Annual International Symposium on Microarchitecture,
pages 137–146, 2000.
H. Zhong, K. Fan, S. Mahlke, and M. Schlansker. A distributed control path architecture for VLIW
processors. In International Conference on Parallel Architectures and Compilation Techniques
(PACT), 2005.
V. Zyuban and P. Kogge. The energy complexity of register files. In International Symposium on
Low Power Electronics and Design, pages 305–310, 1998.
List of Publications
• Neeraj Goel, Anshul Kumar, Preeti Ranjan Panda. Power Reduction in VLIW Processor with Compiler Driven Bypass Network. International Conference on VLSI Design and Embedded Systems, pages 233–238, 2007.
• Neeraj Goel, Anshul Kumar, Preeti Ranjan Panda. Shared Port Register File Architecture for
Low Energy VLIW Processors. Under submission.
• Neeraj Goel, Anshul Kumar, Preeti Ranjan Panda. Low Energy and Scalable VLIW Processor with Two Level Register File. Under submission.
Brief Bio-data
Neeraj Goel received his B.Tech. degree in Electronics and Communication from NIT Kurukshetra in 2002 and his M.Tech. in VLSI Design Tools and Technology from IIT Delhi in 2004. His broad research interests include embedded processors (such as VLIWs) along with their tools and compilers, FPGAs, and reconfigurable computing.