(DFP) unit
Advisor: Yifat Manzor
October 2011
Abstract:
Over the past few years, there has been a growing interest in the
development of
Decimal Floating Point Units (DFPU) due to precision and timing
constraints. We
present a summary of the IEEE 754-2008 standard, with which our design
must comply.
We then present our design for a DFPU Framework available for
expansion, which
performs basic DFP operations, the most complex of which are
Addition/Subtraction.
These operations were designed with "High Performance over Silicon
Cost" in mind.
Algorithmic simulation schemes are presented as well as our
low-level design
verification process. Finally, our synthesis results are
presented.
Introduction
It is our pleasure to present the following project book. This
project book is the
product of a year's work of research and development.
The Research was performed using an array of tools. We employed the
academic
knowledge gained throughout our degree in a variety of courses in
several fields:
Logic Design - Digital Logic Circuitry.
Architecture - Computer Architecture, MicroComputer and Assembly
Language.
Arithmetic - Computer Arithmetic Algorithms.
We would be remiss if we did not mention the wealth of
information gained from
taking advantage of Prof. Mike Cowlishaw's 'Speleotrove' website.
1
Software tools used included: MATLAB, Cadence Simvision, Xilinx ISE
Design
Suite 13.2.
Hardware implementation was achieved on Virtex®-6 FPGA ML605
Evaluation
Board.
For verification purposes we used our own Test Bench, IBM FPgen
Testing Suite 2,3
and made use of Prof. Mike Cowlishaw's test vectors.
Ariel Burg
Hillel Rosensweig
2.2. Infinity and NaNs
4. Simulation
4.3. Exponent Comparator Simulation
4.4. Full Path Simulation
5. Implementation
6.4. Pipeline Hazards
7.1. Verification Properties
8. Synthesis
Floating Point (DFP) unit
1. Overview
1.1 Purpose
The objective of this project is to design, implement and test a
decimal floating-point
arithmetic unit, based on formats and methods for floating-point
arithmetic as
specified in IEEE 754-2008 standard.
1.2 Motivation
Currently, most arithmetic hardware units perform operations on
numbers in binary
format. As the most basic memory unit ('bit') is itself binary,
binary arithmetic
implementations would be the natural and intuitive choice. Despite
this, drawbacks of
binary arithmetic implementations have created renewed interest in
developing
arithmetic units, capable of performing operations on numbers set
in a decimal
format:
Speed: Unlike computers, users prefer decimal notation to binary
notation. In certain applications the need for decimal-to-binary
and binary-to-decimal conversions is so great that decimal operations
require 50%-90% of processor time. A system with a direct decimal
representation
and hardware support would save this overhead.
Accuracy: With floating-point numbers in binary format, accuracy
problems
prevail. For instance, the decimal term 0.1 has no finite binary
representation:
Dec: 0.1 = Bin: 0.0001100110011....
Because storage is finite, a 32 bit representation must cut off the
infinite expansion and round it, leading to accuracy errors.
For instance, the following
C program (compiled with Visual C++):
for (i=0.1; i<0.5; i=i+0.1) printf ("%f\n",100000000*i);
will print out:
Similarly, using C (compiled with Visual C++), the following two
loops will not
run the same amount of iterations due to rounding errors:
The loop:
printf ("%f\n",num);
printf ("%f\n",num);
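The effect can be reproduced with a minimal Python sketch. It is not the original C program; it only contrasts binary `float` arithmetic with Python's software `decimal` module, and the helper name `count_steps` is hypothetical.

```python
from decimal import Decimal

def count_steps(start, stop, step):
    """Count iterations of a loop of the form: for (x = start; x <= stop; x += step)."""
    steps = 0
    x = start
    while x <= stop:
        steps += 1
        x = x + step
    return steps

# Binary floating point: 0.1 + 0.2 lands slightly above 0.3, so the loop stops early.
binary_steps = count_steps(0.1, 0.3, 0.1)

# Decimal arithmetic: 0.1, 0.2 and 0.3 are represented exactly, so the loop runs three times.
decimal_steps = count_steps(Decimal("0.1"), Decimal("0.3"), Decimal("0.1"))

print(binary_steps, decimal_steps)  # prints: 2 3
```

The same loop bounds give a different iteration count depending only on the radix of the number format.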
One important example of the implications of such an error occurred
in the Gulf
War:
On February 25, 1991, during the Gulf War, an American Patriot
Missile battery
in Dhahran, Saudi Arabia, failed to track and intercept an incoming
Iraqi Scud
missile. Specifically, the time in tenths of second as measured by
the system's
internal clock was multiplied by 1/10 to produce the time in
seconds. This
calculation was performed using a 24 bit fixed point register. In
particular, the
value 1/10, which has a non-terminating binary expansion, was
chopped at 24 bits
after the radix point. The small chopping error, when multiplied by
the large
number giving the time in tenths of a second, led to a significant
error, and
consequently caused severe damage and human casualties.
Uniformity: Today's computation with floating-point numbers might
yield diverse
results in different processors.
Proposed solution
The proposed solution is to encode data in a decimal format: a
format which would
give a distinct representation for each digit, separate from other
digits. The format is
based on IEEE 754-2008 standard. This solution would solve all the
previously
mentioned problems:
Speed: Each digit has a unique representation in the encoding
scheme, so that
converting numbers to computer code becomes simply a direct
translation using
tables instead of a costly conversion between bases.
Accuracy: Each digit has a unique representation, therefore any
number that can
be expressed visually (within a given accuracy) by the user can be
expressed in
the decimal encoding precisely.
Uniformity: Using the format and methods specified in IEEE 754-2008
standard,
results of computation will be identical, independent of
implementation, given the
same input data. Errors, and error conditions, in the mathematical
processing will
be reported in a consistent manner regardless of implementation.
The arithmetic
unit designed in this project complies with IEEE754-2008
specifications for
decimal floating-point. Therefore, computations done in the
arithmetic unit will
yield the same results as in any other implementation which
complies with the
IEEE 754-2008 standard.
Note: because 10 is not a prime number but the product of 2 and 5,
working in base 10 provides a wider range of finitely representable
fractions than working in base 2, where any fraction whose denominator
contains a factor of 5 (for example 1/5 or 1/10) has no finite
representation.
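The note can be checked with a short sketch. It relies on the general fact that a fraction in lowest terms has a finite expansion in a given base exactly when every prime factor of its denominator divides the base; the function name `has_finite_expansion` is hypothetical.

```python
from fractions import Fraction
from math import gcd

def has_finite_expansion(value, base):
    """Return True if the fraction `value` has a finite digit expansion in `base`."""
    denominator = Fraction(value).denominator  # Fraction keeps lowest terms
    # Repeatedly strip the factors that the denominator shares with the base.
    common = gcd(denominator, base)
    while common > 1:
        while denominator % common == 0:
            denominator //= common
        common = gcd(denominator, base)
    # Whatever is left has prime factors that the base does not have.
    return denominator == 1

print(has_finite_expansion(Fraction(1, 10), 2))   # False: 1/10 needs the factor 5
print(has_finite_expansion(Fraction(1, 10), 10))  # True
```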
For a detailed summary of decimal floating-point history solutions
see Appendix A –
DFP History.
The IEEE 754-2008 standard specifies two decimal floating-point
formats:
DEC128 - uses 128 bits for representation.
DEC64 - uses 64 bits for representation.
Although the DEC128 format provides a better precision, using the
DEC64 format
will gain more speed and provide a sufficient precision level for
various applications.
Therefore, for this project the chosen format is DEC64.
A general structure of a decimal floating-point number is:
$$r = (S,\ Exp - bias,\ Sig), \qquad v = (-1)^{S} \times 10^{Exp - bias} \times Sig$$
where S is the Sign of the number, Exp (Exponent) is the integer
power to which the
radix (10) is raised, and Sig (Significand) - the digits that
comprise the significant
portion of the number. These are the three elements needed to
construct a floating
point number. Figure 2.1 shows the 64 bit format for a decimal
floating point number.
S (sign)   G (combination field)   T (trailing significand field)
1 bit      13 bits (MSB..LSB)      50 bits = 5 declets (MSB..LSB)
Figure 2.1. DEC64 format for a 64 bit decimal floating-point
number.
As figure 2.1 shows, each 64 bit operand is built of 3
fields:
S - Sign bit
G - Combination Field
T - Trailing Significand Field.
The three elements which construct a floating point number are
encoded in (S,G,T)
fields:
Sign
The Sign Field in the format represents the sign of the number
(sign = (-1)^S)
Exponent
The exponent is one of the elements which comprise the Combination
Field.
The Exponent is 10 bit long in the range
[emin,emax]=[-383,384].
The Exponent is biased so that it will be represented with positive
values.
Therefore the bias is 383 and the biased Exponent range is
[0,767].
The Exponent is encoded entirely in the Combination Field.
Significand
The Significand's precision is 16 digits. In its decoded form, it
is 64 bits long
(BCD representation). In its coded form, it is split into 15 digits
encoded in the
field T (50 bits = 5x10 = 5 declets), and an MSD (most significant
digit)
encoded in the combination field (G).
The format also supports representation of ±∞ and NaN (Not a
Number).
Decoding and Encoding the Combination Field
The combination field is encoded/decoded by using the 5 MSBs
(G0...G4). These five
bits hold the status of the number (Inf, NaN or finite) as well as
the Significand's
MSD and the two MSBs of the exponent (for finite numbers). The
remaining 8 bits
hold the remainder of the exponent. Encoding/Decoding of the
Combination field is
described in Table 2.1.
G0 G1 G2 G3 G4   Type       Exponent MSBs   Significand MSD
a  b  c  d  e    Finite     a b             0 c d e
1  1  c  d  e    Finite     c d             1 0 0 e
1  1  1  1  0    Infinity   - -             - - - -
1  1  1  1  1    NaN        - -             - - - -
Table 2.1. The first five bits of the Combination Field indicate
the
type of the number, the Significand's MSD and the 2 MSBs of
the
exponent (for finite numbers).
G5 differentiates between quiet NaNs (qNaN) and signaling NaNs
(sNaN),
where Signaling NaNs signal uninitialized variables and
arithmetic
enhancements that are not in the scope of the standard. Quiet NaNs
afford
retrospective diagnostic information inherited from invalid
operations.
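Infinity:
$$G_0G_1G_2G_3G_4 = 11110 \;\Rightarrow\; r = v = (-1)^{S} \times (+\infty)$$
Finite numbers:
$$G_0G_1G_2G_3G_4 = 0XXXX \text{ or } 1XXXX \text{ (other than } 1111X\text{)} \;\Rightarrow\; r = (S,\ E - bias,\ C), \qquad v = (-1)^{S} \times 10^{E - bias} \times C$$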
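The field layout and the combination-field rules can be sketched in a few lines of Python. This is an illustration only, not the Translate unit of the design: the function name `decode_combination` and the dictionary keys are hypothetical, G0 is assumed to be the most significant bit of G, and the biased exponent is returned as stored, without subtracting any bias.

```python
def decode_combination(word):
    """Split a 64-bit DEC64 word into sign, kind, biased exponent and MSD."""
    sign = (word >> 63) & 1
    g = (word >> 50) & 0x1FFF        # 13-bit combination field; G0 is its MSB
    t = word & ((1 << 50) - 1)       # 50-bit trailing significand field
    top5 = g >> 8                    # G0..G4
    low8 = g & 0xFF                  # remaining 8 exponent bits
    if top5 == 0b11111:
        # G5 distinguishes signaling from quiet NaNs.
        kind = "sNaN" if (g >> 7) & 1 else "qNaN"
        return {"sign": sign, "kind": kind, "T": t}
    if top5 == 0b11110:
        return {"sign": sign, "kind": "Infinity", "T": t}
    if (top5 >> 3) != 0b11:
        # G0G1 = ab are the exponent MSBs, MSD = 0cde (digits 0..7).
        exp_msbs, msd = top5 >> 3, top5 & 0b111
    else:
        # G0G1 = 11, G2G3 = cd are the exponent MSBs, MSD = 100e (digits 8..9).
        exp_msbs, msd = (top5 >> 1) & 0b11, 8 | (top5 & 1)
    return {"sign": sign, "kind": "finite",
            "biased_exponent": (exp_msbs << 8) | low8, "msd": msd, "T": t}

example = decode_combination(0x2238000000000001)
print(example["kind"], example["biased_exponent"], example["msd"])
```

Keeping the two exponent MSBs and the leading digit in five shared bits is what lets a 10-bit exponent and a 16-digit significand fit in 64 bits.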
Densely-Packed Decimal (DPD)
In order to allow for Decimal representations of numbers without
adding a memory
overhead to the implementation, significands are stored in a
Densely Packed Decimal
format.
Using DPD coding takes advantage of the BCD representation
redundancy.
Decoding 10-bit densely-packed decimal to 3 decimal digits
Decoding a Densely Packed Decimal declet is performed according to
Table 2.2:
____________________________________________________________________
b(9) b(8) b(7) b(6) b(5) b(4) b(3) b(2) b(1) b(0)
 1    0    1    1    0    0    1    1    0    1
We use the appropriate table entry:
b(4) b(3) b(8) b(7) b(6)
 0    1    0    1    1
Therefore:
d(1) = 8+b(2) = 8+1 = 9
d(2) = 4*b(3)+2*b(4)+b(5) = 4*1+2*0+0 = 4
d(3) = 4*b(0)+2*b(1)+b(9) = 4*1+2*0+1 = 5
Therefore the decoded number is 945.
_____________________________________________________________________
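The decoding rule can be sketched as follows. The sketch uses the standard Densely Packed Decimal decoding table and assumes, consistently with the example, that b(0) is the most significant bit of the declet and d(1) the most significant digit. The function name `dpd_decode` is hypothetical.

```python
def dpd_decode(declet):
    """Decode a 10-bit DPD declet (b(0) = MSB) into a number from 0 to 999."""
    # p..y correspond to b(0)..b(9); v = b(6) is the indicator bit.
    p, q, r, s, t, u, v, w, x, y = [(declet >> (9 - n)) & 1 for n in range(10)]
    if v == 0:                      # three small digits (0..7)
        d1, d2, d3 = 4*p + 2*q + r, 4*s + 2*t + u, 4*w + 2*x + y
    elif (w, x) == (0, 0):          # only d3 is large (8..9)
        d1, d2, d3 = 4*p + 2*q + r, 4*s + 2*t + u, 8 + y
    elif (w, x) == (0, 1):          # only d2 is large
        d1, d2, d3 = 4*p + 2*q + r, 8 + u, 4*s + 2*t + y
    elif (w, x) == (1, 0):          # only d1 is large
        d1, d2, d3 = 8 + r, 4*s + 2*t + u, 4*p + 2*q + y
    elif (s, t) == (0, 0):          # d1 and d2 are large
        d1, d2, d3 = 8 + r, 8 + u, 4*p + 2*q + y
    elif (s, t) == (0, 1):          # d1 and d3 are large
        d1, d2, d3 = 8 + r, 4*p + 2*q + u, 8 + y
    elif (s, t) == (1, 0):          # d2 and d3 are large
        d1, d2, d3 = 4*p + 2*q + r, 8 + u, 8 + y
    else:                           # all three digits are large
        d1, d2, d3 = 8 + r, 8 + u, 8 + y
    return 100*d1 + 10*d2 + d3

print(dpd_decode(0b1011001101))  # prints: 945
```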
Encoding 3 decimal digits to 10-bit Densely-Packed Decimal
Encoding Decimal numbers in Densely Packed Decimal format is done
using
Table 2.3.
_____________________________________________________________________
      d(3)        d(2)        d(1)
bit:  3 2 1 0     3 2 1 0     3 2 1 0
      1 1 0 0     0 0 0 1     0 1 1 0
d(1,0) = 0 ; d(2,0) = 1 ; d(3,0) = 0
We use the appropriate table entry:
Bits 1,2,3 in d(1) are 110, therefore b(0)b(1)b(2) = 110
Bits 1,2 in d(3) are 01, therefore b(3)b(4) = 01
Bit 3 in d(2) is 0, therefore b(5)b(6)b(7)b(8) = 0101
Bit 3 in d(3) is 1, therefore b(9) is 1.
The final encoding is:
b(0)b(1)...b(9) = 1100101011
_____________________________________________________________________
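The encoding direction can be sketched in the same way. The sketch uses the standard Densely Packed Decimal encoding table and assumes that bit 0 of each digit is its most significant bit and that d(1) is the most significant digit, so the digits of the example read 6, 8, 3. The function name `dpd_encode` is hypothetical.

```python
def dpd_encode(d1, d2, d3):
    """Pack three decimal digits (d1 most significant) into a 10-bit declet."""
    # BCD bits of each digit, most significant bit first.
    a, b, c, d = [(d1 >> n) & 1 for n in (3, 2, 1, 0)]
    e, f, g, h = [(d2 >> n) & 1 for n in (3, 2, 1, 0)]
    i, j, k, m = [(d3 >> n) & 1 for n in (3, 2, 1, 0)]
    # The row is selected by the MSB of each digit (1 means the digit is 8 or 9).
    rows = {
        (0, 0, 0): (b, c, d, f, g, h, 0, j, k, m),
        (0, 0, 1): (b, c, d, f, g, h, 1, 0, 0, m),
        (0, 1, 0): (b, c, d, j, k, h, 1, 0, 1, m),
        (0, 1, 1): (b, c, d, 1, 0, h, 1, 1, 1, m),
        (1, 0, 0): (j, k, d, f, g, h, 1, 1, 0, m),
        (1, 0, 1): (f, g, d, 0, 1, h, 1, 1, 1, m),
        (1, 1, 0): (j, k, d, 0, 0, h, 1, 1, 1, m),
        (1, 1, 1): (0, 0, d, 1, 1, h, 1, 1, 1, m),
    }
    declet = 0
    for bit in rows[(a, e, i)]:      # b(0) first, b(9) last
        declet = (declet << 1) | bit
    return declet

print(format(dpd_encode(6, 8, 3), "010b"))
```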
Note: using DPD (Densely Packed Decimal) coding, 15 BCD digits (60
bits) are
packed into 50 bits in field T, taking advantage of the BCD
representation redundancy.
2.2 Infinity and NaNs
There are two different kinds of NaN, Signaling and Quiet.
Signaling NaNs (sNaN) represent uninitialized variables and other
unique
situations.
Quiet NaNs (qNaN) supply diagnostic information inherited from
invalid or
unavailable data and results.
qNaN Propagation
To allow propagation of the diagnostic information, as much
information as possible
should be preserved in NaN results of operations. In other words,
operations
performed on NaNs should preserve in the result as much of the
original NaN operand
as possible.
If two or more inputs are NaN, then the payload of the resulting
NaN should be
identical to the payload of one of the input NaNs if representable
in the destination
format. The standard does not specify which of the input NaNs will
provide the
payload.
qNaN Generation
In general, operations that signal an invalid operation exception
(see Para. 2.3) shall
generate a quiet NaN.
Infinity
The approach to infinities in floating-point arithmetic is
tied to the approach to
Overflow (see Para. 2.3). In general, an Overflow in the result
will raise an OVF
flag and the result will be coded as Infinity.
Operations on infinite operands usually don't signal exceptions and
return an Infinite
result (for infinite coding, see IEEE Standards section). This
applies to the following
operations:
Addition(∞, x), Addition(x, ∞), Subtraction(∞, x), or
Subtraction(x, ∞), for finite x.
The exceptions that do pertain to infinities are signaled (see
Para. 2.3) only when:
∞ is an invalid operand (in certain operations).
∞ is created from finite operands by overflow.
Subtraction of infinities, such as: Addition(+∞, −∞).
2.3 Exception Handling
Invalid operation 7.2.0
The invalid operation exception is signaled if and only if the
arithmetic operation
provides no useful result. The default result of an operation that
signals the invalid
operation exception shall be a quiet NaN that should provide some
diagnostic
information (see Para.2.2).
Addition or Subtraction of infinities, such as: Addition(+∞,
−∞).
Overflow 7.4.0
The overflow exception is signaled if and only if the result
format’s largest finite
number is exceeded in magnitude by what would have been the rounded
floating-
point result were the exponent range unbounded. The default result
shall be
determined by the rounding-direction attribute and the sign of the
intermediate result.
Specifically, in accordance with the DFPU rounding scheme -
roundTiesToEven - all
overflows are rounded to ∞ with the sign of the intermediate
result. In addition, under
default exception handling for overflow, the overflow flag shall be
raised and the
inexact exception shall be signaled.
Inexact 7.6.0
Unless stated otherwise, if the rounded result of an operation is
inexact - that is, it
differs from what would have been computed were both exponent range
and precision
unbounded - then the inexact exception shall be signaled. The
rounded or overflowed
result shall be delivered to the destination.
Note: underflow, divide by zero exceptions are included in the
standard, but were not
fully implemented in the current design as they are not necessary
in this context.
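The three exceptions can be observed in software with Python's `decimal` module, which is an independent implementation of decimal arithmetic and not the DFPU described here. The context parameters below (16 digits, exponent range [-383, 384], ties-to-even, no traps) are chosen to mirror the values quoted in this chapter, and the variable names are illustrative only.

```python
from decimal import (Context, Decimal, ROUND_HALF_EVEN,
                     InvalidOperation, Overflow, Inexact)

# With no traps enabled, an exception only raises its flag and the
# default result is returned instead of a Python error.
ctx = Context(prec=16, Emin=-383, Emax=384, rounding=ROUND_HALF_EVEN, traps=[])

# Invalid operation: Addition(+Inf, -Inf) yields a quiet NaN.
inf = Decimal("Infinity")
invalid_result = ctx.add(inf, -inf)
invalid_flag = bool(ctx.flags[InvalidOperation])

# Overflow: the rounded result exceeds the largest finite number.
ctx.clear_flags()
largest = Decimal("9.999999999999999E+384")
overflow_result = ctx.add(largest, largest)
overflow_flags = (bool(ctx.flags[Overflow]), bool(ctx.flags[Inexact]))

# Inexact: the exact sum needs more than 16 digits.
ctx.clear_flags()
inexact_result = ctx.add(Decimal(1), Decimal("1E-20"))
inexact_flag = bool(ctx.flags[Inexact])

print(invalid_result, overflow_result, inexact_result)
```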
2.4 Normalizing & Rounding
When executing an instruction, the result operand should be
represented in a
normalized form, i.e. with no leading zeros.
Using the normalized form simplifies the comparison of two Decimal
Floating Point
operands. A normalized form allows finite operands (≠0) to have a
unique
representation, which is helpful for comparison: a larger exponent
indicates a larger
operand and significands should be compared only in case of equal
exponents.
Note: In case of comparison with 0 one should only check if
sig==0.
There are three possible Normalization scenarios in Addition and
Subtraction:
1. Significand ≥ 10, therefore the significand should be shifted to
the right and the
exponent should be increased by one (possible in case of
Addition).
2. 1 ≤ significand < 10. No shifting needed (possible in case of
Addition or
Subtraction).
3. Significand < 1, therefore the significand should be shifted
to the left and the
exponent should be decreased as long as there are leading zeros
(possible in case
of Subtraction).
The first case may lead to overflow, since increasing the exponent
may push it
above the maximum exponent for a finite number.
The third case may lead to underflow, since decreasing the exponent
may push it
below the minimum exponent for a finite number.
Shifting is done using a Barrel Shifter, which speeds up the
operation.
Rounding is done using roundTiesToEven attribute: The
floating-point number
nearest to the infinitely precise result shall be delivered; if the
two nearest floating-
point numbers bracketing an unrepresentable infinitely precise
result are equally near,
the one with an even least significant digit shall be
delivered.
Choosing this attribute avoids a systematic rounding bias: ties are not always rounded in the same direction, so the average rounding error tends to zero.
_____________________________________________________________________
Example 2.3.
If the exact result significand is 1.23456789012345678 (precision
is p=16
digits), then the returned significand should be
1.234567890123457
Example 2.4.
If the exact result significand is 1.23456789012345650, then the
returned
significand should be 1.234567890123456
If the exact result significand is 1.23456789012345651, then the
returned
significand should be 1.234567890123457
G - Guard digit
R - Round digit
S - Sticky digit
If (R>5) or (R=5 and S≠0) or (R=5 and S=0 and LSD=odd number)
then the
significand is increased by 1, as can be seen in example 1,
3.
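The rounding condition translates directly into a short sketch. The function name `round_ties_to_even` is hypothetical, the significand is treated as a 16-digit integer, and S only needs to be zero or non-zero.

```python
def round_ties_to_even(sig, r, s):
    """Apply the R/S rule to an integer significand.

    sig: the digits that are kept, r: round digit (0..9),
    s: sticky digit (zero or non-zero).
    """
    lsd_is_odd = sig % 2 == 1          # parity of the least significant digit
    round_up = (r > 5
                or (r == 5 and s != 0)
                or (r == 5 and s == 0 and lsd_is_odd))
    return sig + 1 if round_up else sig

# Example 2.3: 1.23456789012345678 -> kept digits ...456, R=7, S=8
print(round_ties_to_even(1234567890123456, 7, 8))
# Example 2.4: 1.23456789012345650 is a tie and the LSD (6) is even
print(round_ties_to_even(1234567890123456, 5, 0))
```

If the increment carries out of the sixteenth digit (for example from 9999999999999999), the result needs one more normalization step.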
The Sticky Digit serves as a tie-breaker in the roundTiesToEven
attribute.
The role of the Guard Digit is to guard against loss of information
in case of post-
normalization (Scenario 2), as explained in the next proof.
Three Rounding Digits are sufficient when using roundTiesToEven
attribute.
_____________________________________________________________________
Proof: consider the three possible Normalization scenarios
mentioned above:
Case 1: In the worst case of this scenario the exponent difference
of the original
operands is 1 (see Para.3.3), i.e. one shift on pre-alignment, so
that there is a carry
out.
exponent difference = 1.
Post-normalization 1000000000000000 5 1
Therefore two extra digits are needed for rounding.
Case 2: No shifting is done. Therefore there is no need of rounding
digits.
Case 3: The significand is shifted to the left and the exponent is
decreased as long
as there are leading zeros. Let us concentrate on two possible
cases in this
scenario:
o The subtrahend is shifted more than one position to the right
(pre-
alignment).The difference has at most one leading zero => at
most one
shifted-out digit required for post-normalization.
Sticky Digit = 0 if all the rightmost shifted digits starting from
the 19th
place are zero. If at least one of them is bigger than zero then
Sticky Digit
= 1.
For example: sigA = 1000000000000000, sigB =
9999999999994002,
exponent difference = 5. sigB is shifted 5 positions to the
right.
=> Digits in 19 th
                   significand        G R S
A-B                0999900000000000   0 5 9
postnormalization  9999000000000000     5 9
rounding           9999000000000001
Note: The Sticky digit participates in subtraction only to generate
borrow.
After subtracting the aligned operands, the true value of the
rightmost result
digit is not important. What matters is if it is zero or not.
After post-normalization the Guard Digit serves as the Round Digit
and
Round Digit serves as the Sticky Digit.
Therefore three extra digits are needed for rounding.
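The worked example can be cross-checked in software. The sketch below uses Python's `decimal` module with a 16-digit, ties-to-even context; it reads sigA and sigB as 1.000000000000000 and 9.999999999994002 with an exponent difference of 5, which is an interpretation of the example and not part of the hardware design.

```python
from decimal import Context, Decimal, ROUND_HALF_EVEN

a = Decimal("1.000000000000000")        # sigA with exponent 0
b = Decimal("9.999999999994002E-5")     # sigB with an exponent 5 lower

# A wide context gives the exact difference, with no rounding.
exact = Context(prec=40).subtract(a, b)

# A 16-digit context applies roundTiesToEven to that difference.
rounded = Context(prec=16, rounding=ROUND_HALF_EVEN).subtract(a, b)

print(exact)    # digits beyond the 16th significant digit are 5998
print(rounded)
```

The exact difference continues with the digits 5998 after the sixteenth significant digit, which is why the rounded significand ends in 1.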
o The subtrahend is shifted up to one position to the right
(pre-alignment); at
most one digit is pre-aligned out of the 16 digit range.
For example: sigA = 1200000000000000, sigB =
1000000000000004,
exponent difference = 1.
Therefore one extra digit is needed for rounding.
_____________________________________________________________________
After Normalizing and Rounding the result, another
post-normalization may be
needed (in case rounding leads to significand ≥ 10). Therefore
another Normalizing
component is set after the result is rounded.
3.1 General Operation Scheme
The general operation of the system is described in Figure
3.1.
Figure 3.1. General operation of the system. Shows the progress of
a command.
A Designated Compiler translates DFPU commands into the correct 74 bit
format. It also
translates data (operands) to DEC64 format and creates DFPU
instructions for data
transfer into the DFPU register file. These commands are sent to
the CPU as payload
for a Load Word operation, which writes the commands to a
designated memory
segment in the RAM.
Upon writing DFPU commands in the designated memory segment, the
CPU
commands the DMAC (Direct Memory Access Controller) to load the
DFPU
commands to the internal DFPU memory. The CPU sends a 'go' signal
to the DFPU
(see Fig. 3.2) and the DFPU subsequently begins reading the
internal memory and
processing commands. Another form of communication from CPU to DFPU
is
through Interrupt request (see Fig. 3.2).
Upon completion of running DFPU commands, and upon certain
exception
occurrence (see Para. 2.3), an exception notice is sent to the
CPU.
Note: the Designated Compiler delivers numbers in a normalized
form.
3.2 Interface
The DFPU (Decimal Floating Point Unit) serves as a peripheral
computation unit.
Its interface includes four input signals (nrst, clk, go,
interrupt) and one output signal
(Exception). Figure 3.2 describes the DFPU interface.
Figure 3.2. The DFPU interface.
nrst – reset signal (negative reset).
clk – unit clock signal.
go – CPU signal that starts DFPU command processing (see Para. 3.1).
interrupt – CPU signal to DFPU (e.g. soft reset).
Exception – DFPU feedback to CPU.
Exception signal is sent in the following cases:
Finished - DFPU completed performance of loaded tasks.
Invalid operation (see Para. 2.3).
Overflow (see Para. 2.3).
Underflow, Divide by Zero (see Para. 2.3, should be available in
future
designs - not necessary in this context).
Note: 'Inexact' signal does not raise interface exception flag, due
to the fact it is an
acceptable and regular condition.
Addition/Subtraction is the most complicated operation in the current
design, and its implementation covers the other, simpler operations
(negation, increment,
decrement) from both an arithmetic and an architectural point of view.
The Arithmetic
algorithm was therefore developed around it.
For any addition/subtraction of a pair of standardized decimal
operands A, B, the
following expansion is true:
$$A \pm B = sig_A \cdot 10^{Exp_A} \pm sig_B \cdot 10^{Exp_B} = 10^{Exp_A}\left(sig_A \pm sig_B \cdot 10^{-(Exp_A - Exp_B)}\right)$$
Since $sig_B \cdot 10^{-(Exp_A - Exp_B)}$ signifies $sig_B$ shifted by $Exp_A - Exp_B$
positions to the right (assume, without loss of generality, that
$Exp_A \ge Exp_B$), and
each significand has a limited precision, we can conclude that for
some operands,
where the exponent difference exceeds the significand precision,
addition/subtraction is
irrelevant. With all that in mind, an addition algorithm emerges
(Fig. 3.3).
The diagram in Figure 3.3 does not relate to the Sign bit in each
operand. The sign bit
is dealt with separately, and its main function is to determine the
type of operation
performed during Addition/Subtraction (example: subtraction of a
negative from a
positive is performed as addition).
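Adding/subtracting two signed operands gives:
$$A \pm B = (-1)^{s_A} sig_A \cdot 10^{Exp_A} \pm (-1)^{s_B} sig_B \cdot 10^{Exp_B} = (-1)^{s_A}\left(sig_A \cdot 10^{Exp_A} \pm (-1)^{s_A + s_B} sig_B \cdot 10^{Exp_B}\right)$$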
The actual type of operation carried out is decided by the original
operation code
(add/sub) and the signs of the operands.
Therefore, if 'add' operation is coded as 'op=0' and 'sub'
operation as 'op=1', the actual
operation can be derived:
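One common way to express this derivation, given here as an assumption and not as the exact expression used in the design, is the exclusive-or of the operation code and the two sign bits. The function name `effective_operation` is hypothetical.

```python
def effective_operation(op, sign_a, sign_b):
    """Return 0 if the magnitudes are added, 1 if they are subtracted.

    op: 0 for 'add', 1 for 'sub'; sign_a, sign_b: operand sign bits.
    """
    return op ^ sign_a ^ sign_b

# Subtracting a negative operand from a positive one is performed as addition.
print(effective_operation(op=1, sign_a=0, sign_b=1))  # prints: 0
```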
3.4 Data Path
In essence, the datapath manages the three elements - Sign,
Significand and Exponent
- using separate paths with some interaction between them:
Sign - the result sign is dependent on the input operand Signs, the
type of
operation performed, and the Sign of the result of the
significand
addition/subtraction.
normalizing.
Exponent - the result exponent is formed by choosing the larger
exponent and
revaluing according to the normalization.
Accordingly, the above algorithm can be divided into smaller
sub-algorithms, and
each one can be organized as a separate resource ('black
boxes'):
Program Counter - holds address of current instruction. Address
advances
with each clock cycle.
Translate - decode DEC64 operand to
(Sign,Exponent,Significand).
Exponent Comparator - compare operand exponents and return
exponent
difference, which exponent is bigger and its value.
Check Needed - check whether there is need for significand shifting
and
addition/subtraction (due to limited precision).
Right Shifter - Aligning one of the significands according to the
exponent
difference.
the significands.
Normalizer - adjusting the result significand and exponent values
to avoid
leading zeros in significand.
Inverse Translate - encode (Sign, Exponent, Significand) values into DEC64 operand.
Sign Decision - Conclude Result Sign according to the input operand
Signs,
the type of operation performed, and the Sign of the result of the
significand
addition/subtraction.
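Two of these resources, the Exponent Comparator and the Right Shifter, can be sketched as small functions. This is an illustrative model and not the MATLAB or Verilog code of the project: the function names, the integer representation of the significand and the use of a 0/1 sticky flag are assumptions.

```python
def exponent_compare(exp_a, exp_b):
    """Return (exponent difference, True if A's exponent is the bigger, bigger exponent)."""
    return abs(exp_a - exp_b), exp_a >= exp_b, max(exp_a, exp_b)

def right_shift(sig, positions, extra_digits=3):
    """Shift an integer significand right, keeping guard, round and sticky digits."""
    extended = sig * 10 ** extra_digits            # append room for G, R, S
    shifted, lost = divmod(extended, 10 ** positions)
    kept, grs = divmod(shifted, 10 ** extra_digits)
    guard, rest = divmod(grs, 100)
    rnd, sticky = divmod(rest, 10)
    if lost:
        # Any non-zero digit shifted beyond the sticky position sets it.
        sticky = sticky or 1
    return kept, guard, rnd, sticky

diff, a_is_bigger, exponent = exponent_compare(3, -2)
print(diff, a_is_bigger, exponent)
print(right_shift(9999999999994002, diff))
```

With sigB = 9999999999994002 and an exponent difference of 5, the kept digits are 99999999999, followed by G = 9, R = 4 and a non-zero sticky indication.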
Each of the above mentioned resources was built as a function in a
MATLAB script
for simulation (Chapter 4) and later implemented in a low-level
Verilog design
(Chapter 5).
As discussed in Chapter 6, an instruction is divided into four
stages, i.e. moving from
single-cycle datapath to a four-stage-pipelined datapath.
Therefore, the complete performance of an operation with a DFPU
involves the
following stages:
1. Instruction Fetch (IF): Retrieval of DFPU command.
2. Decode (D): Retrieval and translation of DEC64 Operands to
Sign,
Significand and Exponent fields.
3. Execution (E): Performing the arithmetic algorithm mentioned
above.
4. Write Back (WB): Result Sign, Significand and Exponent are
encoded into
DEC64 format and written to register file or result Memory.
These four stages are implemented as pipe stages. Further in this
design, Pipeline
Registers are set between each two stages in order to store
intermediate results.
3.5 Instruction Set Architecture (ISA)
Opcode length: 5 bits.
Arithmetic operations: add_r, add_m, sub_r, sub_m, inc_r, inc_m,
dec_r, dec_m,
neg.
Note: the number of bits allocated for opcode is bigger than
necessary in order to
enable future expansion of instruction set.
Arithmetic operations
opcode   ri       rj (optional)   rk (optional)   remaining bits
5 bits   5 bits   5 bits          5 bits          54 bits
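The 74 bit arithmetic format can be illustrated with a small packing sketch. The opcode values, the placement of the opcode in the most significant bits and the zero fill of the remaining 54 bits are all assumptions made for the illustration; the text above fixes only the field widths.

```python
WORD_BITS = 74
USED_BITS = 20                          # opcode + ri + rj + rk, 5 bits each
OPCODES = {"add_r": 0, "sub_r": 1}      # hypothetical opcode values

def pack(opcode, ri, rj=0, rk=0):
    """Pack the four 5-bit fields into a 74-bit instruction word."""
    for field in (opcode, ri, rj, rk):
        if not 0 <= field < 32:
            raise ValueError("each field is 5 bits wide")
    fields = (opcode << 15) | (ri << 10) | (rj << 5) | rk
    return fields << (WORD_BITS - USED_BITS)   # remaining 54 bits left as zero

def unpack(word):
    """Recover (opcode, ri, rj, rk) from a 74-bit instruction word."""
    fields = word >> (WORD_BITS - USED_BITS)
    return (fields >> 15) & 31, (fields >> 10) & 31, (fields >> 5) & 31, fields & 31

word = pack(OPCODES["add_r"], ri=3, rj=1, rk=2)
print(unpack(word))
```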
add_r:
o Operation Description: dual operand addition; result written to
Register
File.
o Command Format: add_r ri,rj,rk
o Actual operation: ri=rj+rk
o Datapath Description:
Decode: two operands are read from the Register File in the locations
set
by the indices rj,rk. These operands are translated to spread form.
The spread operands are saved in the pipeline register, as well as
the
result address and control signals.
Execute: Exponents are compared and significands are aligned
accordingly. Significands are added, and the addition result together
with
the bigger exponent derived from the Exponent Comparator go
through
normalization and rounding. The final sign is derived from
Sign
Decision. Result {Sign,Sig,Exp} are saved in the pipeline register, as
well
as the result address and control signals.
Write Back: finally, the result is converted to DEC64
format.
The final result is written to the Register File.
add_m:
o Operation Description: dual operand addition; result written to
Register
File and Result Memory.
o Command Format: add_m ri,rj,rk
o Actual operation: ri=rj+rk ; Mem[mem_addr]=rj+rk ;
mem_addr++
o Datapath Description: Identical to description of add_r, except
that the
result is written to both Result Memory and Register File.
sub_r:
o Operation Description: operands subtraction; result written to
Register
File.
o Command Format: sub_r ri,rj,rk
o Actual operation: ri=rj-rk
o Datapath Description:
Decode: two operands are read from the Register File in the locations
set
by the indices rj,rk. These operands are translated to spread form.
The spread operands are saved in the pipeline register, as well as
the
result address and control signals.
Execute: Exponents are compared and significands are aligned
accordingly. Significands are subtracted, and the subtraction
result
together with the bigger exponent derived from the Exponent
Comparator go through normalization and rounding. The final sign
is
derived from Sign Decision. Result {Sign,Sig,Exp} are saved
in
the pipeline register, as well as the result address and control
signals.
Write Back: finally, the result is converted to DEC64
format.
The final result is written to the Register File.
sub_m:
o Operation Description: operands subtraction; result written to
Register File
and Result Memory.
o Actual operation: ri=rj-rk ; Mem[mem_addr]=rj-rk ;
mem_addr++
o Datapath Description: Identical to description of sub_r, except
that the
result is written to both Result Memory and Register File.
inc_r:
o Operation Description: increase operand by one; result written to
Register
File.
o Actual operation: ri =ri+1
o Datapath Description: Identical to add_r, except that the second
operand
that is added is an artificially created constant whose value is
+1, and that
both source and destination register is ri.
inc_m:
o Operation Description: increase operand by one; result written to
Register
File and Result Memory.
o Command Format: inc_m ri
o Actual operation: ri= ri+1 ; Mem[mem_addr]= ri+1 ;
mem_addr++
o Datapath Description: Identical to inc_r, except that the result
is written to
both Result Memory and Register File.
dec_r:
o Operation Description: decrease operand by one; result written to
Register
File.
o Command Format: dec_r ri
o Actual operation: ri= ri-1
o Datapath Description: Identical to add_r, except that the second
operand
(the subtrahend) is artificially created to equal -1, and that both
source and
destination register is ri.
dec_m:
o Operation Description: decrease operand by one; result written to
Register
File and Result Memory.
o Actual operation: ri= ri-1 ; Mem[mem_addr]=ri-1 ;
mem_addr++
o Datapath Description: Identical to dec_r, except that the result
is written
to both Result Memory and Register File.
neg:
o Operation Description: change sign of register operand; result
written to
Register File.
o Datapath Description:
Decode: an operand is read from the Register File in location set
by
index of ri and is saved in pipeline register as DPD (Densely
Packed
Decimal) operand as well as result address and control
signals.
Execute: the DPD operand, result address and control signals
are saved
in the next pipeline register.
Write Back: the first bit of the DPD operand is complemented
and,
along with the rest of the DPD operand bits, is written to the
Register
File.
mov_i:
o Command Format: mov_i ri,imm
o Actual operation: ri=imm
o Datapath description:
Decode: Immediate data is saved directly into Pipeline register,
as
DPD operand as well as result address and control signals.
Execute: the DPD operand, result address and control signals
are
transferred to next pipeline register.
Write Back: the DPD operand is written back to register file
in
address mentioned by result address in write back pipeline
register.
Instruction format:
mov_r:
o Command Format: mov_r ri,rj
o Actual operation: ri=rj
o Datapath description:
Decode: an operand is read from the Register File in location set
by
index of rj and is saved in pipeline register as DPD operand as
well
as the result address (ri) and control signals.
Execute: the DPD operand, result address and control signals
are
transferred to next pipeline register.
Write Back: the DPD operand is written back to register file
in
address mentioned by result address in write back pipeline
register.
opcode ri immediate
Table 3.1 shows a summary of the ISA properties.
Format: opcode (5 bits), ri (5 bits), rj (optional, 5 bits), rk (optional, 5 bits), remaining 54 bits
Instruction   Command Format    Operation   Description
add_r         add_r ri,rj,rk    ri=rj+rk    Dual operand addition; result written to Register File
add_m         add_m ri,rj,rk    ri=rj+rk    Dual operand addition; result written to Register File and Result Memory
sub_r         sub_r ri,rj,rk    ri=rj-rk    Operands subtraction; result written to Register File
sub_m         sub_m ri,rj,rk    ri=rj-rk    Operands subtraction; result written to Register File and Result Memory
inc_r         inc_r ri          ri=ri+1     Increase operand by one; result written to Register File
inc_m         inc_m ri          ri=ri+1     Increase operand by one; result written to Register File and Result Memory
dec_r         dec_r ri          ri=ri-1     Decrease operand by one; result written to Register File
dec_m         dec_m ri          ri=ri-1     Decrease operand by one; result written to Register File and Result Memory
neg           neg ri            ri=-ri      Change sign of register operand; result written to Register File
Format: opcode (5 bits), ri (5 bits), immediate (64 bits)
mov_i         mov_i ri,imm      ri=imm      Transfer immediate value to register
Format: opcode (5 bits), ri (5 bits), rj (5 bits), remaining 59 bits
mov_r         mov_r ri,rj       ri=rj       Transfer one register's value to another
Table 3.1. Summary of the ISA properties.
The following section describes the construction of MATLAB
simulations matching the
arithmetic and encoding/decoding algorithms, and the tests run on
them in order to assess
their practical implementation.
The importance of such simulations is in the simple application of
the algorithms in a way
that mirrors a practical implementation. Similarly, tests run on
the simulations can reveal
flaws in the practical application of the algorithms.
4.2 Translate, Inverse Translate Simulation
Relevant standard sections (referring to Fig. 2.1):
"The representation r of the floating-point datum, and value v of
the floating-point datum represented, are inferred from the
constituent fields as follows:
a) If $G_0$ through $G_4$ are 11111, then v is NaN regardless of S.
Furthermore, if $G_5$ is 1, then r is sNaN; otherwise r is qNaN. The
remaining bits of G are ignored, and T constitutes the NaN's payload,
which can be used to distinguish various NaNs. The NaN payload is
encoded similarly to finite numbers described below, with G treated
as though all bits were zero. The payload corresponds to the
significand of finite numbers, interpreted as an integer with a
maximum value of $10^{(3 \times J)} - 1$, and the exponent field is
ignored (it is treated as if it were zero). A NaN is in its preferred
(canonical) representation if the bits $G_6$ through $G_{w+4}$ are
zero and the encoding of the payload is canonical.
b) If $G_0$ through $G_4$ are 11110 then r and
$v = (-1)^{S} \times (+\infty)$. The values of the remaining bits in
G, and T, are ignored. The two canonical representations of infinity
have bits $G_5$ through $G_{w+4}$ = 0, and T = 0.
c) For finite numbers, r is (S, E − bias, C) and
$v = (-1)^{S} \times 10^{(E-\mathrm{bias})} \times C$, where C is the
concatenation of the leading significand digit or bits from the
combination field G and the trailing significand field T, and where
the biased exponent E is encoded in the combination field. The
encoding within these fields depends on whether the implementation
uses the decimal or the binary encoding for the significand." [9]
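The three cases quoted above can be illustrated with a short Python sketch (Python is used here for illustration only; the project's own simulations were written in MATLAB). The function name `classify_decimal64` and the returned tuple layout are hypothetical. The sketch assumes the 64-bit layout of one sign bit, a 13-bit combination field and a 50-bit trailing field, and, for finite numbers, the decimal-encoding rule of the standard in which the two leading exponent bits and the leading significand digit are packed into $G_0$ through $G_4$. Decoding of the trailing significand field T is deliberately left out.

```python
def classify_decimal64(word: int):
    """Classify a 64-bit decimal64 word from its combination field G.

    Returns (kind, sign, biased_exponent, leading_digit); the last two
    entries are None for NaN and infinity. Hypothetical helper, not the
    project's Translate function.
    """
    sign = (word >> 63) & 1
    g = (word >> 50) & 0x1FFF        # G0..G12, with G0 the most significant bit
    top5 = g >> 8                    # G0..G4
    if top5 == 0b11111:              # case a): NaN, G5 selects signaling/quiet
        g5 = (g >> 7) & 1
        return ("sNaN" if g5 else "qNaN", sign, None, None)
    if top5 == 0b11110:              # case b): infinity
        return ("Inf", sign, None, None)
    continuation = g & 0xFF          # G5..G12: low 8 bits of the biased exponent
    if (top5 >> 3) == 0b11:          # 110xx or 1110x: leading digit is 8 or 9
        leading_digit = 8 + (top5 & 1)
        exp_msb = (top5 >> 1) & 0b11
    else:                            # 0xxxx or 10xxx: leading digit is 0..7
        leading_digit = top5 & 0b111
        exp_msb = top5 >> 3
    return ("finite", sign, (exp_msb << 8) | continuation, leading_digit)
```

Because the two leading exponent bits can only take the values 0, 1 or 2, the biased exponent produced by this sketch always lies in [0,767].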
Simulation Method
3. Combination field = other (finite numbers).
In the first two cases, the Combination field bits are preset and 500
sets of 59 additional random bits are generated. A correct simulation
of the translate function activates the NaN/Inf flags accordingly.
In the final case, 64 random bits are generated and testing is
performed as follows:
1. For each random binary vector x1, the Translate command is used to
find (sign1, significand1, exponent1).
2. Inverse Translate converts the parameters (sign1, significand1,
exponent1) back to a binary vector, 'res'.
3. For the binary vector 'res', the Translate command is used to find
the parameters (sign2, significand2, exponent2), which are compared
with (sign1, significand1, exponent1).
4.3 Exponent Comparator Simulation
In accordance with the Arithmetic Algorithm (see Para. 3.3),
addition/subtraction of operands includes finding the bigger exponent
and the exponent difference.
According to standard:
"The set of finite floating-point numbers representable within a
particular format is
determined by the following integer parameters:
b = the radix, 2 or 10
p = the number of digits in the significand (precision)
emax = the maximum exponent e
emin = the minimum exponent e
emin shall be 1 − emax for all formats." [9]
In the decimal64 format: emax=+384, b=10, p=16. Therefore, the
dynamic exponent range is [emin,emax] = [-383,384]. It is important
to note that all exponents in the IEEE 754-2008 format are biased,
that is:
"For finite numbers, r is (S, E − bias, C) and
$v = (-1)^{S} \times 10^{(E-\mathrm{bias})} \times C$ ... where the
biased exponent E is encoded in the combination field." [9]
In our case, the bias is 383. Therefore the actual range of the
biased exponent E is [0,767].
Run all possible combinations of e1,e2 to test exponent_comparator
function.
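An exhaustive run of this kind can be sketched in Python as follows. The model `exponent_comparator` is a hypothetical stand-in for the MATLAB function, not the project's code. It assumes the outputs described in Section 5.5: diff is taken as the absolute difference, isBigger is 0 when exp1≥exp2 and 1 otherwise, and biggerexp is the larger exponent. The biased range [0,767] is taken from the text above.

```python
def exponent_comparator(exp1: int, exp2: int):
    """Hypothetical reference model: returns (diff, is_bigger, bigger_exp)."""
    if exp1 >= exp2:
        return exp1 - exp2, 0, exp1
    return exp2 - exp1, 1, exp2


def check_all(max_biased_exp: int = 767) -> int:
    """Check every (e1, e2) pair in [0, max_biased_exp]; return the pair count."""
    checked = 0
    for e1 in range(max_biased_exp + 1):
        for e2 in range(max_biased_exp + 1):
            diff, is_bigger, bigger = exponent_comparator(e1, e2)
            assert diff == abs(e1 - e2)
            assert bigger == max(e1, e2)
            assert is_bigger == (1 if e1 < e2 else 0)
            checked += 1
    return checked
```

With the default range the loop visits 768 × 768 pairs, which is small enough to enumerate completely rather than sample.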
4.4 Full Path Simulation
Using all the resources simulated in MATLAB, one full path can be
constructed, creating a
full addition/subtraction path that can be simulated.
Simulation Method
The simulation of the full addition / subtraction path consists of
3 stages:
1. Initialization: 1000 pairs of 64-bit, DEC64-coded operands are
randomly created. Each pair is translated into spread form.
2. Run: Each pair of 64-bit DEC64 operands is input into 'Full_Path'.
For each pair, 'Full_Path' produces a matching DEC64 format addition
result, which is translated to spread format.
3. Result Analysis: The initial operands are added externally in
MATLAB and compared to the result output of 'Full_Path' in its spread
form. If one of the random operands is a NaN or Inf, the result
operand should reflect it in its Combination field.
Note: Due to the precision limitations of MATLAB, these simulations
required the Variable Precision Arithmetic (VPA) functions of the
Symbolic Math Toolbox. These functions allow for variable precision
and provide more flexibility and control in manipulating numbers.
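The role of an external high-precision reference can be illustrated outside MATLAB as well. The sketch below swaps in Python's standard `decimal` module in place of the VPA functions; it is not the simulation used in this project. The helper name `reference_add` is hypothetical, the context parameters (16 digits, emax=384, emin=-383) follow the decimal64 values quoted in Section 4.3, and the default rounding mode is an assumption, since the rounding scheme chosen for the DFPU is defined elsewhere (Para. 2.4).

```python
from decimal import Context, Decimal, ROUND_HALF_EVEN, ROUND_HALF_UP


def reference_add(a: str, b: str, rounding=ROUND_HALF_EVEN) -> Decimal:
    """Add two decimal strings and round the result to 16 significant digits."""
    ctx = Context(prec=16, Emax=384, Emin=-383, rounding=rounding)
    # Decimal(a) and Decimal(b) are exact; only the sum is rounded by ctx.
    return ctx.add(Decimal(a), Decimal(b))


# The same operands under two rounding schemes (exact sum ends in ...456.5):
half_even = reference_add("1234567890123456", "0.5")
half_up = reference_add("1234567890123456", "0.5", rounding=ROUND_HALF_UP)
```

Here `half_even` and `half_up` differ in the last digit, which shows why a reference must use the same rounding scheme as the unit under test before results are compared.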
4.5 Simulation Results
1. Translate, Inverse translate: of 1000 cases, there were no cases
found where 'x1' differs
from 'res' (i.e. the original vector and the result vector
differ).
2. Exponent Comparator: All cases of 'Exponent Comparator'
variables were examined - no
errors were found.
3. Full Path: of 1000 cases (each case using 2 random variables),
all cases proved the
operand addition/subtraction creates the expected result using the
above algorithm.
In conclusion
In 100% of the cases, simulation results matched the expected
values.
4.6 Full Path Graphic User Interface
In addition to MATLAB simulation of the full path, a Graphic User
Interface (GUI) was
designed in order to have a user-friendly simulation tool for
decimal floating-point
computation that complies with the IEEE 754-2008 standard.
Figure 4.1 shows the simulation GUI for decimal floating-point
computation.
The input operands and result are also displayed in DEC64 format
(in hexadecimal form).
Figure 4.1. Simulation GUI for decimal floating-point
computation.
5.1 Low-level Design
The low-level design of the DFPU is implemented in Verilog and
simulated with the Cadence Simvision tool. Each resource mentioned in
Chapter 3.4 is implemented as a separate Verilog file and is checked
against its own test bench.
5.2 Program Counter
The Program Counter (Fig. 5.1) consists of a simple 8-bit counter
that produces the address of the current instruction. The address
advances with each clock cycle.
Figure 5.1. Program Counter.
A 'jump to address' option is provided for future design. The jmp_en
bit enables the jump and an 8-bit offset value defines the jump
amount.
5.3 Register File
The Register File (Fig. 5.2) consists of 32 registers, each 64 bits
wide.
Read: Two registers can be read simultaneously (Dual-Port Register
File), using the registers' 5-bit indices.
Write: 64 bits of data can be written to a register, using the
register's index and setting the Write enable bit.
A register can be written while a differently indexed register is
being read, i.e. results are written back to the Register File in
parallel with reading operands during the decode stage.
Figure 5.2. Register File.
5.4 Translate & Inverse Translate
The Translate component (Fig. 5.3) decodes a DEC64 input operand to
sign, exponent, and
significand. If the given input is a NaN/Infinity, the isNaN/isInf
output bit is set.
Figure 5.3. Translate component.
The Inverse Translate component (Fig. 5.4) encodes sign, exponent,
and significand to a
DEC64 output operand. If the encoded operand is a NaN/Infinity, the
input isNaN/isInf bit
declares it.
5.5 Exponent Comparator
The Exponent Comparator (Fig. 5.5) subtracts the input exponents
and returns:
diff – the difference between the input exponents.
isBigger – a bit that indicates which exponent is bigger. (0: if
exp1≥exp2. 1: if exp1<exp2).
biggerexp – the bigger exponent.
Figure 5.5. Exponent Comparator.
5.6 Check Needed
The Check Needed component (Fig. 5.6) simply checks whether the input
diff > 17 (decimal). If so, en=0 and no shifting, adding or
subtracting is needed. Otherwise en=1.
Figure 5.6. The Check Needed component.
5.7 Right Shifter
The Right Shifter component (Fig.5.7) aligns the input significands
according to the other
inputs:
isBigger –indicates which operand has a bigger exponent.
diff –the amount of shifts (difference between exponents of the
operands).
The output significands consist of 76 bits.
If en_in is set: Right Shift must be performed. The significand to
be shifted is concatenated
with 64 bits (16 trailing zeroes) and goes through a Barrel
Shifter.
The significand to be shifted is chosen according to the value in
isBigger:
If isBigger=0 - sig2 is shifted by 'diff' positions.
If isBigger=1 - sig1 is shifted by 'diff' positions.
The output of the Barrel Shifter is truncated to 76 bits (19 digits).
The 19th digit of the truncated output must serve as the Sticky
Digit, signifying the existence of non-zero trailing digits. The
Sticky Digit is constructed according to the following rule: if the
19th digit is not zero, then the Sticky Digit retains its value. The
Sticky Digit will retain a 0 value if and only if the 56 least
significant bits are 0. Otherwise it is set to decimal 1.
The unshifted significand is concatenated with three trailing zero
digits (12 bits).
If en_in is cleared: the output significand of the bigger operand
is concatenated with three
trailing zero digits. The other output significand is zero (won't
be used later).
en_out = en_in; the bit is passed through for timing reasons.
Shifting is done using a Barrel Shifter, which speeds up the
operation.
Figure 5.8 shows the implementation of the Right Shifter
component.
Figure 5.8. Implementation of the Right Shifter component.
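The alignment and Sticky Digit rule described above can be sketched behaviourally in Python. This is an illustrative model under stated assumptions, not the Verilog implementation: significands are held as plain integers rather than BCD bit vectors, the function names are hypothetical, and digits shifted beyond the 32-digit window (possible only for diff > 16) are simply discarded.

```python
DIGITS = 16        # decimal64 significand digits
OUT_DIGITS = 19    # 76-bit output, 4 bits per BCD digit


def right_shift_with_sticky(sig: int, diff: int) -> int:
    """Align the significand of the smaller operand; return 19 digits as an int."""
    wide = sig * 10**DIGITS                    # append 16 trailing zero digits
    shifted = wide // 10**diff                 # shift right by 'diff' digit positions
    low_digits = 2 * DIGITS - OUT_DIGITS       # 13 digits below the 19-digit window
    kept, dropped = divmod(shifted, 10**low_digits)
    if kept % 10 == 0 and dropped != 0:
        kept += 1                              # Sticky Digit forced to decimal 1
    return kept


def pad_unshifted(sig: int) -> int:
    """Append three trailing zero digits to the significand that is not shifted."""
    return sig * 10**(OUT_DIGITS - DIGITS)
```

For example, shifting 1000000000000001 by four positions pushes the final 1 below the 19-digit window, so the last kept digit changes from 0 to 1 to record that non-zero digits were lost.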
5.8 Adder/Subtractor
Given two unsigned significands, the main goal is to produce a new
result significand, which
is the output of one of the following scenarios:
1. Addition: adding the two significands.
2. Subtraction: subtracting the two significands (return the
absolute difference).
3. No operation (return one specific significand out of the two
input significands).
Figure 5.9 shows a general description of the
Adder/Subtractor.
Inputs:
sig1, sig2 – The input significands consist of 19 digits each,
while each digit is
represented by 4 bits (BCD - binary-coded decimal representation),
thus an input
significand consists of 76 bits.
en – Operation enable bit. When set - Addition/Subtraction is
carried out. When cleared -
no operation is taking place.
add/sub – Indicates the type of operation. 1 - Addition. 0 –
Subtraction.
isBigger – Indicates which of the operands that include the
significands is bigger.
1: if operand1 < operand2.
0: if operand1 ≥ operand2.
Note: isBigger gives no information about the relation between sig1
and sig2.
Outputs:
res. sig – The output significand consisting of 19 digits (76
bits).
c_out – The output carry of the operation.
op_sign – The sign of the output result
Addition or Subtraction of two significands cannot be done bitwise,
but must be performed in
groups of 4 bits due to the use of BCD representation.
Note: BCD representation is a 4 bit binary representation for
decimal digits in range:
{0:9}10→{0000 - 1001}2.
Figure 5.10 describes the implementation of 4 bit BCD Adder for
calculation of a single
digit.
Similarly to binary subtraction using 1's complement, Subtraction is
carried out using the 9's complement representation. Therefore, if
the current operation is subtraction (add/sub = 0), the complemented
digit B (which is 9-B) is chosen by the multiplexer at the entrance
of the upper 4-bit Full Adder. The reason for complementing B is that
subtraction is simply addition with a 9's complemented subtrahend. [10]
Whenever the sum of the upper 4-bit Full Adder exceeds
$(1001)_2 = 9$, the output sum has to be fixed so that the output
equals (sum-10) and the carry out equals 1. This can be achieved by
adding $(0110)_2 = 6$ to the sum.
___________________________________________________________________________
Proof: (sum-10) = sum + 6 - 16 = (sum+6) - 16. Subtracting 16 from
(sum+6) is the same as discarding the fifth bit (weight 16) of the
result, and that bit serves as the carry out.
___________________________________________________________________________
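Whether a fix is needed can be determined by a rather simple circuit.
A fix is needed whenever carry out = 1. Examination of the Truth
Table in Table 5.1 yields:
$\text{carry out} = c_{out} + s_3 \cdot (s_2 + s_1)$
where $c_{out}$ is the carry out of the upper 4-bit Full Adder and
$s_3$, $s_2$, $s_1$ are the three most significant bits of its sum.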
The carry out bit also serves as the input decision bit of the
output Multiplexer.
Figure 5.10. Implementation of 4 bit BCD Adder.
If the carry out bit is cleared, the output is the sum of the upper
4 bit Full Adder.
If the carry out bit is set, the output is the sum of the lower 4
bit Full Adder (the fixed sum).
carry out  c_out  s0   s1  s2  s3
0          0      0/1  0   0   0
1          1      0/1  0   0   0
0          0      0/1  1   0   0
1          1      0/1  1   0   0
0          0      0/1  0   1   0
1          1      0/1  0   1   0
0          0      0/1  1   1   0
1          1      0/1  1   1   0
0          0      0/1  0   0   1
1          1      0/1  0   0   1
1          0      0/1  1   0   1
1          1      0/1  1   0   1
1          0      0/1  0   1   1
1          1      0/1  0   1   1
1          0      0/1  1   1   1
1          1      0/1  1   1   1
Table 5.1. Truth Table for carry out bit.
_____________________________________________________________________
Example 5.1.
  0101   (5)
+ 1001   (9)
= 1110   (14 dec ≥ 10, fix is needed)
+ 0110   (6)
= 0100   with carry out = 1
Answer is (sum = 4) and (carry = 1) which corresponds to 14.
Example 5.2.
  0011   (3)
+ 0100   (4)
= 0111   (7 dec < 10, no fix is needed)
_____________________________________________________________________
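The single-digit BCD adder with its "+6" fix can be modelled in a few lines of Python. The sketch below is a behavioural illustration and not the Verilog component: the function name `bcd_digit_add` is hypothetical, and the fix condition is the one read from Table 5.1, carry out = c_out + s3·(s2 + s1).

```python
def bcd_digit_add(a: int, b: int, carry_in: int = 0, subtract: bool = False):
    """Add two BCD digits (0..9); returns (sum_digit, carry_out)."""
    if subtract:
        b = 9 - b                        # 9's complement chosen by the input multiplexer
    raw = a + b + carry_in               # upper 4-bit binary adder, value 0..19
    c_out = (raw >> 4) & 1               # binary carry out of the upper adder
    s3 = (raw >> 3) & 1
    s2 = (raw >> 2) & 1
    s1 = (raw >> 1) & 1
    carry_out = c_out | (s3 & (s2 | s1))  # a fix is needed exactly when this is 1
    fixed = (raw + 6) & 0xF              # lower 4-bit adder: add 0110, drop bit 4
    digit = fixed if carry_out else raw & 0xF
    return digit, carry_out
```

Calling `bcd_digit_add(5, 9)` reproduces Example 5.1 (digit 4, carry 1), and `bcd_digit_add(3, 4)` reproduces Example 5.2 (digit 7, carry 0).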
Both addition and subtraction are executed using Carry-Select Adder
with groups of 4 bits
(Fig. 5.11).
Reminder: calculations are carried out using BCD representation;
therefore calculation of a
certain digit (4 bits) should be separated from calculations of the
other digits.
A Carry-Select Adder saves the carry ripple time [11] in exchange for
adding another Full Adder for the calculation of each digit (except
for the least significant digit). This approach follows the
principle: "High Performance over Silicon Cost".
The added digits (each consists of 4 bits) enter two identical 4
bit BCD Adders, where the
input carry of one adder is logic '0' and the input carry of the
other adder is logic '1'. A
Multiplexer chooses one of the two sums produced and one of the two
output carries. The
decision bit is the carry out that is chosen in the previous
Multiplexer.
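The selection scheme described above can be sketched in Python as a behavioural model. It is an assumption-laden illustration, not the Verilog design: digits are held in lists with the least significant digit first, the helper names are hypothetical, and the per-digit adder is modelled arithmetically instead of at bit level. In hardware the two speculative results of every digit are produced in parallel; the loop below only models the selection.

```python
def bcd_digit(a: int, b: int, carry_in: int):
    """Arithmetic model of one BCD digit adder: returns (digit, carry_out)."""
    total = a + b + carry_in
    return total % 10, total // 10


def carry_select_bcd_add(x_digits, y_digits, carry_in: int = 0):
    """Add two equal-length digit lists (least significant digit first)."""
    sums, carry = [], carry_in
    for i, (a, b) in enumerate(zip(x_digits, y_digits)):
        if i == 0:
            s, carry = bcd_digit(a, b, carry)        # least significant digit: one adder
        else:
            s0, c0 = bcd_digit(a, b, 0)              # speculative result, carry in = 0
            s1, c1 = bcd_digit(a, b, 1)              # speculative result, carry in = 1
            s, carry = (s1, c1) if carry else (s0, c0)  # multiplexer picks one pair
        sums.append(s)
    return sums, carry


def to_digits(value: int, width: int = 19):
    """Split an integer into 'width' decimal digits, least significant first."""
    return [(value // 10**i) % 10 for i in range(width)]


def from_digits(digits) -> int:
    """Inverse of to_digits."""
    return sum(d * 10**i for i, d in enumerate(digits))
```

The price of the speed-up is visible in the loop body: every digit except the first needs two digit adders instead of one.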
Addition:
Adding two significands is rather simple when compared to
Subtraction.
The output op_sign that should indicate the sign of the result is
always 0, since the result is
always positive.
In case of a result significand that is ≥ 10, carry out = 1.
Subtraction:
There are two subtraction scenarios:
1. sig1 > sig2
2. sig1 ≤ sig2
The result of the subtraction should be given as an absolute value.
The output op_sign should indicate the sign of the result.
The first subtraction scenario leads to (carry out = 1). This
wrap-around carry should be added to the result significand.
_____________________________________________________________________
First scenario (sig1 > sig2):
sig1 + (99……999 - sig2) =
= (sig1-sig2) + 99……999 => carry out = 1 since (sig1-sig2) > 0.
adding wrap-around-carry => (sig1-sig2) + 99……999 + 1 =
= (sig1-sig2) + 100……000, which is (sig1-sig2) once the carry out is
discarded.
_____________________________________________________________________
_____________________________________________________________________
Second scenario (sig1 ≤ sig2):
sig1 + (99……999 - sig2) =
= (sig1-sig2) + 99……999 = (=> carry out = 0 since (sig1-sig2) ≤ 0)
= 99……999 - (sig2-sig1) = (sig2-sig1) complemented =>
=> complementing the answer will give (sig2-sig1) = |sig1-sig2|.
_____________________________________________________________________
In order to save the time of adding the wrap-around carry (first
scenario) or complementing the answer (second scenario), the
Adder/Subtractor component calculates the result according to the
three scenarios (one in Addition, two in Subtraction) in parallel and
a Multiplexer chooses the output significand.
Using two 76-bit Carry-Select BCD Adders (named pipe1 and pipe2), one
with (carry in = 0) and the other with (carry in = 1), and a 9's
complement unit, a correct result can be obtained for each of the
three scenarios:
For Addition: the result is the output of pipe1 (carry in = 0).
For Subtraction (first subtraction scenario): the result is the
output of pipe2 (carry in = 1).
Adding a wrap-around carry is the same as setting (carry in = 1) in
the first place.
For Subtraction (second subtraction scenario): the result is the 9's
complement of the output of pipe1 (carry in = 0).
Note: creation and complementing of the output of pipe1 are carried
out in parallel (once an output digit is calculated, it is
complemented) and not after the entire output significand is
calculated.
Figure 5.12 shows the implementation of the Adder/Subtractor
component.
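The three parallel scenarios can be illustrated with a short Python sketch. It is a behavioural model under stated assumptions, not the Verilog component: the 19-digit significands are held as integers, the function name `add_sub_significands` is hypothetical, only the en = 1 case is modelled, and the add flag follows the add/sub convention (true for Addition, false for Subtraction).

```python
N_DIGITS = 19
MOD = 10**N_DIGITS      # one more than the largest 19-digit value
NINES = MOD - 1         # 99...999 (19 nines)


def add_sub_significands(sig1: int, sig2: int, add: bool):
    """Return (result, c_out, op_sign) for an enabled operation."""
    operand_b = sig2 if add else NINES - sig2   # 9's complement for Subtraction
    pipe1 = sig1 + operand_b                    # adder with carry in = 0
    pipe2 = sig1 + operand_b + 1                # adder with carry in = 1
    p1_cout = pipe1 // MOD                      # carry out of pipe1
    if add:
        return pipe1 % MOD, p1_cout, 0          # Addition: pipe1, sign positive
    if p1_cout:
        return pipe2 % MOD, 0, 0                # sig1 > sig2: pipe2 holds sig1-sig2
    return NINES - (pipe1 % MOD), 0, 1          # sig1 <= sig2: complement of pipe1
```

In this model both `pipe1` and `pipe2` are always computed and the carry out of pipe1 alone decides which one is used, so no comparison of the significands is needed.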
No operation:
When the input en bit is cleared, no operation is needed. Therefore
the output significand
and op_sign are chosen according to the other input bits: add/sub
and isBigger.
isBigger is cleared (operand1≥operand2):
The output significand is sig1 and the result is positive (op_sign
= 0).
isBigger is set (operand1<operand2):
The output significand is sig2.
If the operation is Addition (add/sub = 1) then op_sign = 0.
If the operation is Subtraction (add/sub = 0) then op_sign =
1.
The Truth Table in Table 5.2 summarizes the relations between the
outputs and the inputs:
op_sign  c_out  res. sig  p1_cout  add/sub  isBigger  en
0        0      sig2      0        0        0         0
0        0      sig2      1        0        0         0
0        0      sig2      0        1        0         0
0        0      sig2      1        1        0         0
1        0      sig1      0        0        1         0
1        0      sig1      1        0        1         0
0        0      sig1      0        1        1         0
0        0      sig1      1        1        1         0
1        0      pipe1c    0        0        0         1
0        0      pipe2     1        0        0         1
0        0      pipe1     0        1        0         1
0        1      pipe1     1        1        0         1
1        0      pipe1c    0        0        1         1
0        0      pipe2     1        0        1         1
0        0      pipe1     0        1        1         1
0        1      pipe1     1        1        1         1
Table 5.2. Truth Table for the outputs of the Adder/Subtractor
component, where: p1_cout is the carry out of the pipe1 Adder, pipe1
is the pipe1 output significand, pipe1c is the pipe1 output
significand complemented, and pipe2 is the pipe2 output significand.
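Therefore:
$c\_out = en \cdot (add/sub) \cdot p1\_cout$
$op\_sign = \overline{(add/sub)} \cdot \left( \overline{en} \cdot isBigger + en \cdot \overline{p1\_cout} \right)$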
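The flag outputs of Table 5.2 can be checked mechanically. The Python sketch below encodes the sixteen rows of the table and compares them with one possible pair of Boolean expressions for c_out and op_sign; these expressions are our reading of the table, and the function and variable names are hypothetical.

```python
# Rows of Table 5.2 as (op_sign, c_out, p1_cout, add_sub, is_bigger, en).
TABLE_5_2 = [
    (0, 0, 0, 0, 0, 0), (0, 0, 1, 0, 0, 0), (0, 0, 0, 1, 0, 0), (0, 0, 1, 1, 0, 0),
    (1, 0, 0, 0, 1, 0), (1, 0, 1, 0, 1, 0), (0, 0, 0, 1, 1, 0), (0, 0, 1, 1, 1, 0),
    (1, 0, 0, 0, 0, 1), (0, 0, 1, 0, 0, 1), (0, 0, 0, 1, 0, 1), (0, 1, 1, 1, 0, 1),
    (1, 0, 0, 0, 1, 1), (0, 0, 1, 0, 1, 1), (0, 0, 0, 1, 1, 1), (0, 1, 1, 1, 1, 1),
]


def output_flags(en: int, is_bigger: int, add_sub: int, p1_cout: int):
    """Return (c_out, op_sign) from the four 1-bit inputs."""
    c_out = en & add_sub & p1_cout
    op_sign = (1 - add_sub) & (((1 - en) & is_bigger) | (en & (1 - p1_cout)))
    return c_out, op_sign


def matches_table() -> bool:
    """True if the expressions agree with every row of Table 5.2."""
    return all(
        output_flags(en, is_bigger, add_sub, p1_cout) == (c_out, op_sign)
        for op_sign, c_out, p1_cout, add_sub, is_bigger, en in TABLE_5_2
    )
```

The res. sig column is not modelled here; only the two 1-bit outputs are compared.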
Figure 5.11. Implementation of a 76 bit Carry-Select BCD
Adder.
5.9 Normalizer
The Normalizer (Fig. 5.13) shifts the input significand according
to the Normalizing
specifications (see Para. 2.4).
c_out indicates that a carry out has occurred in the previous
resource (significand ≥ 10).
OVF/UNDF bit is set if the normalization causes an
overflow/underflow.
Figure 5.13. Normalizer.
Shifting is done using a Barrel Shifter, which speeds up the
operation.
5.10 Rounder
The Rounder (Fig. 5.14) rounds the input significand according to
the Rounding
specifications (see Para. 2.4).
5.11 Sign Decision
The Sign Decision component (Fig. 5.15) finds the final sign of the
result operand.
Figure 5.15. The Sign Decision component.
If add/sub=1: the actual operation carried out in the
Adder/Subtractor was addition. Therefore the final sign is sign1.
If add/sub=0: the actual operation carried out in the
Adder/Subtractor was subtraction. Therefore the final sign is
(sign1 XOR op_sign), where op_sign is the sign of the result of the
subtraction.
Figure 5.16 shows the implementation of the Sign Decision
component.
Figure 5.16. Implementation of the Sign Decision component.
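The rule implemented by the Sign Decision component amounts to a single multiplexer and an XOR gate. A minimal Python sketch follows, with a hypothetical function name and all inputs taken as 1-bit integers:

```python
def final_sign(sign1: int, op_sign: int, add_sub: int) -> int:
    """Final result sign: sign1 for addition, sign1 XOR op_sign for subtraction."""
    if add_sub == 1:              # the Adder/Subtractor performed an addition
        return sign1
    return sign1 ^ op_sign        # subtraction: flip sign1 when the difference is negative
```

For example, with sign1 = 0 and a subtraction whose result is negative (op_sign = 1), the final sign is 1.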
6.1 Composing the Complete System
In order to compose the complete system, the implemented units
(Chapter 5) were integrated
into a full, connected datapath.
6.2 Creating a Pipelined Datapath
An instruction is divided into four stages, i.e. the design moves
from a single-cycle datapath to a four-stage pipelined datapath,
which means that up to four instructions are in execution during any
single clock cycle.
By creating three Pipeline Registers, stages are separated and
information is saved with each
rising edge of the clock.
The pipeline execution throughput is one instruction per
cycle.
Figure 6.1 shows the Pipelined Datapath.
Note: certain signals (OVF, UNDF, isNaN, isInf etc.) were omitted
from Figure 6.1 in order to describe the main flow of data.
6.3 Creating a Control Unit
In order to manage the advancement of the pipeline and the control
signals (enable bits, multiplexer decision bits) of the different
concurrent operations, a central Control Unit is necessary. The
control unit is implemented as a Finite State Machine (FSM).
Given 6 states: Instruction Fetch, Decode, Execute, Write Back,
Idle state (for system reset),
wait4inst (system out of reset and waiting for input cache to
load), and a pipelined datapath,
several states can coexist simultaneously (each combination of
IF,D,E,WB). For each of the
possible state combinations, the control unit allocates a unique
state, and sends signals to the
datapath according to the current state. Overall there are 17
states available.
Figure 6.2 describes the transfer function between states.
For simplicity, states in the diagram were joined according to
similar next states.
Transitions between states are represented by
<condition>/<next state>. For example: NI/EW
means that no new instruction has arrived and the next state is
EW.
Each transfer between states depends on the previous state, and
whether or not there is a
following instruction to perform (in that case the FSM will receive
mem_valid=1). So long as
there is a following instruction to perform, a valid bit is sent to
the Program Counter in order
to fetch the new instruction. Once there are no more instructions to
perform, the Program Counter valid bit is cleared, and no more
instructions are fetched.
Note: the pipeline continues to execute the existing
instructions.
For each new instruction Fetched, the opcode is analyzed by the
FSM, and it returns the
appropriate control signals. Similarly, with each state transfer,
write enable signals are sent to
the pipeline registers.
Table 6.1 describes the values of the control signals for each type
of input instruction opcode.
The control signals are:
DPD source –decision for the source of the densely packed decimal
operand.
wsource –decision for the source of the data written.
incdec –decision for inc/dec.
wbmethod – decision for Write Back method.
negator – decision for negate operation.
wen – write enable for the Register File.
sub_op – decision for subtract operation.
selfwrite – decision for read & write to the same
register.
An extra signal is reserved for further design.
Instruction  DPD source  wsource  incdec  unbinop  wbmethod  negator  wen  sub_op  selfwrite  reserved
add_r        1           1        0       0        0         0        1    1       0          0
add_m        1           1        0       0        1         0        1    1       0          0
sub_r        1           1        0       0        0         0        1    0       0          0
sub_m        1           1        0       0        1         0        1    0       0          0
inc_r        1           1        1       1        0         0        1    1       1          0
inc_m        1           1        1       1        1         0        1    1       1          0
dec_r        1           1        0       1        0         0        1    1       1          0
dec_m        1           1        0       1        1         0        1    1       1          0
neg          1           0        0       0        0         1        1    0       1          0
mov_i        0           0        0       0        0         0        1    0       0          0
mov_r        1           0        0       0        0         0        1    0       0          0
Table 6.1. Control Signals for each type of instruction.
6.4 Pipeline Hazards
A Read-after-Write (RAW) data hazard may occur in the designed
Pipeline, which can result
in incorrect computation.
This hazard occurs when an instruction refers to a result that has
not yet been calculated or
retrieved.
add_r r5,r1,r4
r1 is read before its true value is written, because the second
instruction starts the Execution
stage when the first instruction starts the Write Back stage.
Some of the possible future solutions for this problem are:
1. Stalling the pipeline (will increase latency).
2. Forwarding: once an instruction finishes its Execution stage, the
result can be used immediately in the Execution stage of the next
instruction. [12]
3. Reordering instructions to avoid hazards (done by the designated
compiler).
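Detection of such hazards, which any of the three solutions above would need, can be sketched in Python. The sketch is illustrative only and is not part of the DFPU or of a designated compiler. The instruction tuples, the function name `find_raw_hazards` and the example program (including its first instruction) are hypothetical, and the window of two preceding instructions is an assumption: it treats a register read during Decode in the same cycle as its Write Back as a hazard.

```python
from typing import List, Tuple

# (mnemonic, destination register index, source register indices)
Instr = Tuple[str, int, Tuple[int, ...]]


def find_raw_hazards(program: List[Instr], window: int = 2):
    """Return (writer_index, reader_index, register) for each RAW hazard found."""
    hazards = []
    for i, (_, _, sources) in enumerate(program):
        for j in range(max(0, i - window), i):
            dest = program[j][1]
            if dest in sources:            # a later instruction reads a pending result
                hazards.append((j, i, dest))
    return hazards


# Hypothetical two-instruction program: the second reads r1 written by the first.
example = [("add_r", 1, (2, 3)), ("add_r", 5, (1, 4))]
example_hazards = find_raw_hazards(example)
```

A stalling or reordering scheme could use the returned index pairs to decide where a bubble or an independent instruction has to be inserted.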
7.1 Verification Properties
The main properties necessary for verification and validation for
the DFP unit are:
Correct calculation of arithmetic operation - includes arithmetic
testing and correct
Datapath operation.
Correct calculation of arithmetic operation
The following types of tests were performed:
1. Correct addition/subtraction of operands.
2. Correct operation for large exponent difference.
3. Correct handling of Overflow/Infinity.
Compliance with IEEE 754-2008 standard specifications
The following specifications were tested:
1. Correct translation to/from DEC64 format to spread decimal
floating point format.
2. Correct result rounding according to IEEE 754-2008 standard
scheme chosen.
3. Correct encoding/decoding of infinity/NaN.
The guidelines for the verification were taken from test vectors
published by Prof. Mike Cowlishaw [1]. His work was written prior to
the publication of the new IEEE 754-2008 standard, and therefore
could not be used in full, but the principles of his verification
technique were adapted for this project.
Initially, as a 'quick confidence check', a sample assembly program
was loaded into the
instruction cache and results were validated. This initial test
examined the basic operation for
each available command.
The following step was to build a more robust testing array, based
on the scheme in
Figure 6.3.
Figure 6.3. Verification Scheme.
The test vectors used for comparison were taken from Cowlishaw's
website and also from IBM Haifa's Floating Point test Generator
[2,3]. The IBM test vectors were translated to DFP commands according
to the DFPU ISA using AWK scripts.
The commands were loaded into the UUT (Unit Under Test - DFPU) and
the results were printed out and compared to the results given by
IBM.
7.2 Verification Conclusions and Results
Correct calculation of arithmetic operation
1. Correct addition/subtraction of operands - verified. In cases
where results differed, close
examination showed that the cause was different rounding
schemes.
2. Correct operation for large exponent difference -
verified.
3. Correct handling of Overflow/Infinity - verified. In cases where
results differed, close
examination showed that the cause was different rounding
schemes.
Compliance with IEEE 754-2008 standard specifications
1. Correct translation to/from DEC64 format to spread decimal
floating point format -
verified.
2. Correct result rounding according to IEEE 754-2008 standard
scheme chosen - Despite
the different rounding schemes, in some cases the result is rounded
to the same value. Of
all the cases examined, some errors in rounding were identified and
corrected. In other
cases, the DFPU result agreed with the chosen rounding scheme, and
differences between
the DFPU and the IBM test vectors were due to different rounding
schemes.
3. Correct encoding/decoding of infinity/NaN – verified.
8.1 Implementation on FPGA
The integrated system was implemented on the Virtex®-6 FPGA ML605
Evaluation Board. The Xilinx ISE Design Suite 13.2 was used to load
the design. The *.list files, used in the Cadence Simvision
environment to simulate the instruction and result memory, were
implemented in the Virtex6 system using Distributed RAM, loaded with
*.coe files.
inst_mem.coe represents our instruction memory and is the basis of
our Test bench.
8.2 Design Evaluation
Running the design on the Virtex6 provided the possibility to test
the actual ability to run the
design on real-life hardware with real-life hardware constraints.
Specifically, it allows testing
Timing and Clock Frequency constraints.
Solving synthesis problems
A significant problem with the synthesis was that the designed
Normalizer included a While
loop which is not synthesizable. Conversion of the While loop to a
series of conditional-if
solved this issue.
Identifying optimal clock rate
The process of identifying the optimal clock rate for the DFPU
involved running the unit at increasingly higher clock rates until
incorrect results were returned because commands could no longer
complete in time.
Using a PLL (Phase-locked loop), multiple clocks with different rates
were created.
The working clock was chosen by on-board switches.
The optimal clock rate identified for the DFPU is 66 MHz.
The DFPU is a hardware implementation of decimal arithmetic
algorithms (specifically addition, subtraction and related
operations). Its high-level design is integrated into the low-level
design. It has undergone algorithm simulation, verification and final
hardware synthesis.
The design is unique in terms of several parameters:
The Design is built to comply with the IEEE 754-2008 standard
definitions.
The design includes an advanced Adder/Subtractor, which provides
equal runtime for addition or subtraction calculations, and avoids
wasteful (both in terms of time and silicon size) comparison of
significands which existed in earlier designs [13], which provides
modularity.
The design provides addition/subtraction with a latency of 4 clock
cycles and one
clock cycle throughput.
9.2 Future Expansions
Potential expansions to the DFPU range from functionality to
efficiency.
Functionality
Additional DFP Functions should be made available, such as:
Multiply, Divide, Fused
Multiply Add, Compare etc.
Additional Control Functionality should be made available, such as
Loop Support and
Branch support.
Additional hardware for the creation of detailed Data Payload in
case of Invalid
Operation Exception should be made available.
Efficiency
Adder/Subtractor can be enhanced using carry-look-ahead in each 4
bit BCD adder.
Further attempts to create an even pipeline should be made. For
example, it is possible to take advantage of distributed RAM
capabilities to speed up the Fetch stage and append it to the
following Decode stage, thus forming a 3-stage pipeline.
Support for advanced data hazard solutions can be added
(Forwarding, Reordering,
see Para. 6.4)
A. DFP History
The suggested DFP Unit is not the first decimal floating-point unit
implemented, but it is
unique in that it complies with the new IEEE 754-2008
standard.
Hardware solutions
Select Past attempts:
ENIAC - The United States Military began construction of the ENIAC
during WWII (1943), designed to calculate artillery firing tables for
the United States Army's Ballistic Research Laboratory. The ENIAC
could store a ten-digit decimal number in memory, but could not
perform decimal computations. [4]
Bell Laboratories Mark V - The first documented decimal
floating-point processor was the Bell Laboratories Mark V computer
designed in 1946. [5]
Burroughs 2500 & 3500 - Another important decimal floating-point
computer was the Burroughs 2500, developed in 1966. It used strings
of up to 100 digits, with two 4-bit BCD (Binary-Coded Decimal) digits
per byte.
These examples were developed before the existence of a
floating-point standard.
The 754-1985 standard was the first to define formats for
representing floating-point numbers
and special values (NaN, Inf), floating-point operations, rounding
modes and exceptions. The
standard in use today - IEEE 754-2008 - revised and replaced the
IEEE 754-1985. The
revision extended the previous standard in including, among other
things, decimal arithmetic
and formats, and merged in IEEE-854 (1987) - the radix-independent
floating-point standard.
Two examples of a standardized Decimal Floating Point Unit are the
IBM Z9 (2005-2006) and Z10 (2008). The Z9 utilized an encoded decimal
representation for data, instructions for performing decimal floating
point computations, and an instruction which performed data
conversions to and from the decimal floating point representation.
The System Z9 was the first commercial server to add IEEE 754 decimal
floating point instructions, although these instructions were
implemented in microcode with some hardware assists.
The Z10 introduced full hardware support through a Hardware Decimal
Floating-point Unit (HDFU): it implemented the main IEEE 754 decimal
floating point operations in a built-in, integral component of each
processor core and instruction set architecture. [6]
Note: It is important to note that the Z10 was developed before the
publication of the IEEE
754-2008 standard.
Software solutions
For reasons of backwards compatibility and in order to gain
software flexibility, several
software libraries, capable of handling decimal floating-point
operations were developed.
Some of the more well-known ones are:
Intel® Decimal Floating-Point Math Library
decNumber/decNumber++ by Mike Cowlishaw [1]
These solutions indeed solve the precision issue, but fall short (and
actually worsen the situation) with regard to the speed requirement.
Research performed at the University of Wisconsin shows that when
using the decNumber library for DFP arithmetic, most benchmarks spend
more than 75% of their execution time in DFP functions. [7,8]
The research also showed that providing fast hardware support for DFP
instructions results in speedups for the same benchmarks ranging from
1.3 to 31.2.
Bibliography
Press, Feb. 2003)
6. www.ibm.com/systems/z/hardware/
7. Liang-Kai Wang, Charles Tsen, Michael J. Schulte, and Divya
Jhalani, Benchmarks and
Performance Analysis of Decimal Floating-Point Applications
(University of Wisconsin -
Madison, Department of Electrical and Computer Engineering, Oct.
2007)
8. Michael J. Schulte, Nick Lindberg, Anitha Laxminarain,
Performance Evaluation of
Decimal Floating-Point Arithmetic (University of Wisconsin -
Madison, Department of
Electrical and Computer Engineering,2005)
9. IEEE Standard for Floating-Point Arithmetic (IEEE Computer
Society, Aug 2008)
10. Anshul Singh, Aman Gupta, Sreehari Veeramachaneni, M.B.
Srinivas, A High
Performance Unified BCD and Binary Adder/Subtractor (IEEE Computer
Society
Annual Symposium on VLSI, 2009)
11. Israel Koren, Computer Arithmetic Algorithms (A. K. Peters/CRC
Press, 2nd edition,
Dec, 2001)
12. David A. Patterson, John L. Hennessy, Computer Organization and
Design, The
Hardware/Software Interface (Morgan Kaufmann, 4th edition, Nov.
2008)
13. John Thompson, Nandini Karra, Michael J. Schulte, A 64-bit
Decimal Floating-Point
Adder (IEEE Computer Society Annual Symposium on VLSI: Emerging
Trends in VLSI
Systems Design (ISVLSI'04), 2004)