Architecture-aware Taylor Shift by 1
A Thesis
Submitted to the Faculty
of
Drexel University
by
Anatole D. Ruslanov
in partial fulfillment of the
requirements for the degree
of
Doctor of Philosophy
December 2006
© Copyright December 2006 Anatole D. Ruslanov. All Rights Reserved.
Acknowledgements
• I would like to thank my advisors Jeremy Johnson and Werner Krandick for
their support and patience.
• I would like to thank Jürgen Gerhard, Guillaume Hanrot et al., Bernard Mourrain et al., and George Collins, who developed IPRRIDB, for making their code available.
• I would like to thank Paul Zimmermann for providing his code for the modular
convolution method.
• Thomas Decker invented the interlaced polynomial representation to improve
the SACLIB method.
• I would like to offer my profound gratitude to my parents, Rena and David
Finko for their endless and unfailing patience and support.
• I would like to thank P. R. Sarkar and Sandra Valenzano-Dillon for inspiration
and perspective.
• I would like to thank Jamie Chandler, Robert Garbellano, Donna Hanelin,
Wendy Henson, Sarah Hohenberger, Sherry Paris, Milan Sampat, Virat Staer,
and Clifford J. Straehley for support and encouragement. Friends do matter!
• I would like to thank Carolyn Kieber Grady and Diane Ranck for final proofreading.
Table of Contents
List of Tables vii
List of Figures ix
Abstract xv
1. Preliminaries 7
1.1 Introduction 7
1.2 Thesis organization 7
1.3 Taylor shift by 1: Analysis 8
1.4 Performance and computer architecture 15
1.4.1 Pipelined processors 16
1.4.2 Superscalar execution (ILP) 18
1.4.3 Memory hierarchy 20
1.5 Methodology and experimental procedures 24
1.5.1 Processor architecture 24
1.5.2 Hardware configuration 26
1.5.3 Compilation protocol 27
1.5.4 Input polynomials 28
1.5.5 Performance counter measurements 30
1.5.6 Cache flushing 30
1.6 Literature review 31
1.6.1 Taylor shift by 1 31
1.6.2 Asymptotically fast methods 32
1.6.3 Crossover points for GNU-MP, NTL, and SACLIB 34
1.6.4 Register tiling 35
1.6.5 Compilers and automatic code generation and tuning 36
1.6.6 Real root isolation 37
1.6.7 Notes on experimental methodology 39
2. Straightforward implementations 40
2.1 Introduction 40
2.2 GNU-MP addition 42
2.3 NTL addition 43
2.4 The SACLIB method 44
3. The tile method 47
3.1 Introduction 47
3.2 Description of the algorithm 48
3.3 Properties of the algorithm 51
3.3.1 Register tile schedule 52
3.3.2 An example 53
3.4 Performance 59
3.4.1 Experimental methodology and platform 60
3.4.2 Execution time 60
3.4.3 Efficiency of addition 62
3.4.4 Memory traffic reduction 62
3.4.5 Cache miss rates 66
3.4.6 Branch mispredictions 66
3.4.7 Computing times in the literature 70
3.5 Automatic code generation 72
3.5.1 Processor utilization 73
3.5.2 The 4100 degree irregularity 76
4. Modeling Taylor shift by 1 79
4.1 Introduction 79
4.2 A model for GNU-MP addition 79
4.3 Modeling the straightforward method 83
4.4 Modeling the tile method 94
4.4.1 Impact of changing the number of engaged IEUs 99
4.4.2 Finding optimal register tile size 101
5. Asymptotically fast methods 104
5.1 Introduction 104
5.2 Performance of the fast methods 106
5.3 Computing times in the literature 109
5.4 Improving performance of the fast methods 109
5.4.1 Replacing native NTL arithmetic with GNU-MP arithmetic 113
5.4.2 Observations on coding 116
5.5 Conclusions 117
6. Applications 118
6.1 High-performance de Casteljau's algorithm 118
6.2 High-performance Descartes method 120
6.2.1 Monomial vs. Bernstein bases 120
6.2.2 The Descartes methods we compare 123
6.2.3 Performance results 125
7. Future research 136
Bibliography 138
Vita 146
List of Tables
3.1 An optimal instruction schedule for the 8 x 8 register tile for the UltraSPARC III processor 54
3.2 An example polynomial with its coefficients in the interlaced representation 55
3.3 The output polynomial after Taylor shift by 1 computation with its coefficients not normalized 58
3.4 The output polynomial after Taylor shift by 1 computation with its coefficients normalized 59
3.5 Computing times (s.) for Taylor shift by 1 —"small" coefficients 70
3.6 Computing times (s.) for Taylor shift by 1 —"large" coefficients 71
4.1 Parameters used for modeling GNU-MP addition. The patch refers to
Gaudry's patch [39] 82
4.2 Experimentally determined cost for register tile execution 96
4.3 Cost of the b x b register tile execution in cycles. The chosen b is the
optimal value for the particular platform, see Section 3.5 96
4.4 Experimentally determined cost of the 3 versions of the carry propagation
in processor cycles for the register tile sizes ranging from 4 x 4 to 24 x 24. 97
4.5 Cost of the delayed carry propagation in cycles 98
4.6 Parameters used in modeling the tile method 98
5.1 Computing times (s.) for the divide and conquer method of Taylor shift
by 1 —"small" coefficients 110
5.2 Computing times (s.) for the convolution method of Taylor shift by 1
—"small" coefficients 110
5.3 Computing times (s.) for the Paterson & Stockmeyer method of Taylor shift by 1 —"small" coefficients 111
5.4 Computing times (s.) for the divide and conquer method of Taylor shift by 1 —"large" coefficients 111
5.5 Computing times (s.) for the convolution method of Taylor shift by 1
—"large" coefficients 112
5.6 Computing times (s.) for the Paterson & Stockmeyer method of Taylor
shift by 1 —"large" coefficients 112
6.1 Root isolation timings in milliseconds for Intel Pentium EE 132
6.2 Root isolation timings in milliseconds for AMD Opteron 133
6.3 Root isolation timings in milliseconds for UltraSPARC III 134
6.4 Root isolation timings in milliseconds for Intel Pentium 4 135
List of Figures
1.1 By Theorem 1.3.3, the pattern of integer additions in Pascal's triangle, $a_{i,j} = a_{i,j-1} + a_{i-1,j}$, can be used to perform Taylor shift by 1 10
1.2 The coefficients of the polynomial $A_k(x)$ in the proof of Theorem 1.3.3 reside on the k-th diagonal. Multiplication of $A_k(x)$ by $(x+1)$ can be interpreted as an addition that follows a shift to the right and a downward shift 11
1.3 Pipelining delivers efficient execution of machine instructions 16
1.4 Pipeline stalls caused by dependencies such as mispredicted branches may cause a slowdown by a factor of 10 to 30 19
1.5 Superscalar feature of UltraSPARC III processor accommodates up to 4 simultaneous instructions in its Execute stage 20
1.6 Memory hierarchy of 1 GHz UltraSPARC III processor 22
1.7 A simple direct-mapped cache 23
2.1 The straightforward method we consider uses integer additions to compute the elements of the matrix in Figure 1.1 from top to bottom and from left to right 41
2.2 The GNU-MP assembly addition routine for the UltraSPARC III platform 43
2.3 Performance gain due to Gaudry's patch for the straightforward method. 44
2.4 Taylor shift by 1 in SACLIB 46
3.1 a. Tiled Pascal triangle, b. Register tile stack. A register tile is computed for each order of significance. Carries are propagated only along lower and right borders 49
3.2 a. A scheduled 8 x 8 register tile. Arrows represent memory references, "+" signs represent additions. Numbers represent processor cycles. The 2 integer execution units (IEUs) perform 2 additions per cycle, b. Register tile: a sketch of the proof of Theorem 3.3.1 52
3.3 Pascal's triangle for the example polynomial. The coefficients of the polynomial and the elements of the triangle are in decimal representation. 53
3.4 Pascal's triangle for the level of significance 0 56
3.5 Pascal's triangle for the level of significance 1 56
3.6 Pascal's triangle for the level of significance 2 56
3.7 Non-normalized register tile for the level of significance 0 57
3.8 Non-normalized register tile for the level of significance 1 57
3.9 Non-normalized register tile for the level of significance 2 57
3.10 Normalized register tile for level of significance 0 after carry propagation; now the radix β = 8 58
3.11 Normalized register tile for level of significance 1 after carry propagation; now the radix β = 8 58
3.12 Normalized register tile for level of significance 2 after carry propagation; now the radix β = 8 58
3.13 The tile method is up to 7 times faster than the straightforward method. 61
3.14 For the input polynomials Cn,d the tile method computes a whole register tile stack at the precision required for just the constant term 63
3.15 In GNU-MP addition the ratio of cycles per word addition (left scale) increases with the cache miss rate (right scale) 64
3.16 In classical Taylor shift by 1 the tile method requires fewer cycles per word addition than the straightforward method 65
3.17 The tile method substantially reduces the number of memory reads required for the Taylor shift; the extent of the reduction depends on the compiler 67
3.18 For large degrees the tile method has a lower cache miss rate than the straightforward method. Moreover, the number of cache misses generated by the tile method is small because the tile method performs few read operations 68
3.19 The number of branch mispredictions per cycle is negligible for the tile method and the straightforward method 69
3.20 Impact of tile size on the performance of the tile method on Pentium EE processor. Legend: tile size in word x word 74
3.21 Impact of tile size on the performance of the tile method on Opteron processor. Legend: tile size in word x word 74
3.22 Impact of tile size on the performance of the tile method on Pentium 4 processor. Legend: tile size in word x word 75
3.23 Impact of tile size on the performance of the tile method on UltraSPARC III processor. Legend: tile size in word x word 75
3.24 Processor utilization in word additions per cycle for the straightforward method 77
3.25 Processor utilization in word additions per cycle for the tile method 77
4.1 Modeling GNU-MP addition for the UltraSPARC III processor 84
4.2 Modeling GNU-MP addition for the Pentium 4 processor 85
4.3 Modeling GNU-MP addition without Gaudry's patch for the Opteron processor 86
4.4 Modeling GNU-MP addition with Gaudry's patch for the Opteron processor 87
4.5 Modeling GNU-MP addition for the Pentium EE processor 88
4.6 The distribution of the length of sums L in the straightforward method. Both experimental and modeled data provided 91
4.7 Modeling the straightforward method for the UltraSPARC III processor. 91
4.8 Modeling the straightforward method for the Pentium 4 processor 92
4.9 Modeling the straightforward method for the Opteron processor 92
4.10 Modeling the straightforward method for the Opteron processor with Gaudry's patch 93
4.11 Modeling the straightforward method for the Pentium EE processor 93
4.12 The rolled delayed carry release routine for the tile method 97
4.13 Modeling the tile method on the Pentium EE architecture 99
4.14 Modeling the tile method on the Opteron architecture 100
4.15 Modeling the tile method on the Pentium 4 architecture 100
4.16 Modeling the tile method on the UltraSPARC III architecture 101
4.17 Impact of changing the number of IEUs on the lower bound for the computing time of the tile method for AMD Opteron architecture 102
5.1 The tile method is faster than the asymptotically superior divide and conquer method for a wide range of degrees 106
5.2 All asymptotically fast methods are slower than the divide and conquer method on the Pentium EE. The convolution method is over 80x slower than the tile method and is not shown 107
5.3 All asymptotically fast methods are slower than the divide and conquer method on the Opteron. The convolution method is over 110x slower than the tile method and is not shown 107
5.4 All asymptotically fast methods are slower than the divide and conquer method on the Pentium 4. The convolution method is over 50x slower than the tile method and is not shown 108
5.5 All asymptotically fast methods are slower than the divide and conquer method on the UltraSPARC III. The convolution method is over 200x slower than the tile method and is not shown 108
5.6 Using 64-bit arithmetic on the Pentium EE improves the crossover point. The convolution method is over 30x slower than the tile method and is not shown 113
5.7 Using 64-bit arithmetic on the Opteron improves the crossover point. The convolution method is over 25x slower than the tile method and is not shown 114
5.8 Using 64-bit arithmetic on the Pentium 4 improves the crossover point. The convolution method is over 8x slower than the tile method and is not shown 114
5.9 Using 64-bit arithmetic on the UltraSPARC III improves the crossover point. The convolution method is over 62x slower than the tile method and is not shown 115
5.10 Gaudry's patch improves the crossover point on the Opteron. The convolution method is over 19x slower than the tile method and is not shown. 116
6.1 The computation structure of (a) Taylor shift by 1 and (b) de Casteljau's algorithms 118
6.2 (a) The pattern of integer additions in Pascal's triangle, $a_{i,j} = a_{i,j-1} + a_{i-1,j}$, can be used to perform Taylor shift by 1. (b) In de Casteljau's algorithm all dependencies are reversed; the intermediate results are computed according to the recursion $b_{j,i} = b_{j-1,i} + b_{j-1,i+1}$ 119
6.3 Register tiling can be applied to (a) Taylor shift by 1 and (b) de Casteljau's algorithm. Arrows show direction of addition 119
6.4 Speedup with respect to the monomial SACLIB implementation for random polynomials on four architectures 126
6.5 Speedup with respect to the monomial SACLIB implementation for Chebyshev polynomials on four architectures 127
6.6 Speedup with respect to the monomial SACLIB implementation for reduced Chebyshev polynomials on four architectures 128
6.7 Speedup with respect to the monomial SACLIB implementation for Mignotte polynomials on four architectures 129
Abstract
Architecture-aware Taylor Shift by 1
Anatole D. Ruslanov Advisors: Jeremy R. Johnson and Werner Krandick
We introduce register tiling for optimizing series of multiprecision additions.
Our new tile method for designing an architecture-aware classical Taylor shift by 1
algorithm—a low-level operation important to the monomial bases variant of the
Descartes method for polynomial real root isolation—runs up to 7 times faster than standard implementations that call the efficient integer addition routines from the GNU Multiple Precision Arithmetic Library [44].
Our tile method for the Taylor shift by 1 algorithm requires more word additions, but it reduces the number of cycles per word addition by decreasing memory traffic
and the number of carry computations. To enable standard compilers to tile the
algorithm, we introduce signed digits, suspended normalization, radix reduction,
and delayed carry propagation.
The performance of our tile method depends on several parameters that can be
modeled for and tuned to the underlying architecture. We show how such modeling
can guide automatic code generation and automatic experimentation to adapt an
algorithm to the underlying architecture for better ILP and pipeline utilization. We
automatically generate our tile method for Taylor shift by 1 in a high-level language
and tune it to four different processor architectures.
The architecture-aware tile method outperforms four asymptotically fast methods up to degree 6000 on the four hardware platforms. We analyze the feasibility of constructing high-performance architecture-aware fast methods.
Using our register tiling technique, we automatically generate and tune de Casteljau's algorithm, an operation with a similar pattern of additions. The algorithm—"probably the most fundamental computation in the field of curve and surface design" (G. Farin [33])—is the main subalgorithm of the Bernstein bases variant of the Descartes method. We obtain similar performance gains.
Applying our architecture-aware algorithms, we compare performance of several
implementations of the monomial bases and Bernstein bases variants of the Descartes
method on four processor architectures and for three classes of input polynomials.
All variants have the same asymptotic computing time bound. The comparison
shows that the best absolute computing times are obtained on an Opteron processor
platform using the Bernstein-bases variant of the Descartes method with register
tiling.
Foreword
Computer algebra systems, such as Maple and Mathematica, provide many efficient algorithms for exact computation with mathematical objects such as arbitrary precision integers, rational numbers, polynomials, and, more generally, mathematical expressions. Improved algorithms, better implementations, and faster computers have enabled many previously time-consuming computer algebra computations to be performed routinely. However, many computations still require excessive computing time, and there are many cases where the performance achieved by an implementation could be dramatically improved.
Many challenges for achieving high performance in computer algebra algorithm implementations are due to their irregular structure and higher level data types. Most of the work in high-performance algebraic algorithm design has been focused on reducing arithmetic complexity and bit-complexity when the size of numbers is important. However, simply reducing the number of arithmetic operations or using optimized implementations of basic arithmetic operations is insufficient. An algorithm-level perspective that considers the entire computation—not just its parts—must be adopted instead.
Modern computers are complex systems that incorporate features such as pipelining, superscalar execution, speculative computing, and multilevel memory hierarchies to achieve high performance. These features, when properly utilized, can lead to dramatic improvement in performance. However, effective utilization of the processor is a highly non-trivial problem, which cannot simply be left to the compiler. The complex interactions of the features make it difficult to predict performance and have led to an empirically based approach called automated performance tuning [70]. In fact, effective utilization of these features can be more important than reducing the number of arithmetic operations in obtaining high-performance code and can lead to an order of magnitude improvement in performance.
Achieving effective algorithm-level optimizations typically requires transformations such as high-level restructuring of the algorithm, changing data structures, and reordering operations to overcome dependencies. Programming is done in a high-level language for portability and ease of maintenance. Portable coding also simplifies automatic code generation and performance tuning, which finds the best algorithm for high performance on a particular architecture through automatic experimentation.
An execution model that takes features of the architecture into account would be helpful in guiding the choice of transformations and optimizations and selecting the best implementation. However, as indicated, the complexities of modern processors along with the lack of detail provided by hardware vendors make this difficult. The difficulty in accurately modeling performance leads to a more empirical approach relying on benchmarking and profiling (including measuring the utilization of the features of the processor) which searches for the best implementation. Nonetheless, modeling, while not always an accurate predictor of performance, can provide insight and reduce the amount of search required.
Designing high-performance algorithms for modern processors requires considerable effort. We think of our work as computing with abstractions that arise from the architecture versus the more usual computing with abstractions that arise from the underlying mathematical operations. This allows for a more architecture-centered approach to designing high-performance algorithms.
Summary of contributions
My thesis has addressed the following problems and has made the following
contributions:
1. We have applied known high-performance architecture-aware algorithm design techniques—in particular, register tiling optimizations—to Taylor shift by 1. The algorithm has a pattern of additions that a compiler should be able to optimize for high performance. Tuning multiprecision ("bignum") integer addition, however, is a challenge because compilers cannot perform such optimizations without an understanding of the algorithmic domain. We have avoided low-level assembly coding with a high-level language algorithm that exploits features of the architecture and enables the compiler to perform optimizations that it otherwise would have been prevented from performing. We also reduced implementation time and software maintenance cost with automatic code generation and tuning to a target architecture.
2. We have determined that there is a large range of input sizes for which our classical approach to Taylor shift by 1 outperforms asymptotically fast approaches. While our experiments utilized existing implementations of several asymptotically fast algorithms for Taylor shift by 1 that were not architecture-aware, we have performed extensive profiling and investigated several approaches for redesign to further improve their performance. We have demonstrated that effective utilization of features of the computer architecture significantly affects crossover points. However, our results suggest that, even with these enhancements, there is a wide range of inputs (up to degree 6000 in our studies), covering most practical sizes, where the tuned classical approaches significantly outperform the asymptotically fast algorithms.
3. We have implemented a high-performance de Casteljau's algorithm by applying the knowledge gained from designing the architecture-aware Taylor shift by 1. De Casteljau's algorithm—a fundamental computation for curve and surface design—has a similar pattern of additions and benefits from the same optimization techniques as the Taylor shift by 1 computation.
4. We have used the high-performance Taylor shift by 1 and de Casteljau's algorithms in polynomial real root isolation. Using efficient kernels is not straightforward due to incompatible data structure interfaces, inability to apply high-level optimization across calls to the kernel routines, and the need for special instances of these kernel routines. We used our high-performance versions of the Taylor shift by 1 and de Casteljau's algorithms for comparing the performance of algorithmically tuned implementations of the monomial and Bernstein variants and architecture-unaware implementations of both variants on four different processor architectures and for three classes of input polynomials. The comparison shows that the best absolute computing times are obtained on an Opteron processor platform using the Bernstein-bases variant of the Descartes method with register tiling.
The results of this dissertation have led to the following publications:
1. High-performance architecture-aware Taylor shift by 1, (with Jeremy R. Johnson and Werner Krandick), 10th International Conference on Applications of Computer Algebra, Lamar University, July 21-23, 2004, Beaumont, Texas.
2. Architecture-aware classical Taylor shift by 1, (with Jeremy R. Johnson and Werner Krandick), International Symposium on Symbolic and Algebraic Computation, pages 200-207, ACM Press, 2005.
3. Using high-performance Taylor shift by 1 in real root isolation, (with Jeremy R.
Johnson and Werner Krandick), 11th International Conference on Applications
of Computer Algebra, Nara Women's University, July 31 - August 3, 2005,
Nara, Japan.
4. High-performance implementations of the Descartes method, (with Jeremy R. Johnson, Werner Krandick, Kevin M. Lynch, David G. Richardson), International Symposium on Symbolic and Algebraic Computation, pages 154-161, ACM Press, 2006.
1. Preliminaries
1.1 Introduction
This thesis is a study of methods for computing Taylor shift by 1, a low-level operation that is important for the monomial bases variant of the Descartes method of polynomial real root isolation, an essential algorithm in computer algebra systems. Both classical and asymptotically fast methods of Taylor shift by 1 are studied. The thesis presents a new architecture-aware method that introduces register tiling optimization techniques for algorithms that involve patterns of multiprecision additions. Since our register tile method (see Chapter 3) outperforms the asymptotically fast methods for a wide range of degrees, it is also a study of how useful the theoretically fast algorithms are in practical applications. In addition, we apply the register tiling technique to a similar algorithm: de Casteljau's algorithm, a fundamental method in computer-aided design [33] which is the main subalgorithm of the Bernstein-bases variant of the Descartes method for real root isolation. We then apply the tiled Taylor shift by 1 and de Casteljau's algorithms to their respective variants of the Descartes method and experimentally compare the performance of both variants for three classes of input polynomials on four different processor architectures.
1.2 Thesis organization
This thesis is organized as follows. In Chapter 1, we define Taylor shift by 1 as a series of additions that compute elements of Pascal's triangle, discuss computer architecture and its influence on performance, discuss our experimental methodology, describe the four architecture platforms used in our experiments, and review previous work. In Chapter 2, we discuss the straightforward methods for computing Taylor shift by 1 based on the GNU-MP [44] and NTL [82] libraries as well as the SACLIB [22] method. Chapter 3 presents the tile method of computing Taylor shift by 1, including its performance advantages on the UltraSPARC III architecture [49, 86]. We also discuss automatic code generation and tuning in that chapter. Chapter 4 describes modeling GNU-MP [44] addition, the straightforward method, and the tile method. Chapter 5 presents asymptotically fast methods for computing Taylor shift by 1, a performance comparison to the tile method, and our research into ways of improving the performance of the fast algorithms. In Chapter 6 we discuss applying register tiling for implementing a high-performance de Casteljau's algorithm. We then apply the tiled implementation of de Casteljau's algorithm to derive and compare several high-performance variants of the Descartes method of real root isolation. In the final Chapter 7, we present ideas for further research.
1.3 Taylor shift by 1: Analysis
Let A(x) be a univariate polynomial with integer coefficients. Taylor shift by 1 is the operation that computes the coefficients of the polynomial B(x) = A(x + 1) from the coefficients of the polynomial A(x). Taylor shift by 1 is the most time-consuming subalgorithm of the monomial bases variant of the Descartes method [21] for polynomial real root isolation. Taylor shift by 1 can also be used to shift a polynomial by an arbitrary integer a. Indeed, if B(x) = A(ax) and C(x) = B(x + 1) and D(x) = C(x/a), then D(x) = A(x + a). According to Borowczyk [12], Budan proved this fact in 1811.
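The reduction of a shift by an arbitrary integer a to a shift by 1 is easy to state in code. The sketch below is our own illustration (the function names are ours, not from SACLIB or GNU-MP); Python's built-in arbitrary-precision integers stand in for multiprecision arithmetic:

```python
def taylor_shift_1(coeffs):
    """Taylor shift by 1: given [a_0, ..., a_n] (coefficient of x^i at
    index i), return the coefficients of A(x + 1), using only additions."""
    b = list(coeffs)
    n = len(b) - 1
    # Each pass adds every coefficient to its lower neighbor; these are
    # exactly the integer additions of Pascal's triangle.
    for k in range(n):
        for i in range(n - 1, k - 1, -1):
            b[i] += b[i + 1]
    return b


def taylor_shift_by_a(coeffs, a):
    """Shift by an arbitrary nonzero integer a via the composition
    B(x) = A(ax), C(x) = B(x + 1), D(x) = C(x/a) = A(x + a)."""
    b = [c * a**i for i, c in enumerate(coeffs)]    # B(x) = A(ax)
    c = taylor_shift_1(b)                           # C(x) = B(x + 1)
    return [ci // a**i for i, ci in enumerate(c)]   # division is exact
```

The final divisions are exact because D(x) = A(x + a) has integer coefficients, so each coefficient of C(x) is a multiple of the corresponding power of a.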
Theorem 1.3.1. Let $A(x) = \sum_{0 \le i \le n} a_i x^i$ be a polynomial of degree n. Taylor shift by 1 computes the Taylor expansion $B(x) = \sum_{0 \le i \le n} b_i x^i = A(x+1) = \sum_{0 \le i \le n} a_i (x+1)^i$ where $b_i = \binom{n}{i} a_n + \binom{n-1}{i} a_{n-1} + \cdots + \binom{i}{i} a_i$ for $i = 0, \ldots, n$.

Proof. Induction on n using the binomial theorem. □
We will call a method that computes Taylor shift by 1 classical if the method uses only additions and computes the intermediate results given in Definition 1.3.2.

Definition 1.3.2. For any non-negative integer n let $I_n = \{(i,j) \mid i,j \ge 0 \wedge i+j \le n\}$. If n is a non-negative integer and

$A(x) = a_n x^n + \cdots + a_1 x + a_0$

is an integer polynomial we let, for $k \in \{0, \ldots, n\}$ and $(i,j) \in I_n$,

$a_{-1,k} = 0$, $\quad a_{k,-1} = a_{n-k}$, $\quad a_{i,j} = a_{i,j-1} + a_{i-1,j}$,

as shown in Figure 1.1.
Theorem 1.3.3. Let n be a non-negative integer, and let $A(x) = a_n x^n + \cdots + a_1 x + a_0$
Figure 1.1: By Theorem 1.3.3, the pattern of integer additions in Pascal's triangle, $a_{i,j} = a_{i,j-1} + a_{i-1,j}$, can be used to perform Taylor shift by 1.
be an integer polynomial. Then, in the notation of Definition 1.3.2,

$A(x+1) = \sum_{h=0}^{n} a_{n-h,h} x^h$.
Proof. The assertion clearly holds for n = 0; so we may assume n > 0. For every $k \in \{0, \ldots, n\}$ let $A_k(x) = \sum_{h=0}^{k} a_{k-h,h} x^h$. Figure 1.2 shows that the coefficients of the polynomial $A_k$ reside on the k-th diagonal of the matrix of Figure 1.1. Then, for all $k \in \{0, \ldots, n-1\}$, we have $A_{k+1}(x) = (x+1) A_k(x) + a_{n-(k+1)}$. Now an easy induction on k shows that $A_k(x) = \sum_{h=0}^{k} a_{n-k+h} (x+1)^h$ for all $k \in \{0, \ldots, n\}$. In particular, $A_n(x) = \sum_{h=0}^{n} a_h (x+1)^h = A(x+1)$. □
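Definition 1.3.2 and Theorem 1.3.3 translate directly into a few lines of code. The sketch below is our illustration (not the thesis implementation): it fills the tableau entry by entry and reads the shifted coefficients off the diagonal $i + j = n$:

```python
def taylor_shift_pascal(a):
    """Classical Taylor shift by 1 via the tableau of Definition 1.3.2.

    a = [a_n, ..., a_1, a_0] (descending), so a[i] = a_{n-i}.
    Returns [b_n, ..., b_0] with B(x) = A(x + 1)."""
    n = len(a) - 1
    t = {}
    for i in range(n + 1):
        for j in range(n + 1 - i):
            left = t[(i, j - 1)] if j > 0 else a[i]  # boundary a_{i,-1} = a_{n-i}
            up = t[(i - 1, j)] if i > 0 else 0       # boundary a_{-1,j} = 0
            t[(i, j)] = left + up                    # a_{i,j} = a_{i,j-1} + a_{i-1,j}
    # Theorem 1.3.3: b_h = a_{n-h,h}, the entries on the diagonal i + j = n.
    return [t[(n - h, h)] for h in reversed(range(n + 1))]
```

For example, the input $x^2$ (coefficients [1, 0, 0]) yields [1, 2, 1], the coefficients of $(x+1)^2$.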
Definition 1.3.4. Let a be an integer. The binary length of a is defined as $L(a) = 1$ if $a = 0$, and $L(a) = \lfloor \log_2 |a| \rfloor + 1$ otherwise.

Definition 1.3.5. The max-norm of an integer polynomial $A = a_n x^n + \cdots + a_1 x + a_0$ is $|A|_\infty = \max(|a_n|, \ldots, |a_0|)$.
Figure 1.2: The coefficients of the polynomial $A_k(x)$ in the proof of Theorem 1.3.3 reside on the k-th diagonal. Multiplication of $A_k(x)$ by $(x+1)$ can be interpreted as an addition that follows a shift to the right and a downward shift.
The SACLIB method (see Section 2.4) and the new tile method (see Section 3.2) for Taylor shift by 1 computation require a bound on the binary lengths of the intermediate results $a_{i,j}$.

Theorem 1.3.6. Let n be a non-negative integer, and let $A(x) = a_n x^n + \cdots + a_1 x + a_0$ be an integer polynomial of max-norm d. Then, for all $(i,j) \in I_n$,

1. $a_{i,j} = \binom{i+j}{j} a_n + \binom{i+j-1}{j} a_{n-1} + \cdots + \binom{j}{j} a_{n-i}$, and

2. $L(a_{i,j}) \le L(d) + i + j$.
Proof. Assertion (1) follows from Definition 1.3.2 by induction on $i+j$. Due to assertion (1),

$|a_{i,j}| \le \left[ \binom{i+j}{j} + \binom{i+j-1}{j} + \cdots + \binom{j}{j} \right] d = \binom{i+j+1}{j+1} d \le 2^{i+j} d$,

which proves assertion (2). □
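The bound of assertion (2) is easy to spot-check numerically; the snippet below is a sanity check we add (not part of the thesis), using the fact that Python's `int.bit_length()` computes exactly $L(a)$ for nonzero a:

```python
def binary_length(a):
    """L(a) from Definition 1.3.4: 1 for a = 0, floor(log2 |a|) + 1 otherwise."""
    return 1 if a == 0 else abs(a).bit_length()


def check_length_bound(a):
    """Check L(a_{i,j}) <= L(d) + i + j over the whole tableau of
    Definition 1.3.2, for a = [a_n, ..., a_0] of max-norm d."""
    n = len(a) - 1
    Ld = binary_length(max(abs(c) for c in a))   # L(d), d = |A|_inf
    t = {}
    for i in range(n + 1):
        for j in range(n + 1 - i):
            left = t[(i, j - 1)] if j > 0 else a[i]  # a_{i,-1} = a_{n-i}
            up = t[(i - 1, j)] if i > 0 else 0       # a_{-1,j} = 0
            t[(i, j)] = left + up
            assert binary_length(t[(i, j)]) <= Ld + i + j
    return True
```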
Remark 1.3.7. Theorem 1.3.6 implies that, for degree n and max-norm d, the binary length of all intermediate results is at most $L(d) + n$. The SACLIB method (Section 2.4) can be slightly improved for small-degree polynomials by tightening that bound for $n \in \{8, \ldots, 39\}$ to $L(d) + n - 1$, for $n \in \{40, \ldots, 161\}$ to $L(d) + n - 2$, and for $n \in \{162, \ldots, 649\}$ to $L(d) + n - 3$.
We will use Theorem 1.3.8 to prove lower bounds for the computing time of two classes of input polynomials.

Theorem 1.3.8. Let n be a non-negative integer. Then at least n/2 of the binomial coefficients $\binom{n}{k}$, $0 \le k \le n$, have binary length $\ge n/2$.
Proof. By direct computation, the assertion is true for all $n \in \{0, \ldots, 19\}$, so we may assume $n \ge 20$. We then have

$n - \lfloor n/4 \rfloor + 1 \ge 4^2$.

Also, for $0 \le i < \lfloor n/4 \rfloor$,

$\frac{n-i}{\lfloor n/4 \rfloor - i} \ge \frac{n}{\lfloor n/4 \rfloor} \ge \frac{n}{n/4} = 4$,

so that

$\binom{n}{\lfloor n/4 \rfloor} = \frac{n}{\lfloor n/4 \rfloor} \cdot \frac{n-1}{\lfloor n/4 \rfloor - 1} \cdots \frac{n - \lfloor n/4 \rfloor + 1}{1} \ge 4^{\lfloor n/4 \rfloor + 1} = 2^{2\lfloor n/4 \rfloor + 2} \ge 2^{n/2}$.

Hence, the binary length of each binomial coefficient

$\binom{n}{\lfloor n/4 \rfloor}, \binom{n}{\lfloor n/4 \rfloor + 1}, \ldots, \binom{n}{n - \lfloor n/4 \rfloor}$

is $\ge n/2$. But the number of those coefficients is $> n/2$. □
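The theorem can likewise be confirmed by direct computation, which is exactly how the proof handles n < 20. A quick check of our own in Python:

```python
from math import comb

def long_binomials(n):
    """Number of binomial coefficients C(n, k), 0 <= k <= n,
    whose binary length is at least n/2."""
    # bit_length() equals the binary length L of Definition 1.3.4
    # for these (positive) integers.
    return sum(1 for k in range(n + 1) if comb(n, k).bit_length() >= n / 2)

# Theorem 1.3.8 asserts long_binomials(n) >= n/2 for every non-negative n.
```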
Theorem 1.3.10 and the proof of Theorem 1.3.9 characterize the computing time functions of classical Taylor shift on the sets of polynomials $B_{n,d}$ and $C_{n,d}$; see Definition 1.5.1 in Section 1.5.4. We use the concept of dominance defined by Collins [20] since it hides fewer constants than the more widely used big-Oh notation; Collins also defines the maximum computing time function.

Theorem 1.3.9. Let $t^+(n,d)$ be the maximum computing time function for classical Taylor shift by 1 where $n \ge 1$ is the degree and d is the max-norm. Then $t^+(n,d)$ is co-dominant with $n^3 + n^2 L(d)$.
Proof. The recursion formula in Definition 1.3.2 is invoked $|I_n| = (n+1)(n+2)/2$ times.
Hence the number of integer additions is dominated by $n^2$. By Theorem 1.3.6, the binary length of any summand is at most $L(d) + n$. Thus the computing time is dominated by $n^2 \cdot (L(d) + n)$.

We now show that, for the input polynomials $B_{n,d}$, the computing time dominates $n^3 + n^2 L(d)$. Since, for any fixed $n \ge 1$, the computing time clearly dominates $L(d)$ we may assume $n \ge 2$. By Theorem 1.3.6 (1),

$a_{i,j} = \binom{i+j+1}{j+1} d$

for all $(i,j) \in I_n$. For $k = i + j \ge 2$, Theorem 1.3.8 yields that at least $(k+1)/2$ of the binomial coefficients

$\binom{k+1}{1}, \binom{k+1}{2}, \ldots, \binom{k+1}{k+1}$

have binary length $\ge (k+1)/2$. So, for all $k \in \{2, \ldots, n\}$ there are at least $(k+1)/2$ integers $a_{i,j}$ with $i+j = k$ and

$L(a_{i,j}) \ge L(d) - 1 + \frac{k+1}{2}$.

Now the assertion follows by summing all the lengths. □
Our proof of Theorem 1.3.10 assumes that the time to add two non-zero integers
a, b is co-dominant with L(a) + L(b); Collins [20] makes the same assumption in his
analysis.
Theorem 1.3.10. The computing time function of classical Taylor shift by 1 on the
set of polynomials $C_{n,d}$ of Definition 1.5.1 is co-dominant with $n^3 + L(d)$.
Proof. By Theorem 1.3.6, $a_{n,0} = d + 1$ and, for $(i, j) \in I_n - \{(n, 0)\}$,
$$a_{i,j} = \binom{i+j}{j}.$$
Hence, by Theorem 1.3.8, for any $k \in \{0, \dots, n\}$, at least half of the integers $a_{k,0}$,
$a_{k-1,1}, \dots, a_{0,k}$ have binary length $\ge k/2$. Since all of them, except possibly $a_{n,0}$,
have binary length $\le k$, we have that
$$-L(a_{n,0}) + \sum_{k=0}^{n} \sum_{j=0}^{k} L(a_{k-j,j}) \sim n^3.$$
But the time to compute $a_{n,0}$ is co-dominant with $L(d)$, and so the total computing
time is co-dominant with $n^3 + L(d)$. □
1.4 Performance and computer architecture
Effective computation means meeting performance expectations. It also motivates
the joint study of algorithms and computer architecture in pursuit of future
high-performance computing goals.
Relying on compilers alone for high performance is a mistake: compilers do not
know the application domain and cannot optimize to the depth that is possible when
the domain is well understood. We have shown that compilers cannot deliver high
performance even for a relatively basic classical algorithm [59], and certainly not
for its asymptotically fast variant, see Chapter 5.

Figure 1.3: Pipelining delivers efficient execution of machine instructions.
This section reviews the features of computer architecture that have become
important recently and have significant influence on performance. We begin with a
discussion of modern pipelining techniques followed by a discussion of the memory
hierarchy.
1.4.1 Pipelined processors
Pipelined processors deliver efficient execution of machine instructions by fetching
and executing several instructions per cycle. Instruction execution is partitioned
into several steps, and these steps are overlapped: when an instruction has completed
a step, the next instruction can use the hardware for that step, see Figure 1.3.
[46, 14, 29]
Dependencies that present problems for smooth pipelining are called pipeline
hazards. The hazard conditions occur when the next instruction in the instruction
stream is prevented from being executed during its designated clock cycle because
either the result of a previous instruction still in the pipeline is not yet available or
the instruction to be executed itself is not known. [46, 14, 29]
For example, an instruction that is moving through the Execute stage must have
a value to operate upon. If this value is not yet available (for instance, when the
preceding instruction is a memory load, which computes the reference address in the
Execute stage and fetches the data in the Commit stage), the instruction cannot
proceed and must be stalled until the data arrives. [46, 14, 29]
Mispredicted branches are an example of a control hazard, another common
pipelining concern. Branch instructions must move through several pipeline stages
before the target address is known, see Figure 1.4. Meanwhile, other instructions
must enter the pipeline to keep it operating. Current processors use a variety of
prediction algorithms to "guess" which instructions to execute while waiting for the
branch to resolve. Effective branch prediction circuitry is important for superscalar
pipelines (see Section 1.4.2 below) because mispredicted branches may cause a
slowdown by a factor of 10 to 30: several pipeline stages must be cleared out, each
containing more than one instruction. [46, 14, 29]
Pipeline hazards must be avoided because they force sequential execution and,
hence, degrade pipeline performance. Compilers strive to schedule instructions so
that the dependencies have time to be resolved, and algorithms can be designed for
easy scheduling. Most modern processors are capable of out-of-order execution, a
hardware feature that rearranges instructions in order to move them through the
pipeline with minimal stall interruptions. Register renaming is used to reduce the
artificial dependencies between registers that are imposed by the limited number of
registers visible to the compiler. [46, 14, 29, 40, 24, 26, 25]
For high performance, memory references immediately followed by dependent
instructions, and control structures with irregular behavior in particular, should be
avoided. Compilers are good at eliminating dependency hazards associated with
memory references by rescheduling the instructions involved. Control structures are
harder to optimize because they are usually part of larger conceptual constructs:
algorithm design, abstract data types, and easy-to-maintain top-level code. [46, 14, 29]
There are three fundamental ways to improve pipeline performance through
hardware design: improve manufacturing techniques to raise the clock rate; introduce
a longer pipeline with smaller steps to increase the clock rate without improvements
in manufacturing (an approach taken with the Pentium 4 and now being abandoned);
and increase superscalar execution, where several simultaneous pipelines execute
more than one instruction per cycle. Improving superscalar execution is the common
approach today, and it can be exploited effectively through high-level language
scheduling to elicit high performance. [46, 14, 25, 2, 4, 86]
1.4.2 Superscalar execution (ILP)
Modern pipelined processors are designed for superscalar execution, also called
instruction-level parallelism (ILP). The processors have many pipelines, each with
many functional units, for executing several instructions in the same clock cycle.
This is accomplished by designing a dispatch unit that sends several instructions in
parallel down several pipelines and a commit unit that completes the instructions
so that correctness is assured. [14, 46, 29]

Figure 1.4: Pipeline stalls caused by dependencies such as mispredicted branches may cause a slowdown by a factor of 10 to 30.
Figure 1.5 illustrates the superscalar capabilities of the UltraSPARC III processor,
which has 6 pipelines that can simultaneously execute up to 4 independent
instructions. The UltraSPARC III processor is thus capable of up to a 4x performance
gain over a non-ILP processor. [49, 86]
For high performance on a superscalar pipelined processor, independent instructions
must be scheduled (or packed) appropriately so that the processor can dispatch
them in the same cycle. A program must have enough usable parallelism to
accomplish this. While ILP is conceptually straightforward, its implementation is
complicated by "precedence" hazards. [14, 46, 29]

Figure 1.5: The superscalar feature of the UltraSPARC III processor accommodates up to 4 simultaneous instructions in its Execute stage.
1.4.3 Memory hierarchy
The memory hierarchy exists to deal with the ever-increasing performance gap
between the processor and random access memory (RAM). Memory systems are
designed to provide the illusion of a very large memory that can be accessed as fast
as a very small one. Without a well-designed hierarchical structure, the memory
system would be either expensive or slow. [46, 14, 29]
At the top of the memory hierarchy is the processor register file, which consists of
an array of very fast n-bit SRAM registers, where n is the width of the hardware word
in bits. The register file is part of the processor and is the fastest part of the memory
hierarchy. Machine instructions reference the registers directly; the processor must
fetch data into the registers for all computations.
The memory hierarchy consists of a number of caches or buffers between the
main memory (RAM) at the bottom of the hierarchy and the processor register file.
A cache is a small but fast memory that is used to store or prefetch items that have
been recently referenced or are likely to be referenced soon. Caches greatly speed up
memory access due to the Principle of Locality, which states that programs tend to
access a relatively small portion of their address space at any instant of time. This
allows a small but fast memory buffer (i.e., a cache) near the registers to contain
nearly 100% of the data and instructions required at the time of execution.
[46, 14, 29]
The memory hierarchy of a 1 GHz UltraSPARC III processor, a typical design
for current processors, is presented in Figure 1.6. It takes 1 processor cycle to
reach data in the registers. The L1 cache has a latency of 2 to 3 cycles, i.e., it takes up to
3 cycles to transfer data from the cache to a register. The latency of the L2 cache
is greater, typically 10 to 20 cycles. The latency for transferring data from the main
memory to the L2 cache is up to 200 cycles. The two levels of cache reduce
the gap between fast CPU clock rates and the relatively long time needed to access
memory. [49, 86]
A cache miss occurs when the data (or instruction) accessed is not in the cache;
a cache hit occurs when it is. The miss rate measures how well a particular
program behaves with respect to a cache; it is influenced by algorithm coding
techniques, by compiler efficiency, and by hardware design. When a cache miss
occurs, the cost of reaching the data equals the latency of reaching the next level of
the hierarchy. L1 caches are optimized for fast hits; L2 caches are optimized for low
miss rates. For high performance, data should be shifted toward the processor
registers. [14, 29, 7, 54]
Figure 1.6: Memory hierarchy of a 1 GHz UltraSPARC III processor.
Cache organization
Caches are organized by the way they reference and store data and by the size
of the data line (block size).
Direct-mapped caches have a simple architecture: a block from memory can map
to only one location in the cache (found by using part of the address as an index into
the cache). These caches tend to be the fastest because they have the smallest number
of hardware comparators. However, they are vulnerable to regular memory access
patterns: if the cache is accessed at a stride that causes every memory reference
to map to the same location in the cache, the miss rate will approach 100%. A simple
direct-mapped cache is shown in Figure 1.7. [46]
Fully associative caches are the direct opposite of direct-mapped caches in their
organization. A memory location can be placed anywhere in the cache; data in the
cache is replaced using a least recently used (LRU) strategy or some approximation
of LRU. These caches have the lowest miss rate but are impractical because of their
high hardware cost; to be effective they must be small. They tend to be slow due
to the number of comparisons required to determine whether the data is in the
cache. [46]
Figure 1.7: A simple direct-mapped cache.
Set-associative caches are a compromise between these two architectures. They
are indexed like direct-mapped caches but have several places where data can
be stored for each indexed location (set). An n-way set-associative cache has n fully
associative locations per set; the sets are direct mapped. Set-associative caches are
fast, and their performance is usually similar to that of fully associative caches. [46]
Block fetching and prefetching
In order to further reduce miss rate, caches fetch several words from memory
at a time. A block, also known as a cache line, is a group of contiguous words
that are transferred to a cache simultaneously. Block size (in words) is specific to a
particular architecture and is usually a power of 2. If a particular word is referenced
by the processor, all words that belong to the block will also be fetched. This
results in a substantial reduction in cache misses due to the Principle of Locality (see
Section 1.4.3)—particularly when fetching instructions and traversing arrays. [46, 29]
Most modern processors also feature hardware prefetching, where data is brought
into the cache ahead of memory references; this is also known as load prediction. In
addition, all modern instruction set architectures (ISAs) include software prefetching
instructions that bring data into the cache ahead of use. However, software
prefetching can degrade performance if it interferes with hardware
prefetching. [46, 29, 25, 2, 4, 49]
1.5 Methodology and experimental procedures
In this section, we describe the hardware platforms, profiling techniques, and
input polynomials used in our experiments.
1.5.1 Processor architecture
Our tile methods (see Chapter 3 and Section 6.1) achieve their speedup primarily
by using delayed carry propagation and register tiling, which respectively reduce
memory traffic and improve locality of reference. The computation schedule for the
register tiles allows multiple integer execution units to be used simultaneously, see
Section 3.3.1. When the tile methods are implemented for a given processor, the
maximum speedup that can be obtained is determined by the precision of the native
integer arithmetic (i.e., the width of the hardware registers), the number of
general-purpose integer registers, and the number of integer execution units. The
speedup is greater with a higher native integer precision and with larger numbers of
general-purpose integer registers and integer execution units, see Section 3.5.
64-bit processors
Current processors such as the Pentium EE [27], Opteron [3, 2], and
UltraSPARC III [49, 86] that support native 64-bit integer operations have at least 16
64-bit general-purpose integer registers and at least 2 integer execution units. The
tile method was developed for such processors. We briefly summarize the relevant
features of these three processors below.
Pentium EE: The Intel Pentium Extreme Edition (EE) dual-core processor
supports both the 32-bit x86 and the 64-bit EM64T instruction sets. Each core of the
Pentium EE provides 16 64-bit general-purpose integer registers and has an 8-way
set-associative 16-kilobyte L1 data cache and an 8-way set-associative 1-megabyte
L2 cache. Each core has 2 ALUs that are each capable of 2 arithmetic operations per
cycle. The processor is capable of register renaming, out-of-order execution, dynamic
cache prefetching, and dynamic branch prediction. The number of Pentium EE
pipeline stages has not been publicly disclosed. [27]
Opteron: The AMD Opteron processor supports the 32-bit x86 and the 64-bit
AMD64 instruction sets. The Opteron provides 16 64-bit general-purpose integer
registers and has a 2-way set-associative 64-kilobyte L1 data cache and a 4-way
set-associative 1-megabyte L2 cache. The Opteron processor has 3 ALUs that can
be independently engaged to decode, execute, and retire 3 x86 instructions per cycle
in its 20-stage pipeline. The processor is capable of register renaming, out-of-order
execution, and dynamic branch prediction. [3, 2]
UltraSPARC III: The Sun UltraSPARC III processor supports the SPARC
V9 instruction set. The UltraSPARC III provides 32 64-bit general-purpose integer
registers and has a 64-kilobyte 4-way set-associative L1 data cache and an
8-megabyte 2-way set-associative L2 cache. Its superscalar architecture provides six
14-stage pipelines, four of which can be independently engaged. Two of the pipelines
perform integer operations, two perform floating-point operations, one performs
memory accesses, and one performs branch instructions. The processor is capable of
speculative execution of branch instructions and memory loads. [49, 86]
32-bit processors
The Pentium 4 [26] is included for comparison only and is not expected to perform
well with the tile methods due to the unavailability of native 64-bit integer
arithmetic and the small number of general-purpose integer registers.
Pentium 4: The Intel Pentium 4 processor supports the 32-bit x86 instruction
set. The Pentium 4 provides 8 32-bit general-purpose integer registers and has
a 16-kilobyte 8-way set-associative L1 data cache and a 1-megabyte 8-way
set-associative L2 cache. The Pentium 4 processor has 2 ALUs that are each capable
of 2 operations per cycle. The processor has a 20-stage pipeline and is capable of
register renaming, out-of-order execution, dynamic cache prefetching, and dynamic
branch prediction. [26]
1.5.2 Hardware configuration
The hardware platforms used in this study are configured as follows:
Pentium EE: We use a Pentium Extreme Edition 840 dual-core CPU with a
clock speed of 3.2 GHz and 1 GB of main memory. The Gentoo Linux distribution
with the 2.6.14-gentoo-r2 kernel is installed. Hyper-Threading is disabled in
the BIOS.

Opteron: We use an Opteron 244 with a clock speed of 1.8 GHz and 2 GB of
main memory. The Gentoo Linux distribution with the 2.6.14-gentoo-r2 kernel is
installed.

UltraSPARC III: We use a Sun Blade 2000 with two 900 MHz UltraSPARC III
processors and 2 GB of main memory. The Solaris 9 operating system is installed.

Pentium 4: We use a Pentium 4 with a clock speed of 3.0 GHz and 1 GB of
main memory. The Fedora Core 2 Linux distribution with the 2.6.5-1.358 kernel is
installed.
1.5.3 Compilation protocol
This section describes how our software was compiled. The default compilation
flags were chosen because they deliver the best performance in most cases.
Default: All software was written in C and, unless noted below, was compiled
using gcc 3.4.4 with the flags "-O3 -march=nocona -m64" on the Pentium EE,
gcc 3.4.4 with the flags "-O3 -march=opteron -m64" on the Opteron, the Sun Studio
9 compilers [85] with the flags "-xO3 -xarch=v9b" on the UltraSPARC III, and
gcc 3.3.3 with the flags "-O3 -march=pentium4" on the Pentium 4.
SACLIB: The SACLIB 3.0 (Beta) [77] library was used on the Pentium EE,
Opteron, and Pentium 4 machines. The SACLIB 2.1 library [22] was used on the
UltraSPARC III machine and was compiled with the Sun Studio 9 compilers [85]
with the flags "-xO3". The programs IPRRID and IPRRIDB call SACLIB 3.0
(Beta) or SACLIB 2.1, respectively.
NTL: On the Pentium EE, Opteron, and Pentium 4 machines, NTL 5.4 [82] is
compiled using the compiler flags set by NTL. On the UltraSPARC III, the Sun
Studio 9 compiler with the "-xO3 -xarch=v9b" flags was used. NTL is limited to
32-bit integer arithmetic because of the way it performs multiplication; however, for
compatibility, NTL is compiled to use the 64-bit application binary interface (ABI).
GNU-MP: GNU-MP 4.2 [44] is compiled using the compiler flags as set by
GNU-MP.
SYNAPS: SYNAPS 2.4 [71] is compiled with the default compilers and flags.
SYNAPS required minor porting before it could be compiled with the Sun Studio 9
compiler for the UltraSPARC III platform.
Hanrot et al.: The code of Hanrot et al. [45] is compiled with the default
compilers and flags.
1.5.4 Input polynomials
For testing the tile method for Taylor shift by 1 computation (see Chapter 3),
we use the following two classes of polynomials:
Definition 1.5.1. For any positive integers $n$, $d$ we define the polynomials
$$B_{n,d}(x) = d x^n + d x^{n-1} + \dots + d x + d,$$
$$C_{n,d}(x) = x^n + d.$$
We sometimes refer to $B_{n,d}$ as the "worst case" and $C_{n,d}$ as the "best case"
polynomials, see Theorems 1.3.9 and 1.3.10. In our experiments, $d$ is usually set to
$2^{20} - 1$ or $2^n - 1$. Such fixed-coefficient polynomials require slightly more time for
Taylor shift by 1 computation than random polynomials.
For testing the Descartes methods, we use random polynomials, Chebyshev
polynomials, and Mignotte polynomials. These are commonly used benchmark
polynomials [57] for testing the Descartes method.
1. Random polynomials are integer polynomials with random 20-bit coefficients
or with random n-bit coefficients, where n is the degree. The coefficients
are pseudo-randomly generated from a uniform distribution. We report
computing times for degrees 100, 200, ..., 1000. For random polynomials, the
Descartes method produces recursion trees that typically have few nodes.
2. Chebyshev polynomials are the polynomials defined by the recurrence
relation $T_0(x) = 1$, $T_1(x) = x$, $T_{n+1}(x) = 2x T_n(x) - T_{n-1}(x)$. The roots of
Chebyshev polynomials are well-known values of the cosine function. When
the Descartes method is applied to Chebyshev polynomials, wide recursion trees
with many nodes are obtained. We report computing times for degrees 100,
200, ..., 1000. Since all these degrees are even, the corresponding Chebyshev
polynomials are polynomials in $x^2$. Since, for even $n$, the method by Hanrot et
al. [45] reduces $T_n(x)$ to $T_n(\sqrt{x})$, we apply the same pre-processing step to
the other methods as well. We call the polynomials $T_n(\sqrt{x})$ somewhat
ambiguously "reduced Chebyshev polynomials of degree $n$"; of course, $\deg(T_n(\sqrt{x})) = n/2$.
3. Mignotte polynomials are defined by $x^n - 2(5x - 1)^2$. We are not aware of
any applications that involve Mignotte polynomials; however, the Descartes
method generates extremely deep recursion trees for Mignotte polynomials,
and it requires computing times that are approximately proportional to its
worst-case computing time function. We report computing times for degrees
100, 200, ..., 600.
1.5.5 Performance counter measurements
All modern processors have special hardware counter registers that allow many
hardware events to be measured in real time. We accessed the performance counters
through the CPC library (provided with the Solaris operating system) on the
UltraSPARC III and through the PAPI library [51] on all other platforms. The PAPI
library is convenient because it is portable across most hardware platforms.
Where noted, we also monitored the following events: processor instructions,
branch mispredictions, and cache misses for the L1 data cache, the L1 instruction
cache, and the L2 external cache.
Execution times were computed from the number of processor cycles; for example,
1 cycle corresponds precisely to 1/1000 μs on a 1 GHz machine.
In addition, we used the UNIX getrusage system call [28] on all platforms to
obtain execution time in order to verify hardware counter measurements or when
using hardware counters was inconvenient.
1.5.6 Cache flushing
Before each measurement, we flushed the L1 and L2 data caches by declaring
a large integer array and writing and reading it once [14]. We did not flush the
L1 instruction cache; our measurements show that its impact on performance is
insignificant. We obtained each data point as the average of at least 3 measurements,
unless otherwise noted. The fluctuation within these measurements was usually well
under 1%. We did not remove any outliers, unless noted otherwise.
1.6 Literature review
This section presents a survey of background literature about Taylor shift by
1, register tiling, and their applications including de Casteljau's algorithm and the
Descartes method.
1.6.1 Taylor shift by 1
Recently von zur Gathen and Gerhard [89, 41] compared six different methods
to perform Taylor shifts. The authors distinguish between classical methods and
asymptotically fast methods. When the shift amount is 1, the classical methods
collapse into a single method which computes n(n + l)/2 integer sums where n is
the degree of the input polynomial. Von zur Gathen's and Gerhard's implementation
of classical Taylor shift by 1 simply makes calls to an integer addition routine. We
refer to such implementations as straightforward implementations. There are four
such methods, see Section 2.
The efficiency of straightforward methods depends entirely on the efficiency of
the underlying integer addition routine. Von zur Gathen and Gerhard use the integer
addition routine of NTL [83, 82] in their experiments. In Johnson et al. [59], we used
the GNU-MP [43] addition routine because the data (see Tables 1 and 2 in [59]) imply
that the GNU-MP routine is faster. In fact, NTL documentation [82] suggests using
GNU-MP arithmetic if high performance is desired. See Chapter 2 for more detail
on SACLIB, NTL, and GNU-MP multiprecision addition and the straightforward
methods. See Section 5.4 for a discussion of the impact high-performance GNU-MP
arithmetic has on asymptotically fast methods of Taylor shift by 1.
In Johnson et al. [59], we presented two algorithms that outperform straightforward
implementations of classical Taylor shift by 1. For input polynomials of low
degree, the routine IUPTR1 of the SACLIB library [22] is faster than straightforward
implementations by a factor of at least 2 on the UltraSPARC III platform. The
SACLIB routine IUPTR1 is described in Section 2.4. In addition, we developed a
new, architecture-aware tile method that is faster than straightforward
implementations by a factor of up to 7 on the UltraSPARC III platform. Chapter 3 describes the
tile method, reviews its performance, and extends it, using automatic code generation
and tuning, to the Pentium EE and Opteron platforms with similar performance
results.
It is widely believed that computer algebra systems can obtain high performance
by building on top of basic arithmetic routines that exploit features of the hardware.
It is also believed that only assembly language programs can exploit features of the
hardware. Results reported in this thesis, in Johnson et al. (2005) [59], and in
Johnson et al. (2006) [58] suggest that these tenets are wrong.
1.6.2 Asymptotically fast methods
The tile method [59] is faster than four asymptotically fast methods for Taylor
shift by 1 [89, 41, 97] up to degree 6000 on 4 platforms [58], see Chapter 5. We
are not aware of applications of Taylor shift by 1 for such high degrees. This is
an example of a common gap between theoretical expectations and practical results
from asymptotically fast algorithms. The concern is whether fast algorithms
deliver a performance gain at practical problem sizes that is worth the time
invested in designing and implementing them.
The exploration of the practical usefulness of asymptotically fast algorithms
began shortly after the initial ground-breaking discovery of such algorithms for exact
integer, polynomial, and matrix arithmetic [60, 23, 84]. For instance, it was
discovered early that implementations of the Strassen algorithm for matrix
multiplication [84] do not yield the theoretically expected results but still provide performance
gains for useful, although large, input data sizes [16, 66]. A lower crossover point
was found for the Strassen algorithm on supercomputers [8, 50]. More recently,
crossover points were explored for several finite-field linear algebra algorithms [30].
Filatei et al. discussed their crossover experiments for high-performance
implementations of polynomial arithmetic [36]. Nonetheless, as with the tile method [59],
some computer algebra problems are better solved by a classical approach [79].
The consequences of recent developments in computer architecture (pipelining,
superscalar execution, speculative computation, and multilevel memory hierarchy)
are seldom taken into consideration when designing fast algorithms or seeking
to improve the crossover points. For example, Schönhage [81, 80], Zuras [98], and
Montgomery [69] do not discuss the effect of architecture features on the classical,
Karatsuba, Toom-Cook, and FFT-based integer multiplication algorithms. On the
other hand, Fateman explored L1 cache behavior in his comparison of sparse
polynomial multiplication methods [34].
Automatic tuning for the best crossover point on a particular platform is likewise
uncommon. For example, GNU-MP is the only library we used that has a facility
for automatically determining the crossover points for its multiplication algorithms,
via the included tuneup.c program; see Section 1.6.3 below for more information.
More generally, von zur Gathen and Gerhard point out in their Modern Computer
Algebra text [90] that crossover point determination requires coding
and testing a large variety of algorithms. Previous work comparing asymptotically
fast computer algebra algorithms to their classical counterparts typically does not
take the underlying computer architecture into account. In this thesis, we have
shown that the architecture can dramatically affect performance, and hence should
be taken into account when making these comparisons.
1.6.3 Crossover points for GNU-MP, NTL, and SACLIB
The GNU-MP, NTL, and SACLIB libraries all include asymptotically fast
algorithms. Some algorithms, such as integer and polynomial multiplication, have
crossover points that are within currently useful input data ranges, while others do
not. However, only the GNU-MP package provides a mechanism for automatically
tuning the crossover points to a particular architecture; the SACLIB and NTL
libraries hard-code crossover points determined for outdated platforms.
The GNU-MP 4.2 [44] multiplication and squaring routines call one of four
algorithms: classical (base case), Karatsuba, Toom-3, and FFT-based [60, 98]. The
crossover points for the multiplication algorithms can be determined automatically
using the tuneup.c program included in the library, which is run during installation.
For example, on our Pentium 4 machine the crossover constants for multiplication
and squaring, respectively, were determined to be 18 and 68 words from classical to
Karatsuba, 139 and 108 words to Toom-3, and 5888 and 6400 words to the FFT-based
algorithm. The FFT threshold is quite large. The GNU-MP library does not
implement operations on polynomials.
The NTL [82] library versions 5.0 through 5.4 (the current version) use only the
classical and Karatsuba algorithms for integer multiplication. The library uses hard-coded
crossover points that were estimated for Sparc-10, Sparc-20, and Pentium-90
processors and are set at 16 words for multiplication and at 32 words for squaring.
No attempt is made to pre-tune the crossover point to a particular architecture,
and no tuning method is available to the user. In order to avoid function calls and
loops, NTL multiplication is completely unrolled and optimized for small integers
of length < 3 words.
NTL polynomial multiplication is carried out using a combination of the
classical algorithm, Karatsuba, the FFT using small primes, and the FFT using
the Schönhage-Strassen approach. The choice of algorithm depends on the coefficient
domain. The crossovers for polynomial multiplication are again hard-coded,
in the ZZX1.c source file, to happen at degree 10 < n < 40 for Karatsuba and at
degree 80 < n < 150 for the FFT approaches; the exact crossover depends on the
max-norm $|A|_\infty$ of the polynomial operands, see Definition 1.3.5.

SACLIB [22] integer multiplication uses the Karatsuba approach, with crossover
from classical multiplication at a length of 14 words. SACLIB polynomial
multiplication, however, does not use the Karatsuba algorithm.
1.6.4 Register tiling
Register tiling is an instance of loop tiling, a well-known loop transformation used by high-performance compilers to improve the utilization of the memory hierarchy and the superscalar features of the processor. Register tiling groups the operands, loads them into machine registers, and operates on them utilizing ILP without repeatedly referencing memory. This achieves a substantial performance improvement [7, 53, 54, 55].
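The idea can be illustrated, outside the thesis's setting, with the classic matrix multiplication micro-kernel. In the tiled variant a 2 x 2 tile of the result is held in scalar variables that the compiler can keep in registers for the whole inner loop; the names and the matrix size are our own choices for this sketch:

```c
#include <assert.h>

enum { N = 4 };  /* small matrix dimension, assumed even for the sketch */

/* Naive triple loop: every partial result lives in memory semantics. */
static void matmul_naive(int a[N][N], int b[N][N], int c[N][N])
{
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            int s = 0;
            for (int k = 0; k < N; k++)
                s += a[i][k] * b[k][j];
            c[i][j] = s;
        }
}

/* Register-tiled variant: a 2x2 tile of C is held in scalar variables
 * (candidates for machine registers) across the whole k-loop, so each
 * accumulator is written back to memory only once, and each load of a
 * row/column element feeds two multiply-adds. */
static void matmul_tiled(int a[N][N], int b[N][N], int c[N][N])
{
    for (int i = 0; i < N; i += 2)
        for (int j = 0; j < N; j += 2) {
            int c00 = 0, c01 = 0, c10 = 0, c11 = 0;
            for (int k = 0; k < N; k++) {
                int a0 = a[i][k], a1 = a[i + 1][k];   /* 4 loads ...      */
                int b0 = b[k][j], b1 = b[k][j + 1];   /* ... feed 4 MACs  */
                c00 += a0 * b0; c01 += a0 * b1;
                c10 += a1 * b0; c11 += a1 * b1;
            }
            c[i][j] = c00;     c[i][j + 1] = c01;
            c[i + 1][j] = c10; c[i + 1][j + 1] = c11;
        }
}
```

The tiled kernel performs the same arithmetic as the naive one; only the ratio of memory references to additions changes.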
There are many publications about register tiling due to the widespread use of
the technique. For example, an early one described improving register assignment
to subscripted variables in loops [15]. Another one suggested changing the shape
of the tiles for better processor utilization on multiprocessor platforms [47, 48].
Marta Jimenez et al. explored register tiling [53, 54] and introduced a cost-effective
algorithm to compute exact loop bounds for nonrectangular iteration spaces [55].
While register tiling is conceptually uncomplicated, its implementation for nonrectangular iteration spaces is problematic [53, 54, 55]. In fact, it is not clear how the technique can be applied to computations with multi-word integers such as classical Taylor shift by 1. There are no publications about register tiling of multiprecision addition. Without the introduction of signed digits, suspended normalization, radix reduction, and delayed carry propagation, we would not be able to tile classical Taylor shift by 1, see Chapter 3. Without these domain-specific transformations, standard compilers would be unable to tile the code.
1.6.5 Compilers and automatic code generation and tuning
Current compilers such as those from GNU, Intel, and Sun Microsystems [40, 24, 85], however efficient, typically cannot generate code that is more efficient than hand-tuned code [94]. This is true even for a simple kernel like matrix multiplication. There are many techniques for transforming high-level programs into programs that run efficiently on modern high-performance architectures, such as linear loop transformations, loop tiling [61], and loop unrolling [7] for enhancing locality and parallelism. There are also many methods for estimating optimal values for the parameters associated with these transformations, such as tile sizes [54] and loop unroll factors [7]. Manual optimization, however, still remains the best method for achieving high performance [37].
Manual optimization, however, can be automated, i.e., human participation can be reduced or even eliminated from the process. The process of writing and timing several versions of a particular program or algorithm can be replaced with automatic code generation and search using empirical run times or a performance model. The programmers writing the generator may also use their architectural insights and domain knowledge to limit the number of versions that are automatically generated and evaluated.
Self-adapting code has been developed to automatically generate and optimize the implementation of important classes of algorithms [70]. A number of recent projects such as FFTW [38], ATLAS [1, 92, 93], and SPIRAL [76] include an automatic code generator and evaluator. These library generators produce much better code than native compilers do on modern high-performance architectures. Thus, code generation with automatic tuning has become state-of-the-art.
1.6.6 Real root isolation
A primary application for Taylor shift by 1 is polynomial real root isolation, which spends nearly all its computing time performing the operation. Some years after Collins and Akritas [21] proposed an algorithm for polynomial real root isolation, Lane and Riesenfeld [64] presented a variant of the algorithm that uses Bernstein bases instead of monomial bases. Both methods proceed recursively and use the Descartes rule of signs as a termination criterion. For any input polynomial, the two methods compute the same isolating intervals since they generate the same recursion tree. The recursion tree was analyzed by several authors, most recently by Krandick and Mehlhorn [63].
The monomial variant of the Descartes method was initially analyzed by Uspensky [87], Ostrowski [74], Collins and Akritas [21], and Collins and Loos [18]. Collins and Johnson [17] showed that the computing time is dominated by n^6. Johnson [56, 57] later realized that a root separation theorem by Davenport could be used to reduce the computing time bound to n^5; his proof contained a gap that was removed by Krandick [62]. The Bernstein variant of the Descartes method was analyzed as n^6 by Mourrain, Vrahatis and Yakoubsohn [72]; the analysis was later repeated by Basu, Pollack, and Roy [9] and the result restated by Mourrain, Rouillier and Roy [73]. Basu, Pollack and Roy improved their analysis in the second edition of their book [10]. Eigenwillig, Sharma, and Yap recently showed that the computing time for the Bernstein variant is also n^5 [31].
The computing times of the monomial and the Bernstein variants of the Descartes method have never been compared empirically and fairly. To our knowledge, no published work uses modern hardware profiling methods and architecture-aware optimization techniques to compare the two variants. The technique of register tiling that makes classical Taylor shift by 1 efficient [59] carries over to de Casteljau's algorithm, a fundamental method in computer-aided design [33]. We are not aware of any architecture-aware implementation of de Casteljau's algorithm. De Casteljau's algorithm is also the main subalgorithm of the Bernstein variant of the Descartes method. Using high-performance implementations of the two algorithms would yield a fairer comparison between the two variants of the Descartes method. The results are presented in Section 6.2.3.
1.6.7 Notes on experimental methodology
All current processors allow the user to monitor a wide range of hardware events. Such hardware counters can be used for precise real run-time measurements of performance metrics such as processor cycles, pipeline stalls, cache behavior, and branch misprediction. The counters can be used for tuning, compiler optimization, debugging, benchmarking, monitoring, and performance modeling. While these techniques are becoming widely used [96, 95], we did not find any computer algebra papers that use performance counter measurements apart from the papers by Richard Fateman [34, 35].
2. Straightforward implementations
In this chapter we discuss straightforward implementations of classical Taylor
shift by 1 as well as the multiprecision addition routines they call.
2.1 Introduction
We call an implementation of classical Taylor shift by 1 straightforward if it uses a generic integer addition routine to compute one of the following sequences of the intermediate results of Definition 1.3.2.
1. Horner's scheme—descending order of output coefficients
(a_{0,0}, a_{0,1}, a_{1,0}, a_{0,2}, a_{1,1}, a_{2,0}, ..., a_{0,n}, ..., a_{n,0}).
2. Horner's scheme—ascending order of output coefficients
(a_{0,0}, a_{1,0}, a_{0,1}, a_{2,0}, a_{1,1}, a_{0,2}, ..., a_{n,0}, ..., a_{0,n}).
3. Synthetic division—ascending order of output coefficients
(a_{0,0}, a_{1,0}, ..., a_{n,0}, a_{0,1}, ..., a_{n-1,1}, ..., a_{0,n}).
for i = 0, ..., n:  b_i ← a_i
assertion: b_i = a_{n-i,0}
for j = 0, ..., n-1:
    for i = n-1, ..., j:  b_i ← b_i + b_{i+1}
    assertion: b_i = a_{n-i,j+1}
Figure 2.1: The straightforward method we consider uses integer additions to compute the elements of the matrix in Figure 1.1 from top to bottom and from left to right.
4. Descending order of output coefficients
(a_{0,0}, a_{0,1}, ..., a_{0,n}, a_{1,0}, ..., a_{1,n-1}, ..., a_{n,0}).
Von zur Gathen and Gerhard use method (1) [42]. The computer algebra system Maple [65, 68, 67], version 9.01, uses method (3) in its PolynomialTools[Translate] function. In methods (3) and (4) the output coefficients appear earlier in the sequence than in the other methods. The computing times of the four methods are very similar; they typically differ by less than 10%.
In our experiments we use method (3) to represent the straightforward methods; Figure 2.1 gives the pseudocode. For addition we use the faster GNU-MP [43] addition, unless noted otherwise.
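For coefficients that fit into one machine word, method (3) reduces to two nested loops of single-word additions. The following sketch uses our own indexing (a[i] holds the coefficient of x^i, in ascending order) and ordinary word arithmetic in place of the multiprecision addition routine:

```c
#include <assert.h>

/* Sketch of the straightforward method (3), synthetic division in
 * ascending order, for single-word coefficients. The real routines
 * call a multiprecision addition instead of the built-in "+".
 * a[0..n] holds the coefficients of A(x); on return it holds the
 * coefficients of B(x) = A(x + 1). After pass j, a[j] is final. */
static void taylor_shift_1(long a[], int n)
{
    for (int j = 0; j < n; j++)           /* one column of the triangle */
        for (int i = n - 1; i >= j; i--)  /* a_i += a_{i+1}             */
            a[i] += a[i + 1];
}
```

Applied to the degree-4 example of Section 3.3.2, A(x) = 148x^4 - 192x^3 - 33x^2 + 15x + 3, this produces the coefficients of 148x^4 + 400x^3 + 279x^2 - 35x - 59.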
We review the GNU-MP [43] and NTL [83, 82] addition next. A review of the SACLIB library [22] Taylor shift by 1 routine follows and concludes this chapter.
2.2 GNU-MP addition
The GNU Multiple Precision Arithmetic Library [43, 44] represents integers in sign-length-magnitude representation. On the Pentium EE, Opteron, and UltraSPARC III platforms we have the package use the radix β = 2^64. On the Pentium 4 platform the package is set to use the radix β = 2^32.
The GNU-MP addition routine mpn_add_n is written in highly optimized, hand-crafted assembly code for most platforms. We present GNU-MP addition on UltraSPARC III as an example.
GNU-MP addition on UltraSPARC III
Let n be a non-negative integer, and let u = u_0 + u_1 β + ... + u_n β^n, where 0 ≤ u_i < β for all i ∈ {0, ..., n} and u_n ≠ 0. The magnitude u is represented as an array u of unsigned 64-bit integers such that u[i] = u_i for all i ∈ {0, ..., n}. Let v = v_0 + v_1 β + ... + v_n β^n be a magnitude of the same length. The routine mpn_add_n is designed to add u and v in n + 1 phases of 4 cycles each. Phase i computes the carry-in c_{i-1} and the result digit r_i = (u_i + v_i + c_{i-1}) mod β. Figure 2.2 gives a high-level description of the routine; all logical operators in the figure are bit-wise operators. The UltraSPARC III has two integer execution units (IEU1, IEU2) and one memory management unit (MMU). The GNU-MP addition routine adds each pair of 64-bit words in a phase that consists of 4 machine cycles. Digit additions are performed modulo β = 2^64; carries are reconstructed from the leading bits of the operands and the result. In each set of four successive phases, the operation
address computes new offset addresses for u_{i+1}, v_{i+1}, and r_{i+1}, respectively, during the first three phases; in the fourth phase, the operation address is replaced by a loop control operation. The routine consists of 178 lines of assembly code. In-place addition can be performed. Whenever the sum does not fit into the allocated result array, GNU-MP allocates a new array that is just large enough to hold the sum.

        cycle 1                             cycle 2        cycle 3                  cycle 4
IEU1    a ← (u_{i-1} ∨ v_{i-1}) ∧ ¬r_{i-1}  a ← a ∨ b      c_{i-1} ← ⌊a / 2^63⌋     r_i ← b + c_{i-1} mod β
IEU2    b ← u_{i-1} ∧ v_{i-1}               address        b ← u_i + v_i mod β      —
MMU     load u_{i+3}                        load v_{i+3}   store r_{i-1}            —

Figure 2.2: The GNU-MP assembly addition routine for the UltraSPARC III platform.
Gaudry's patch
In May 2006 we learned about a patch to the GNU-MP assembly routines for the AMD64 architecture [97, 39]. The new assembly routines provide substantial speedups for the GNU-MP addition, subtraction, and multiplication routines. We confirmed that GNU-MP addition is approximately 2x faster with the patch. Figure 2.3 illustrates the performance improvement offered by Gaudry's patch for the straightforward method of Taylor shift by 1.
2.3 NTL addition
The NTL library [83, 82] represents integers using a sign-length-magnitude rep
resentation similar to the one GNU-MP uses. But while GNU-MP allows the digits
to have word-length, NTL-digits have 2 bits less than a word. As opposed to GNU-
MP, NTL needs 1 bit of the word to absorb the carry when it adds two digits. This
44
[Plot: Speedup due to Gaudry patch on AMD Opteron (straightforward method of Taylor shift by 1); speedup between 1.0 and 2.0 plotted against degree up to 10000.]
Figure 2.3: Performance gain due to Gaudry's patch for the straightforward method.
explains why NTL-digits are 1 bit shorter than GNU-MP-digits. Another bit is lost for the following reason. While GNU-MP represents an integer as a C-language struct, NTL represents it as an array and uses the first array element to represent the signed length of the integer. Since all array elements are of the same type, NTL-digits are signed as well, even though their sign is never used. Finally, due to its way of performing multiplications, NTL cannot take full advantage of a 64-bit word length. In our experiments on all four platforms the NTL radix was 2^30. The NTL addition routine ntl_zadd consists of 113 lines of C++ code.
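Why the spare bit matters can be shown with a small C sketch. The digit width mirrors NTL's 30-bit choice, but the code and names are ours, not ntl_zadd's: because a digit sum plus carry-in still fits in the signed word, the carry can be read off with a shift instead of being reconstructed from leading bits:

```c
#include <assert.h>

#define NBITS 30              /* digits are 2 bits shorter than a 32-bit word */
#define RADIX (1L << NBITS)   /* 2^30 */

/* Sketch of digit addition with a spare carry bit: u, v, cin are digits
 * with 0 <= u, v < RADIX and cin in {0, 1}. The raw sum is at most
 * 2^31 - 1 and cannot overflow the signed word, so the carry-out is
 * simply bit 30 of the sum. */
static long add_30bit_digit(long u, long v, long cin, long *cout)
{
    long s = u + v + cin;      /* cannot overflow */
    *cout = s >> NBITS;        /* 0 or 1 */
    return s & (RADIX - 1);    /* result digit */
}
```

With full word-length digits, as in GNU-MP, the same sum could wrap around, which is why GNU-MP's assembly routine must recover the carry from the operands' leading bits instead.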
2.4 The SACLIB method
The SACLIB library of computer algebra programs [22] performs classical Taylor shift by 1 using the routine IUPTR1. The routine, consisting of 144 lines of C code, was originally written by G. E. Collins for the SAC-2 computer algebra system [19]. The method implements its own addition scheme, uses its own data structure, and does not call an external addition routine. The method is faster than the NTL- and GNU-MP-based straightforward methods for polynomials of small degrees [59].
SACLIB represents integers with respect to a radix β that is a positive power of 2. In our experiments, we set β = 2^62 on the Pentium EE, Opteron, and UltraSPARC III platforms and β = 2^29 on the Pentium 4 platform. Integers a such that −β < a < β are called β-digits and are represented as variables of type int or long long. Integers a such that a ≤ −β or β ≤ a are represented as lists (d_0, ..., d_h) of β-digits with a = Σ_{i=0}^{h} d_i β^i, where d_h ≠ 0 and, for i ∈ {0, ..., h}, d_i ≤ 0 if a < 0 and d_i ≥ 0 if a > 0.
SACLIB adds integers of opposite signs by adding their digits. None of these digit additions produces a carry. The result is a list (d_0, ..., d_h) of β-digits that may be 0 and that may have different signs. If not all digits are 0, the non-zero digit of highest order has the sign s of the result. The digits whose sign differs from s are adjusted in a step called normalization. The normalization step processes the digits in ascending order. Digits are adjusted by adding s · β and propagating the carry −s.
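The normalization step can be sketched in C. This is our illustration with a tiny radix β = 8, not SACLIB's code, and it assumes |d_i| < β for all digits and that the adjustments do not flip the sign of the leading digit:

```c
#include <assert.h>

#define BETA 8   /* tiny radix for illustration; SACLIB uses 2^29 or 2^62 */

/* Sketch of SACLIB-style normalization of a signed-digit list d[0..h]:
 * s is the sign of the highest-order nonzero digit; every lower digit
 * whose sign differs from s is adjusted by adding s*BETA, and the
 * carry -s is propagated to the next digit. */
static void normalize(int d[], int h)
{
    int s = 0;
    for (int i = h; i >= 0 && s == 0; i--)   /* sign of the result */
        if (d[i] != 0)
            s = d[i] > 0 ? 1 : -1;
    if (s == 0)
        return;                              /* all digits are zero */
    for (int i = 0; i < h; i++)
        if (d[i] != 0 && (d[i] > 0) != (s > 0)) {
            d[i] += s * BETA;                /* adjust the digit ...   */
            d[i + 1] -= s;                   /* ... and carry -s along */
        }
}
```

For example, the digit list (5, 3, −1) to the radix 8 represents 5 + 3·8 − 64 = −35; normalization turns it into (−3, −4, 0), which represents −3 − 4·8 = −35 with all nonzero digits negative.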
The routine IUPTR1 performs the Taylor shift by 1 of a polynomial of degree n and max-norm d by performing the n(n + 1)/2 coefficient additions without normalizing after each addition. A secondary idea is to eliminate the loop control for each coefficient addition. To do this the program first computes the bound n + L(d) of Remark 1.3.7 for the binary length of the result coefficients. The program determines the number k of words required to store n + L(d) bits. The program then copies
Step4: /* Apply synthetic division. */
    m = k * (n + 1);
    for (h = n; h >= 1; h--) {
        c = 0;
        m = m - k;
        for (i = 1; i <= m; i++) {
            s = P[i] + P[i + k] + c;
            c = 0;
            if (s >= BETA)       { s = s - BETA; c = 1; }
            else if (s <= -BETA) { s = s + BETA; c = -1; }
            P[i + k] = s;
        }
    }
Figure 2.4: Taylor shift by 1 in SACLIB.
the polynomial coefficients, in ascending order and in ascending order of digits, into an array that provides k words for each coefficient; the unneeded high-order words of each coefficient are filled with the value 0. This results in an array P of k(n + 1) entries such that, for i ∈ {0, ..., k(n + 1) − 1} and i = qk + r with 0 ≤ r < k, P[i + 1] = a_{n−q,r}, where Σ_{r=0}^{k−1} a_{n−q,r} β^r is the coefficient of x^{n−q} in the input polynomial. After these preparations the Taylor shift can be executed using just the two nested loops of Figure 2.4. The principal disadvantage of the method is the cost of adding many zero words due to padding. This makes the method impractical for large inputs. Also, the carry computation generates branch mispredictions.
3. The tile method
3.1 Introduction
In Johnson et al. [59], we presented a new version of the Taylor shift by 1 algorithm. The introduction of signed digits, suspended normalization, radix reduction, and delayed carry propagation enables our algorithm to take advantage of the register tiling technique for multiprecision addition. Register tiling, an optimization method commonly used by high-performance compilers, groups the operands, loads them into machine registers, and operates on the operands without referencing the memory [7, 54]. We call our method the tile method.
The new register tile method for Taylor shift by 1 outperforms the straightforward methods by reducing the number of cycles per word addition. We reduce the number of carry computations by using a smaller radix and allowing carries to accumulate inside a computer word. Further, we reduce the number of read and write operations by performing more than one word addition once a set of digits has been loaded into registers. This requires changing the order of operations; only certain digits of the intermediate integer results a_{i,j} in Definition 1.3.2 are computed in one step. We perform only additions; signed digits implicitly distinguish between addition and subtraction. The new algorithm was written in a high-level language. The tile method routine consists of 275 hand-written lines of C code. In addition, we developed a code generator to automatically unroll and schedule some parts of the code, which further improves performance, see Section 3.5.
3.2 Description of the algorithm
We partition the set of indices I_n of Definition 1.3.2 as shown in Figure 3.1 (a).
Definition 3.2.1. Let n, b be positive integers. For non-negative integers i, j let
    T_{i,j} = {(h,k) ∈ I_n | ⌊h/b⌋ = i ∧ ⌊k/b⌋ = j},
and let T be the set of non-empty sets T_{i,j}.
Remark 3.2.2. The set T is a partition of the set of indices I_n; some elements of T can be interpreted as squares of side length b, others as triangles and pentagons.
Definition 3.2.3. Let T_{i,j} ∈ T. The sets of input indices to T_{i,j} are
    N_{i,j} = {(h,k) ∈ I_n | h = ib − 1 ∧ ⌊k/b⌋ = j},
    W_{i,j} = {(h,k) ∈ I_n | ⌊h/b⌋ = i ∧ k = jb − 1}.
The sets of output indices for T_{i,j} are
    S_{i,j} = {(h,k) ∈ I_n | h = ib + b − 1 ∧ ⌊k/b⌋ = j},
    E_{i,j} = {(h,k) ∈ I_n | ⌊h/b⌋ = i ∧ k = jb + b − 1}.
Remark 3.2.4. Clearly, N_{i,j} = S_{i−1,j} and W_{i,j} = E_{i,j−1} whenever these sets are defined.
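The partition of Definition 3.2.1 and the border identities of Remark 3.2.4 are easy to make concrete in C; the helper names below are ours, introduced only for illustration:

```c
#include <assert.h>

/* Index (h,k) of Pascal's triangle belongs to the tile T_{i,j} with
 * i = floor(h/b) and j = floor(k/b); for non-negative h, k this is
 * plain integer division. */
static void tile_of(int h, int k, int b, int *i, int *j)
{
    *i = h / b;
    *j = k / b;
}

/* (h,k) lies on the south output row of T_{i,j} iff h == i*b + b - 1,
 * and on the north input row of T_{i,j} iff h == i*b - 1. Remark 3.2.4
 * says these coincide one tile apart: S_{i-1,j} = N_{i,j}. */
static int on_south_row(int h, int i, int b) { return h == i * b + b - 1; }
static int on_north_row(int h, int i, int b) { return h == i * b - 1; }
```

For b = 8, for instance, row h = 7 is simultaneously the south output row of the tiles T_{0,j} and the north input row of the tiles T_{1,j}.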
Definition 3.2.5. Let a_{h,k} be one of the intermediate integer results in Definition 1.3.2, and let β be an integer > 1. We write
    a_{h,k} = Σ_r a_{h,k}^{(r)} β^r, where |a_{h,k}^{(r)}| < β,
and we define, for all i, j,
    N_{i,j}^{(r)} = {a_{h,k}^{(r)} | (h,k) ∈ N_{i,j}}
and, analogously, W_{i,j}^{(r)}, S_{i,j}^{(r)}, and E_{i,j}^{(r)}.

    T_{0,0}  T_{0,1}  T_{0,2}  T_{0,3}
    T_{1,0}  T_{1,1}  T_{1,2}
    T_{2,0}  T_{2,1}
    T_{3,0}

Figure 3.1: a. Tiled Pascal triangle. b. Register tile stack. A register tile is computed for each order of significance. Carries are propagated only along lower and right borders.
Let I = max{i | T_{i,j} ∈ T} and J = max{j | T_{i,j} ∈ T}. The tile method computes, for i = 0, ..., I and j = 0, ..., J − i, the intermediate integer results with indices in S_{i,j} ∪ E_{i,j} from the intermediate integer results with indices in N_{i,j} ∪ W_{i,j}. The computation is performed as follows. A register tile computation at level r takes N_{i,j}^{(r)} and W_{i,j}^{(r)} as inputs and performs the additions described in Figure 3.2 (a);
the additions are performed without carry but using a radix B > β. Once the register tile computations have been performed for all levels r, a carry propagation transforms the results into S_{i,j}^{(r)} and E_{i,j}^{(r)} for all levels r. Referring to Figure 3.1 (b) we call the collection of register tile computations for all levels r a register tile stack. The maximum value of r for each stack of index (i,j) depends on the precision of the stack, which we now define.
Definition 3.2.6. The precision L*_{i,j} of the register tile stack with index (i,j) is defined recursively as follows.
    L*_{−1,j} = max({L(a_{h,k}) | (h,k) ∈ N_{0,j}}),
    L*_{i,−1} = max({L(a_{h,k}) | (h,k) ∈ W_{i,0}}),
    L*_{i,j} = max({L*_{i−1,j}, L*_{i,j−1}} ∪ {L(a_{h,k}) | (h,k) ∈ S_{i,j} ∪ E_{i,j}}).
To facilitate block prefetching, we place the input digits to a register tile next to each other in memory. We thus have the following interlaced polynomial representation of the polynomial A(x) of Definition 1.3.2 by the array P. If i is a non-negative integer such that i = q(n + 1) + l and 0 ≤ l < n + 1, then P[i] contains the digit of significance level q of the l-th input coefficient, in the sense of Definition 3.2.5.
Theorem 3.2.7. The computation of a register tile requires at most L(β − 1) + 2b − 2 bits for the magnitude of each intermediate result.
Proof. Let n = 2b − 1, and let B_{n,β−1}(x) be the polynomial defined in Definition 1.5.1. For all (h,k) ∈ T_{0,0} we have 0 ≤ h, k ≤ b − 1. Then, by Theorem 1.3.6, L(a_{h,k}) ≤ L(β − 1) + h + k ≤ L(β − 1) + 2b − 2. □
Theorem 3.2.8. If L(B) ≥ L(β − 1) + 2b − 2 and 1 bit is available for the sign, then the tile method is correct.
Remark 3.2.9. The UltraSPARC III has a 64-bit word. We let b = 8, β = 2^49, and B = 2^63. For other platforms, see Section 3.5.
3.3 Properties of the algorithm
Theorem 3.3.1. The tile method has the following properties:
1. Assuming the straightforward method must read all operands from memory and write all results to memory, the tile method will reduce memory reads by a factor of b/2 and memory writes by a factor of b/4.
2. Given a processor architecture capable of concurrent execution of 2 integer instructions and 1 memory reference instruction, with a memory reference latency of at least 2 cycles, a b × b register tile computation takes at least b^2/2 + 7 processor cycles.
Proof. (1) Obvious. (2) In the register tile, the addition at the SE-corner must follow the other b^2 − 1 additions, and the addition at the NW-corner must precede all other additions. The first addition requires two summands in registers, which takes at least 3 cycles for the first summand and 1 more cycle for the second summand. The last sum needs to be written to two locations; the first write requires 3 cycles and the second 1 more cycle. Since we can perform the other b^2 − 2 additions in (b^2 − 2)/2 cycles, the register tile will take at least 3 + 1 + (b^2 − 2)/2 + 1 + 3 = b^2/2 + 7 cycles. □
Figure 3.2: a. A scheduled 8 × 8 register tile. Arrows represent memory references, "+" signs represent additions. Numbers represent processor cycles. The 2 integer execution units (IEUs) perform 2 additions per cycle. b. Register tile: a sketch of the proof of Theorem 3.3.1.
The 8 × 8 register tile computation should take at least 8^2/2 + 7 = 39 processor cycles, see Figure 3.2 (b). With the code for the register tile unrolled and manually scheduled, the code compiled with the Sun Studio 9 C compiler and the optimization options -fast -xchip=ultra3 -xarch=v9b required 53 cycles. When the compiler was used to schedule the unrolled code, the computation required 63 cycles. Also see Section 4.4 for more recent measurements.
3.3.1 Register tile schedule
An example of an optimal schedule for the UltraSPARC II and III processors is presented in Table 3.1. Except for the initial and final additions, all additions within the register tile are scheduled in pairs to fully utilize the UltraSPARC III processor's two IEUs. The schedule assumes that loads and stores have a 3-cycle latency. According to the schedule, it takes 40-41 cycles to execute one register tile on the processor. Most modern processors are capable of at least two integer operations
   0         0     0     0     0
 148  →    148   148   148   148   148
−192  →    −44   104   252   400
 −33  →    −77    27   279
  15  →    −62   −35
   3  →    −59
Figure 3.3: Pascal's triangle for the example polynomial. The coefficients of the polynomial and the elements of the triangle are in decimal representation.
per processor cycle and will yield similar schedules. A similar schedule was used in
all our experiments with the tile method.
3.3.2 An example
In order to illustrate the register tile stack, we provide an example of a register-tiled Taylor shift by 1 computation for a small polynomial of degree 4. Let
    A(x) = 148x^4 − 192x^3 − 33x^2 + 15x + 3
be the input polynomial. Then
    B(x) = A(x + 1) = 148x^4 + 400x^3 + 279x^2 − 35x − 59
will be the output polynomial. Pascal's triangle for the example computation is provided in Figure 3.3.
For the purpose of this example, let us use octal representation for the integer coefficients, i.e., the radix will be β = 8. Let us further assume that the CPU
Cycle   MMU               IEU1      IEU2
  1     load a1
  2     load z1
  3     load a2
  4     load z2
  5     load z3           z1+=a1
  6     load z4           z2+=z1    z1+=a2
  7     load z5           z3+=z2    z2+=z1
  8     load z6           z4+=z3    z3+=z2
  9     load z7           z5+=z4    z4+=z3
 10     load z8           z6+=z5    z5+=z4
 11     load a3           z7+=z6    z6+=z5
 12     load a4           z8+=z7    z7+=z6
 13     store a1 (=z8)    z1+=a3    z8+=z7
 14     store a2 (=z8)    z2+=z1    z1+=a4
 15                       z3+=z2    z2+=z1
 16                       z4+=z3    z3+=z2
 17                       z5+=z4    z4+=z3
 18     load a5           z6+=z5    z5+=z4
 19     load a6           z7+=z6    z6+=z5
 20                       z8+=z7    z7+=z6
 21     store a3 (=z8)    z1+=a5    z8+=z7
 22     store a4 (=z8)    z2+=z1    z1+=a6
 23                       z3+=z2    z2+=z1
 24                       z4+=z3    z3+=z2
 25                       z5+=z4    z4+=z3
 26     load a7           z6+=z5    z5+=z4
 27     load a8           z7+=z6    z6+=z5
 28                       z8+=z7    z7+=z6
 29     store a5 (=z8)    z1+=a7    z8+=z7
 30     store a6 (=z8)    z2+=z1    z1+=a8
 31     store z1          z3+=z2    z2+=z1
 32     store z2          z4+=z3    z3+=z2
 33     store z3          z5+=z4    z4+=z3
 34     store z4          z6+=z5    z5+=z4
 35     store z5          z7+=z6    z6+=z5
 36     store z6          z8+=z7    z7+=z6
 37     store a7 (=z8)    z8+=z7
 38     store a8 (=z8)
 39     store z7
 40     store z8

Table 3.1: An optimal instruction schedule for the 8 × 8 register tile for the UltraSPARC III processor.
        Decimal    Octal digits, level of significance
                      0     1     2
a_4        148        4     2     2
a_3       −192        0     0    −3
a_2        −33       −1    −4     0
a_1         15        7     1     0
a_0          3        3     0     0
Table 3.2: An example polynomial with its coefficients in the interlaced representation.
hardware registers be 6 bits wide, i.e., the addition in the hardware will be performed using the radix B = 2^5, i.e., L(B) = 5 (allowing 1 bit for the sign). Therefore, there will be a space of 2 bits for storing the accumulating carries during addition performed without carry propagation, see Theorem 3.2.7.
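The final carry-propagation step, which turns a stack of radix-B accumulators back into radix-β digits, can be sketched in C; the code and the function name are ours, written for the example radix β = 8, and truncated integer division keeps the sign so that signed digits are handled uniformly:

```c
#include <assert.h>

#define BETA 8   /* the example radix; the real code uses beta = 2^49 etc. */

/* Sketch of delayed carry propagation: d[i] is the accumulated digit at
 * level of significance i, possibly outside the range (-BETA, BETA)
 * because carries were allowed to pile up. Each digit is reduced
 * modulo BETA and the quotient is carried to the next level. */
static void propagate(int d[], int len)
{
    int carry = 0;
    for (int i = 0; i < len; i++) {
        int t = d[i] + carry;
        d[i] = t % BETA;   /* C99 truncation: remainder keeps t's sign */
        carry = t / BETA;
    }
}
```

A concrete instance from the example: the triangle entry 252 of Figure 3.3 accumulates the digit stack (12, 6, 3) at levels 0, 1, 2, since 12 + 6·8 + 3·64 = 252; propagation turns it into the radix-8 digits (4, 7, 3), as in Figures 3.10 through 3.12.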
The octal digits of the example polynomial A(x) are presented in Table 3.2. The sign of the integers is embedded in the digits as illustrated. The digits are all normalized, i.e., all nonzero digits of a coefficient have the same sign, namely the sign of the coefficient.
Figures 3.4, 3.5, and 3.6 show Pascal's triangles for Taylor shift by 1 computed for the digits of the levels of significance 0, 1, and 2, respectively. See also Figure 1.1. Some elements (e.g., 21, 23, 16, 8) in the triangles are digits to the radix B and not β because the carries were not propagated. In addition, some of the elements are not normalized, i.e., the integers are represented with digits of different signs.
Let us choose the 3 × 3 areas in Pascal's triangles as illustrated in Figures 3.7, 3.8, and 3.9. These blocks are register tiles of size 3 × 3, and together they form a
  0        0    0    0    0
  4  →     4    4    4    4    4
  0  →     4    8   12   16
 −1  →     3   11   23
  7  →    10   21
  3  →    13
Figure 3.4: Pascal's triangle for the level of significance 0.
  0        0    0    0    0
  2  →     2    2    2    2    2
  0  →     2    4    6    8
 −4  →    −2    2    8
  1  →    −1    1
  0  →    −1
Figure 3.5: Pascal's triangle for the level of significance 1.
  0        0    0    0    0
  2  →     2    2    2    2    2
 −3  →    −1    1    3    5
  0  →    −1    0    3
  0  →    −1   −1
  0  →    −1
Figure 3.6: Pascal's triangle for the level of significance 2.
 4    4    4
 4    8   12
 3   11   23
Figure 3.7: Non-normalized register tile for the level of significance 0.
 2    2    2
 2    4    6
−2    2    8
Figure 3.8: Non-normalized register tile for the level of significance 1.
register tile stack of height 3, see Figure 3.1 (b).
Carry propagation is performed only on the two relevant sides of the register tile stack, as illustrated in Figures 3.10, 3.11, and 3.12. After the carry propagation the integers are still not normalized and contain digits of different signs (see the lower left elements in Figures 3.10, 3.11, and 3.12).
The full output polynomial is likewise non-normalized, as shown in Table 3.3, and is normalized before the Taylor shift computation completes, as shown in Table 3.4. The lower right corner of the register tile stack already contains the output coefficient a_2 shown in Table 3.4.
 2    2    2
−1    1    3
−1    0    3
Figure 3.9: Non-normalized register tile for the level of significance 2.
           4
           4
 3    3    7
Figure 3.10: Normalized register tile for level of significance 0 after carry propagation; now the radix (3 = 8.
           2
           7
−2    3    2
Figure 3.11: Normalized register tile for level of significance 1 after carry propagation; now the radix (3 = 8.
           2
           3
−1    0    4
Figure 3.12: Normalized register tile for level of significance 2 after carry propagation; now the radix (3 = 8.
        Decimal    Octal digits, level of significance
                      0     1     2
a_4        148        4     2     2
a_3        400        0     2     6
a_2        279        7     2     4
a_1        −35        5     3    −1
a_0        −59        5     0    −1
Table 3.3: The output polynomial after Taylor shift by 1 computation with its coefficients not normalized.
        Decimal    Octal digits, level of significance
                      0     1     2
a_4        148        4     2     2
a_3        400        0     2     6
a_2        279        7     2     4
a_1        −35       −3    −4     0
a_0        −59       −3    −7     0
Table 3.4: The output polynomial after Taylor shift by 1 computation with its coefficients normalized.
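The normalized digit decompositions of Tables 3.2 and 3.4 can be reproduced with truncated integer division; in C99 the remainder of `%` carries the sign of the dividend, so all nonzero digits automatically get the sign of the integer. The helper to_signed_octal is our own illustration, not a SACLIB routine:

```c
#include <assert.h>

/* Decompose an integer into len normalized signed octal digits, least
 * significant first; every nonzero digit has the sign of a. */
static void to_signed_octal(int a, int d[], int len)
{
    for (int i = 0; i < len; i++) {
        d[i] = a % 8;   /* C99: remainder has the sign of a */
        a /= 8;         /* truncation toward zero */
    }
}
```

For example, −35 decomposes into (−3, −4, 0), since −3 − 4·8 = −35, matching the a_1 row of Table 3.4.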
3.4 Performance
In the RAM model of computation [6] the tile method is more expensive, with respect to the logarithmic cost function, than the straightforward methods. Indeed, by reducing the radix the tile method increases the number of machine words needed to represent integers and therefore requires more word additions than straightforward implementations. However, modern computer architectures [46, 14] are quite different from the RAM model. In this section, we show that, on the UltraSPARC III architecture, the tile method outperforms straightforward methods by a significant factor, essentially by reducing the number of cycles per word addition. In Section 3.4.7 we compare our computing times with those published by von zur Gathen and Gerhard [89, 41]. In Section 3.5 we show how the code for our method can be automatically generated and tuned for any architecture.
3.4.1 Experimental methodology and platform
For this section, all experimental code was written in C and compiled with the Sun Studio 9 [85] compiler using the -fast -xchip=ultra3 -xarch=v9b optimization options. The GNU-MP package [43] (GMP) library version 4.1.2 was compiled with the Sun Studio 7 compiler and installed using the standard installation but with CFLAGS set to -fast. All experiments were performed on a Sun Blade 2000 workstation, see Sections 1.5.1 and 1.5.2.
3.4.2 Execution time
Figure 3.13 shows the speedup that the SACLIB and tile methods provide with respect to the straightforward method for the input polynomials B_{n,d} with d = 2^20 − 1, see Definition 1.5.1. The tile method is up to 7 times faster than the straightforward method for low degrees and 3 times faster for high degrees. The SACLIB method is up to 4 times faster than the straightforward method for low degrees but slower for high degrees. The speedups are not due to the fact that the faster methods avoid the cost of re-allocating memory as the intermediate results grow. Indeed, pre-allocating memory accelerates the straightforward method by a factor of only 1.25 for degree 50. As the degree increases that factor approaches 1.
In Figure 3.14 the polynomials C_{n,d} (with d = 2^20 − 1, see Definition 1.5.1) reveal a weakness of the tile method. The tile method does not keep track of the individual precisions of the intermediate results a_{i,j} but uses the same precision for all the integers in a tile. The tile stack containing the constant term d of C_{22,d} and C_{25,d} consists of 28 and 3 integers a_{h,k}, respectively. Thus, when the degree stays fixed and d tends to infinity, the tile method becomes slower than the straightforward method
[Figure 3.13 consists of two panels plotting the speedup of the tile and SACLIB methods relative to the straightforward method: one for low degrees (up to 200) and one for high degrees (up to 10000).]
Figure 3.13: The tile method is up to 7 times faster than the straightforward method.
by a constant factor. The figure shows that, even when the degree is small, the constant term d must become extremely large in order to degrade the performance.
3.4.3 Efficiency of addition
Figure 3.15 shows the number of cycles per word addition for the GNU-MP
addition routine described in Section 2.2. In the experiment all words of both
summands were initialized to 2^64 − 1, and the summands were prefetched into the L1
cache. The figure shows that the intended ratio of 4 cycles per word addition is
nearly reached when the summands are very long and fit into the L1 cache; for short
integers GNU-MP addition is much less efficient.
Figure 3.16 shows the number of cycles per word addition for the GNU-MP-
based straightforward Taylor shift described in Section 2.2 and for the tile method
described in Section 3.2; the polynomials B_{n,d} with d = 2^20 − 1 (see Definition 1.5.1)
were used as inputs. For large degrees the methods require about 5.7 and 1.4
cycles per word addition, respectively. Since the tile method uses the radix 2^49
and the straightforward method uses the radix 2^64, the tile method executes about
64/49 ≈ 1.3 times more word additions than the straightforward method. As a
result the tile method should be faster than the straightforward method by a factor
of 5.7/(1.4 · 1.3) ≈ 3.1. The measurements shown in Figure 3.13 agree well with this
expectation.
3.4.4 Memory traffic reduction
Figure 3.17 shows that the tile method reduces the number of memory reads
with respect to the straightforward method by a factor of up to 7. The polynomials
[Two plots: "Performance - degree=22" and "Performance - degree=25", showing speedup relative to the straightforward method versus coefficient length in bits for the tile and SACLIB methods.]
Figure 3.14: For the input polynomials C_{n,d} the tile method computes a register tile stack at the precision required for just the constant term.
[Two plots titled "Efficiency of GNU-MP addition": cycles per word addition (left scale) and cache miss rate (right scale) versus summand integer length in words, for short and long summands.]
Figure 3.15: In GNU-MP addition the ratio of cycles per word addition (left scale) increases with the cache miss rate (right scale).
[Two plots titled "Efficiency of Taylor shift by 1": cycles per word addition versus degree for the tile and straightforward methods, at low and high degrees.]
Figure 3.16: In classical Taylor shift by 1 the tile method requires fewer cycles per word addition than the straightforward method.
B_{n,d} with d = 2^20 − 1 were used as inputs. The number of memory reads in the
GNU-MP-based straightforward method is independent of the compiler since the
implementation relies to a large extent on an assembly language routine. However,
the number of memory reads in the tile method depends on how well the compiler
is able to take advantage of our C-code for the computation of register tiles. The
figure shows that the Sun Studio 9 C-compiler with the option -xO3 -xarch=v9b
works best for the tile method.
3.4.5 Cache miss rates
Figure 3.18 shows the L1 data cache miss rates for the straightforward method
and the tile method; the polynomials B_{n,d} with d = 2^20 − 1 were used as inputs. As
the degree increases the cache miss rate of the straightforward method rises sharply
as soon as the polynomials no longer fit into the cache. The cache miss rate levels
off at about 13%. Indeed, by Section 3.4.1 the block size is 8 words; so, one expects
7 cache hits for each cache miss.
3.4.6 Branch mispredictions
Figure 3.19 shows the number of branch mispredictions per cycle for the straightforward
method and the tile method; the polynomials B_{n,d} with d = 2^20 − 1 were
used as inputs. Since either method produces at most one branch misprediction
every 200 cycles, branch mispredictions do not significantly affect the two methods.
However, the branch misprediction rate of the SACLIB method is 60 times greater
than that of the straightforward method when the degree is high.
[Two plots titled "Memory reads", versus degree, for four compiler configurations: gcc v3.4.2 unoptimized, gcc v3.4.2 -O3, Sun cc v5.6 -xO3 -xarch=v9b, and Sun cc v5.6 -fast -xchip=ultra3 -xarch=v9b.]
Figure 3.17: The tile method substantially reduces the number of memory reads required for the Taylor shift; the extent of the reduction depends on the compiler.
[Two plots titled "Taylor shift by 1": L1 data cache miss rate versus degree for the tile and straightforward methods, at low and high degrees.]
Figure 3.18: For large degrees the tile method has a lower cache miss rate than the straightforward method. Moreover, the number of cache misses generated by the tile method is small because the tile method performs few read operations.
[Two plots titled "Taylor shift by 1": branch mispredictions per cycle versus degree for the tile and straightforward methods, at low and high degrees.]
Figure 3.19: The number of branch mispredictions per cycle is negligible for the tile method and the straightforward method.
          straightforward (NTL-addition)                      straightforward    tile
degree    UltraSPARC [89]  Pentium III [41]  UltraSPARC III   (GMP-add)
   127        0.004            0.001             0.001          0.00076        0.00010
   255        0.019            0.005             0.004          0.00327        0.00046
   511        0.102            0.030             0.016          0.01475        0.00286
  1023        0.637            0.190             0.101          0.08261        0.03183
  2047        4.700            2.447             0.710          0.56577        0.27114
  4095       39.243           22.126             4.958          3.73049        1.97799
  8191          —             176.840           44.200         29.91298       18.48445
Table 3.5: Computing times (s.) for Taylor shift by 1 —"small" coefficients.
3.4.7 Computing times in the literature
Von zur Gathen and Gerhard [89, 41] published computing times for the NTL-based
implementation of the straightforward method described in Section 2.3.
Tables 3.5 and 3.6 quote those computing times and compare the NTL-based
straightforward method with the GNU-MP-based straightforward method and the
tile method.
The computing times we quote were obtained on an UltraSPARC workstation
rated at 167 MHz [89] and on a Pentium III 800 MHz Linux PC; the latter
experiments were performed using the default installation of version 5.0c of NTL [41]. We
installed NTL in the same way on our experimental platform, but while the default
installation uses the gcc compiler with the -O2 option we used the Sun compiler with
the options -fast -xchip=ultra3. This change of compilers sped up the NTL-based
straightforward method by factors ranging from 1.06 to 1.63.
Von zur Gathen and Gerhard ran their program letting k = 7, ..., 13 and n =
          straightforward (NTL-addition)                      straightforward    tile
degree    UltraSPARC [89]  Pentium III [41]  UltraSPARC III   (GMP-add)
   127        0.006            0.002             0.001          0.00096        0.00016
   255        0.036            0.010             0.005          0.00434        0.00099
   511        0.244            0.068             0.029          0.02154        0.00838
  1023        1.788            0.608             0.231          0.17607        0.09183
  2047       13.897            8.068             1.773          1.27955        0.83963
  4095      111.503           65.758            13.878          9.97772        6.27948
  8191          —             576.539          140.630        151.27732       61.04515
Table 3.6: Computing times (s.) for Taylor shift by 1 —"large" coefficients.
2^k − 1 for input polynomials of degree n and max-norm less than n for Table 3.5, and
max-norm less than 2^(n+1) for Table 3.6; the integer coefficients were pseudo-randomly generated.
We used the same input polynomials in our experiments.
The NTL-based straightforward method runs faster on the UltraSPARC III than
on the Pentium III, but the speedup ratios vary. This is likely due to differences
between the processors in cache size and pipeline organization. The computing
time ratios between the NTL- and GNU-MP-based straightforward methods on the
UltraSPARC III are more uniform and range between 0.9 and 1.7. If these computing
time ratios can be explained by the difference in radix size—2^30 for NTL and 2^64
for GNU-MP—then there is no justification for the use of assembly language in the
GNU-MP addition routine. Again, the tile method outperforms the straightforward
methods.
3.5 Automatic code generation
Predicting performance on modern computer platforms is difficult due to their
complexity, see Chapter 4. This challenge calls for automatic code generation and
tuning techniques where several variants of an algorithm are constructed and
assessed for performance without human intervention. Modeling, however, helps to
narrow the search, see Section 4.4.2 [94].
A tile can be described by three parameters: tile size, tile shape, and addition
schedule [59]. The structure of the Taylor shift by 1 computation favors a rectangular
shape for the tiles; however, we performed experiments only with square tiles. More
than one addition schedule is possible. The number of possible schedules depends on
the ILP features of the target processor and increases with the number of available
integer execution units (IEUs). The optimal tile size depends on the number of
registers in the target CPU and the quality with which the CPU schedules memory
operations. We found that the best performing parameter values varied widely
depending on the target CPU.
We have implemented an automatic code generator in Perl [91]. The generator
produces portable ANSI C++ [52, 13] code for tiles of different sizes, compiles and
executes the code, and searches through successively larger tile sizes until the tile size
with the best performance is discovered. A single addition schedule is produced by
the code generator for a CPU with 2 addition units, see next paragraph. The code
generator consists of 1,000 lines of Perl code and produces approximately 11,000
lines of C++ code per platform for each register tile size.
The purpose of the addition schedule was to expose the dependencies of the
computation so that the C++ compiler can then schedule the computation
appropriately for the target architecture. The compilers we tested were unable to infer
these dependencies when the computation is programmed in for loops. We did not
test addition schedules that target CPUs with more than 2 IEUs.
We search over tiles of size n × n for n = 4, 6, 8, 10, 12, 14, and 16. Smaller
tiles do not offer a substantial speedup, and larger tiles would cause register
spilling that would negate the locality-of-reference advantages of the tile method.
The outcome of the tile search is shown in Figures 3.20, 3.21, 3.22, and 3.23. The
figures show the performance gain (speedup) obtained by the tiled version of the Taylor
shift by 1 algorithm on the Pentium EE, Opteron, Pentium 4, and UltraSPARC III
platforms, respectively, for different register tile sizes. The speedup is calculated
with respect to the straightforward GMP-based implementation of Taylor shift by
1, see Section 2.2. The dips in the Opteron and UltraSPARC curves are discussed
in Section 3.5.2. Based on the outcome of the search, we use register tile sizes of
12 × 12 for the Pentium EE and Opteron, 8 × 8 for the UltraSPARC III, and 6 × 6
for the Pentium 4 in our experiments.
3.5.1 Processor utilization
The tile method performs more word additions but offers a substantial gain in
performance [59]. Figures 3.24 and 3.25 show how the processor is utilized
by word additions. The tile method dispatches substantially more word additions
per processor cycle. This is the only cause of the substantial performance
difference between the methods. Register tile sizes were set to 12 × 12 words for the
Pentium EE and Opteron, 8 × 8 words for the UltraSPARC III, and 6 × 6 words for
[Plot: "Impact of register tile size on performance", Pentium EE; speedup versus degree for tile sizes 4x4 through 16x16.]
Figure 3.20: Impact of tile size on the performance of the tile method on the Pentium EE processor. Legend: tile size in words × words.
[Plot: "Impact of register tile size on performance", AMD Opteron; speedup versus degree for tile sizes 6x6 through 16x16.]
Figure 3.21: Impact of tile size on the performance of the tile method on the Opteron processor. Legend: tile size in words × words.
[Plot: "Impact of register tile size on performance", Intel Pentium 4; speedup versus degree for tile sizes 4x4 through 16x16.]
Figure 3.22: Impact of tile size on the performance of the tile method on the Pentium 4 processor. Legend: tile size in words × words.
[Plot: "Impact of register tile size on performance", Sun UltraSPARC III; speedup versus degree for tile sizes 4x4 through 14x14.]
Figure 3.23: Impact of tile size on the performance of the tile method on the UltraSPARC III processor. Legend: tile size in words × words.
the Pentium 4. Initial coefficients were set to 2^n − 1.
We suggest several strategies for improving processor utilization by the tile
method and, hence, its performance:
1. Reschedule the register tile computation to utilize all available IEUs within
the target processor. For example, the Opteron should be able to do 3 additions
per cycle since the processor has 3 IEUs.
2. A larger register file (i.e., more available registers) will allow for larger register
tiles. This would help further reduce the cost of carry propagation.
3. Using a processor with wider registers (i.e., 128-bit vs. 64-bit wide) will
improve performance, as wider registers allow for a larger radix. A larger radix
will shorten the integer coefficients and further reduce the number of additions.
4. It is not known whether the sparse interlaced array representation for
polynomials is a more efficient data structure for the tile method than a
noninterlaced representation. We conjecture that the interlaced representation is better
because it helps avoid cache thrashing and favors automatic block prefetching [59].
The best representation should be determined experimentally.
3.5.2 The 4100 degree irregularity
The observed dip in the speedup and the spike in computing time at degree
4100 for the tile method appear only in the experimental data for the UltraSPARC III and
Opteron processors.
[Plot: "Straightforward method - processor utilization"; word additions per cycle versus degree for the Pentium EE, Opteron with Gaudry's patch, Opteron, Pentium 4, and UltraSPARC III.]
Figure 3.24: Processor utilization in word additions per cycle for the straightforward method.
[Plot: "Tile method - processor utilization"; word additions per cycle versus degree for the Pentium EE, Opteron, Pentium 4, and UltraSPARC III.]
Figure 3.25: Processor utilization in word additions per cycle for the tile method.
On the UltraSPARC III processor, the tile method shows a dramatic increase in
the L1 data cache miss rate at degree 4100. The increase is from 2% to almost 11%.
The UltraSPARC III processor's L1 caches have a block size of 32 bytes, i.e., each
block can contain four 64-bit words [49, 86]. This implies an almost 50% L1 cache miss
rate.
At degree 4100, there is also a significant increase in L1 instruction cache
references and a significant decrease in branch mispredictions. Both of these
observations indicate that instructions are not being fetched or executed efficiently
because the machine is waiting for the memory system to deliver the data.
The irregularity also appears for the GNU-MP addition of integers that are 4100
words long. On the UltraSPARC III processor, there is a spike for q = 2 but not for
q = 3, see Section 4.2. On the Opteron processor (for GNU-MP both with and
without Gaudry's patch), there is a spike for q = 3 and a dip for q = 2. However,
this surprising behavior could be a compiler idiosyncrasy.
This anomaly is likely caused by the low set associativity of the UltraSPARC III
and Opteron L1 data caches, which are 4-way and 2-way set associative,
respectively. The Pentiums have 8-way set associative L1 caches and do not exhibit the
irregularity. Conceivably a number-theoretic reason may cause the cache to exhibit
a substantial increase in conflict misses at this particular degree.
4. Modeling Taylor shift by 1
4.1 Introduction
In this chapter, we present a model for GNU-MP addition and models for the
straightforward and tile methods of Taylor shift by 1 with respect to the target
architecture. We compare the models' predictions to the experimental data.
A performance model is a function of several variables that represent features
and parameters of the target microprocessor architecture: specifically, the number
of available integer execution units (IEUs), the memory management unit (MMU)
latency of the processor, the superscalar capacity of its pipeline, and the architecture
of caches in the processor.
The modeling explains performance advantages of the tile method with respect
to the straightforward method and suggests automatic code generation and tuning,
see Section 3.5.
4.2 A model for GNU-MP addition
The high-performance GNU-MP addition routines are written in assembly code
for most architectures. A generic version is also provided. The assembly code is
highly optimized and aims toward optimal utilization of superscalar features of the
target processor as well as minimization of stall cycles due to memory access delays
(cache misses) and branch mispredictions.
Our model for GNU-MP addition assumes that
1. a certain number of cycles are consumed for each two words (digits) to be
added, i.e., cycles per word addition,
2. a certain number of cycles are spent for the overhead of calling the addition
routine and housekeeping within the routine, and
3. a certain number of cycles may be spent waiting for the memory hierarchy to
deliver the data.
Let C(L) be the number of cycles it takes to add two L-word integers. In our
model, we assume that C(L) = cL + h + πμ(L) is a function of c, h, L, μ(L), and
π, where c is the number of cycles per word addition, h is the combined cost of the
overhead of calling the GNU-MP addition routine and of housekeeping within the
routine, L is the length of the integers in words (or the length of the larger operand),
μ(L) is the expected number of L1 cache misses for a fully associative cache (see below),
and π is the cache miss penalty in cycles. These parameters are explained below.
Experimental data for C(L), however, may be influenced by factors not accounted
for in the model, e.g., particulars of the L1 data cache design (i.e., associativity,
number of ports) or other processes running on the target machine.
Table 4.1 lists the parameters used in modeling GNU-MP addition. The cycles
per word addition c is known via one or more of the following: learned from the available
GNU-MP documentation, measured experimentally, or estimated by studying the
code. The overhead h must be measured empirically because it varies with the target
architecture. In fact, the processor architecture, memory organization, compiler
quality, and compilation flags used all affect the overhead. Table 4.1 provides
values for c and h for two cases: the summands are cleared from or preloaded into
the L1 data cache before running GNU-MP addition. We call these instances "cold"
and "warm" caches, respectively.
Let the L1 cache capacity be K bytes, w the size of the GNU-MP word (digit) in
bytes, λ the L1 cache block size (cache line) in bytes, and q the number of GNU-MP
integer operands. Then, assuming a fully associative L1 data cache, the expected
number of L1 cache misses μ(L) is a piecewise function defined as follows:

    μ(L) = 0                  if qwL ≤ K,
    μ(L) = (qwL − K)/λ        otherwise.
In addition, μ(L) may also be near 0 if the processor is capable of and succeeds
with speculative execution of memory reference instructions, see below. The L1
cache miss penalty π is a parameter determined by the target memory hierarchy
organization and technology and is identical to the L2 cache latency.
In our model and experimental measurements of GNU-MP addition, one of the
summands is also the destination sum, i.e., q = 2. There is virtually no difference
in performance between such in-place addition (two operands) and addition with three
operands for the Pentium EE and UltraSPARC III processors. However, in-place
addition is 5–20% slower for the Pentium 4 and Opteron processors (with and without
Gaudry's patch). We do not consider the L2 cache because the two GNU-MP
integer operands of L < 10,000 words will fit in the L2 cache and cause no misses.
In the straightforward method we use in-place addition; Table 4.1 lists only
those parameters.
Description (Parameter)          Pentium EE   Opteron       Opteron       Pentium 4   UltraSPARC III
                                              (w/ patch)    (w/o patch)
Cycles per word (warm) (c)       11.5         1.667         3             7.25        4.5
Overhead (warm) (h)              170          60            61            200         300
Cycles per word (cold) (c)       11.5         6             5.8           7.5         8.25
Overhead (cold) (h)              325          250           140           375         425
L1 data cache size (K)           16 KB        64 KB                       16 KB       64 KB
L1 data cache block size (λ)     64 bytes     64 bytes                    32 bytes    32 bytes
L1 cache associativity           8-way        2-way                       8-way       4-way
L1 cache miss penalty (π)        7            14                          7           20
Effective miss penalty           0            14                          7           20

Table 4.1: Parameters used for modeling GNU-MP addition. The patch refers to Gaudry's patch [39].

The GNU-MP addition assembly routines prefetch data to avoid memory reference
stall cycles [44]. For relatively old processors such as the UltraSPARC III, such
data prefetching is not effective when the size of the summands exceeds the L1
cache size and the resulting capacity misses cause pipeline stalls. On newer
processors such as the Pentium 4 and Pentium M, as well as the current dual-core AMD and
Intel processors (e.g., Opteron and Pentium EE), speculative execution of memory
reference instructions such as address prediction, load value prediction, and
stride prediction is used to reduce the latency of load instructions by aggressively
prefetching source operands.
Figures 4.1, 4.2, 4.3, 4.4, and 4.5 show plots of both experimentally measured
and modeled arithmetic efficiency (C(L)/L) under both the favorable (warm) and
adverse (cold) L1 data cache conditions. For long integer addition, the "warm"
cache data was obtained after reading the operands once before each measurement
in an attempt to preload the L1 cache. For the modeled data, values for c and h
were extracted from the short integer addition data by the "best fit" method. For
long integer addition, the L1 data cache wait cycles were added as described in the
discussion above. All words of both summands were initialized to 2^64 − 1. Very few
outliers were present in the measurements of short integer addition; these outliers were
removed. The short integer addition was performed ten times; the long integer
addition was performed three times.
This model gives excellent results for the short integer addition. For the long
integer addition, it gives excellent results for the Intel processors and good results
for the UltraSPARC III and Opteron processors.
The straightforward method of Taylor shift by 1 (see Sections 2 and 4.3) exercises
GNU-MP addition only for short integers (see Figure 4.6) that fit in the L1
data cache. However, the entire polynomial will not fit in the cache. An attempt to
understand what happens when GNU-MP addition is called in the context of the
straightforward method for Taylor shift by 1 prompted the decision to provide data
for both the "cold" and "warm" cache conditions.
4.3 Modeling the straightforward method
Our model for the straightforward method is based on the above model of
GNU-MP addition; see Section 2 for the definition of the method and Figure 2.1 for its
pseudocode.
Let S be the number of processor cycles required to perform the computation of the
straightforward method of Taylor shift by 1. For our model, we assume that S is a
function of the degree n of the polynomial, the initial size of its coefficients k ≤ L₂(|A|∞) in
bits (see Definition 1.3.5), and the cycle cost C(L) of GNU-MP addition as discussed
[Two plots titled "Modeling GNU-MP addition on UltraSPARC": measured and modeled cycles per word addition versus length of summands in words, for warm and cold caches, at short and long lengths.]
Figure 4.1: Modeling GNU-MP addition for the UltraSPARC III processor.
[Two plots titled "Modeling GNU-MP addition on Pentium 4": measured and modeled cycles per word addition versus length of summands in words, for warm and cold caches, at short and long lengths.]
Figure 4.2: Modeling GNU-MP addition for the Pentium 4 processor.
[Two plots titled "Modeling GNU-MP addition on Opteron without Gaudry's patch": measured and modeled cycles per word addition versus length of summands in words, for warm and cold caches, at short and long lengths.]
Figure 4.3: Modeling GNU-MP addition without Gaudry's patch for the Opteron processor.
[Two plots titled "Modeling GNU-MP addition on Opteron with Gaudry's patch": measured and modeled cycles per word addition versus length of summands in words, for warm and cold caches, at short and long lengths.]
Figure 4.4: Modeling GNU-MP addition with Gaudry's patch for the Opteron processor.
[Two plots titled "Modeling GNU-MP addition on Pentium EE": measured and modeled cycles per word addition versus length of summands in words, for warm and cold caches, at short and long lengths.]
Figure 4.5: Modeling GNU-MP addition for the Pentium EE processor.
in Section 4.2 above. Experimental measurement of S likely depends on additional
factors not accounted for in this model, such as the L1 data cache behavior while
accessing the array of GNU-MP integers that represent the polynomials.
Each element of Pascal's triangle has a binary length of L₂(a_{i,j}) ≤ k + i + j
by Theorem 1.3.6(2); the longest summand thus has word length L ≤ ⌈L₂(a_{i,j})/w⌉. The
straightforward method performs n(n+1)/2 additions to compute the elements, see
Section 1.3. Therefore, we obtain a computing time bound as follows:

    S = Σ_{1≤i≤n} i · C(⌈(k+i)/w⌉).

We will ignore memory wait cycles since the L1 cache miss penalty is negligible for
GNU-MP integers with L ≤ 500; the integers in our experiments never grow longer
than 500 words. Therefore, substituting C(L) = cL + h (see Section 4.2 above) and
assuming that ⌈(k+i)/w⌉ = (k+i)/w, we get

    S = Σ_{1≤i≤n} i · (c(k+i)/w + h)
      = Σ_{1≤i≤n} (cki/w + ci²/w + ih)
      = (ck/w) Σ_{1≤i≤n} i + (c/w) Σ_{1≤i≤n} i² + h Σ_{1≤i≤n} i
      = (ck/w + h) · n(n+1)/2 + (c/w) · n(n+1)(2n+1)/6.
Since S is an upper bound on the computing time (in cycles) for the straightforward
method, we expect our model to overestimate the computing time. In addition,
we provide modeled performance data based on the run-time values of L that were
experimentally measured for each addition; we add the corresponding computing time for
GNU-MP addition to the total computing time. We do this with values of c and h for both
warm and cold caches.
The modeled data were generated using the parameters described in Section 4.2
above; the initial coefficients were assumed equal to 2^n − 1, where n is the degree. The
data presented in Figures 4.7, 4.8, 4.9, and 4.11 show reasonable agreement between
modeled and measured performance. The greater discrepancy between the modeled
and the measured data on the Intel processors is probably caused by the processors'
more sophisticated out-of-order execution and hardware prefetching units.
A histogram showing the distribution of the length of the sums in the GNU-MP-based
straightforward method for both the experimental and the modeled data is
provided in Figure 4.6. The model incrementally overestimates the length of
the sums, so the overestimate grows as the computation progresses. The experimental
execution does not have as many long additions as the modeled execution assumes. The real
execution "shifts" longer additions toward the middle of the Pascal triangle.
[Histogram: "Histogram of the number of additions, 64-bit straightforward method of Taylor shift by 1"; number of additions versus length of the sum in words, measured and modeled, for degrees 1000, 2000, 4000, and 8000.]
Figure 4.6: The distribution of the length of sums L in the straightforward method. Both experimental and modeled data provided.
[Plot: "Modeling straightforward method for UltraSPARC"; computing time in cycles versus degree, for warm/cold caches with modeled and measured L.]
Figure 4.7: Modeling the straightforward method for the UltraSPARC III processor.
[Plot: "Modeling straightforward method for Pentium 4"; computing time in cycles versus degree, for warm/cold caches with modeled and measured L.]
Figure 4.8: Modeling the straightforward method for the Pentium 4 processor.
[Plot: "Modeling straightforward method for Opteron without Gaudry's patch"; computing time in cycles versus degree, for warm/cold caches with modeled and measured L.]
Figure 4.9: Modeling the straightforward method for the Opteron processor.
[Plot: "Modeling straightforward method for Opteron with Gaudry's patch"; computing time in cycles versus degree, for warm/cold caches with modeled and measured L.]
Figure 4.10: Modeling the straightforward method for the Opteron processor with Gaudry's patch.
Figure 4.11: Modeling the straightforward method for the Pentium EE processor.
4.4 Modeling the tile method
The computing time of the tile method depends entirely on the performance of
the register tile and the delayed carry propagation routines. However, it is difficult
to predict how efficiently a compiler will schedule the code.
In order to model performance of the tile method, we begin with a discussion
of the lower bound for computing time of the register tile and the delayed carry
propagation and compare it to the measured performance results.
In order to derive the lower bound for the computing time, let us expand on Theorem 3.3.1 (2).
Let the register tile T_{b,b} be a square tile with sidelength b, see Remark 3.2.2
and Figure 3.2 (a). Let u be the number of integer execution units (IEUs) that can
be engaged simultaneously by the register tile. Current processors typically feature
2–4 IEUs. There will be b × b additions for the square register tile; most of the additions
can be executed simultaneously on u IEUs, see Figure 3.2 for an example with
u = 2. At the beginning and at the end of the register tile computation there are
several dependent additions that cannot be parallelized: u(u − 1)/2 at each end. These
dependent additions require u − 1 cycles to execute at each end. Then, provided u
divides b, we have (b · b − u(u − 1))/u + 2(u − 1) cycles for additions per register tile.
Let m cycles be the latency for memory references (i.e., the L1 data cache
latency). Let us further assume that memory operations can be pipelined with
CPI < 1, which is true for all current processors. This assumption implies that we
should be able to read the initial 2 words for an addition, or write the 2 words after the
final addition, in m + 1 cycles. It also implies that all the other memory operations
can be scheduled simultaneously with the remaining addition operations and will
not consume any additional cycles, see Theorem 3.3.1 (2). Thus, the memory references will
contribute only 2(m + 1) cycles to the register tile computation.
Therefore, the total cost in cycles per register tile, without the carry propagation,
is at least t = (b · b − u(u − 1))/u + 2(u − 1) + 2(m + 1).
Table 4.2 lists the experimentally measured times for the register tile computation;
the tile size is the optimal value for the particular platform, see Section 3.5. The lower bound
(in cycles) for the register tile, as explained above, is listed in Table 4.3. Table 4.3
also presents the corresponding measured performance and the ratio between the
two. In each case except the Opteron processor there is a substantial difference
between the lower bound estimate and the experimentally measured performance.
For the UltraSPARC III processor, the datum differs from our previously published
results [59] and reflects a new purpose for the measurement. In the publication [59]
(also see Section 3.3), we were interested in inducing the compiler to meet
the lower bound. The current measurement was performed by measuring the time
(in cycles) to call and execute the actual function that implements the register tile
in our tile method code. In addition, the new experiments were performed with
different compiler optimization flags (see Section 1.5.3) that favor the performance of
the tile method as a whole, probably at the cost of individual register tiles.
Let p be the computational cost of the delayed carry propagation for each word
on the register tile border, see Figure 3.1 (b). There are 2b words (two register tile
sides) from which the accumulated carries need to be released. Thus 2b × p
cycles are spent on carry computation per register tile.
The code illustrated in Figure 4.12 has 4 dependent instructions per word:
1 addition, 1 subtraction, 1 right shift, and 1 left shift; and 3 instructions that can
be overlapped: 1 read, 1 write, and 1 branch. This implies that the cost of the carry
propagation computation is at least 4 cycles per word, provided no ILP features are
engaged, no branch mispredictions occur, and no load hazards are encountered.

Experimental data (cycles)
b                 4     6     8    10    12    14    16
Pentium EE       96    96   120   144   168   600   704
Opteron          19    29    42    58    87   371   499
Pentium 4        96   126   210   284   442   532   698
UltraSPARC III   49    75   102   133   176   182   216

Table 4.2: Experimentally determined cost for register tile execution.

                  b   Lower bound   Experimentally measured   Ratio
Pentium EE       12    81 cycles         168 cycles            2.07
Opteron          12    81 cycles          87 cycles            1.07
Pentium 4         6    27 cycles         126 cycles            4.67
UltraSPARC III    8    39 cycles         102 cycles            2.62

Table 4.3: Cost of the b × b register tile execution in cycles. The chosen b is the optimal value for the particular platform, see Section 3.5.
The cost of carry propagation per word released was measured experimentally
and is presented in Table 4.4; the measurements are for a register tile stack 100
register tiles high and are averages over 10000 runs. Previous experiments showed a
significant improvement from the rolled carry propagation code to the unrolled code
[59]; these results confirm those findings.
These experimental measurements are compared with the above estimated lower
inline void release_carries(baseint *P, int indx, int *Pl, int span, int tile, int n)
{
    baseint c, s;
    int i, digit, pl;
    pl = Pl[tile];
    c = 0;
    for (i = 0; i < span; i++) {
        for (digit = 0; digit < pl; digit++) {
            s = P[digit*(n+1)+indx+i] + c;   // add carry
            c = s / BETA;                    // compute carry
            s = s - c * BETA;                // compute current digit
            P[digit*(n+1)+indx+i] = s;       // set it
        }
        while (c != 0) {
            s = P[pl*(n+1)+indx+i] + c;      // add carry to existing digit
            c = s / BETA;                    // compute carry
            s = s - c * BETA;                // compute current digit
            P[pl*(n+1)+indx+i] = s;          // set it
            pl++;                            // increment the length
        }
    }
    Pl[tile] = pl;                           // set the length
}

Figure 4.12: The rolled delayed carry release routine for the tile method.
                    Pentium EE    Opteron     Pentium 4     UltraSPARC III
Rolled              37.73-38.22   7.82-8.06   15.78-16.33   14.77-14.87
Unrolled            13.44-17.07   5.06-6.58    8.23-12.12    5.01-10.12
Unrolled variable   15.01-17.32   5.26-6.83    9.07-11.57    5.44-10.13

Table 4.4: Experimentally determined cost of the 3 versions of the carry propagation in processor cycles for register tile sizes ranging from 4 × 4 to 24 × 24.
                  b   Lower bound   Experimentally measured   Ratio
Pentium EE       12    4 cycles         13.44 cycles           3.36
Opteron          12    4 cycles          5.06 cycles           1.27
Pentium 4         6    4 cycles          8.23 cycles           2.06
UltraSPARC III    8    4 cycles          5.01 cycles           1.25

Table 4.5: Cost of the delayed carry propagation in cycles.
                           Pentium EE    Opteron     Pentium 4    UltraSPARC III
Tile size b                12 words      12 words    6 words      8 words
Register tile cost t       168 cycles    87 cycles   126 cycles   102 cycles
Carry propagation cost p   13.5 cycles   5 cycles    8.5 cycles   5 cycles

Table 4.6: Parameters used in modeling the tile method.
bound cycles in Table 4.5. The difference between the experimental measurements
and the lower bound is small for Opteron and UltraSPARC III processors.
Ignoring the different register tile shapes and assuming only square tiles, the tile
method computes ⌈(n + 1)/b⌉ (⌈(n + 1)/b⌉ + 1)/2 register tile stacks. The height H of
each stack will grow by at most 2b bits by Theorem 1.3.6(2).
The parameters used for the modeling are presented in Table 4.6. The impact
of cache misses on performance is assumed to be of minor significance.
The modeling results for the Pentium EE, AMD Opteron, Pentium 4, and
UltraSPARC III architectures are presented and compared to measured data in
Figures 4.13, 4.14, 4.15, and 4.16 respectively. The measured data were generated with
code scheduled for 2 IEUs. The modeled data were generated using the parameters
described in Table 4.6. The initial coefficients were assumed to be k = 2^n − 1, where n is
Figure 4.13: Modeling the tile method on the Pentium EE architecture.
the degree.
Modeling of the tile method clearly shows that the measured computational
cost of the components does not predict the computational cost of the whole
method well. Thus, accurately modeling and predicting performance is difficult; this calls
for automatic code generation and tuning, see Section 3.5.
4.4.1 Impact of changing the number of engaged IEUs
Our software was originally designed for UltraSPARC III, a processor with u = 2.
The 2 IEUs assumption is hard-coded in our register tile schedule, see Section 3.3.1.
However, there is a linear dependence between the number of engaged IEUs u and
performance. Our model for the lower bound predicts a 7–27% and a 12–46%
performance improvement when u is changed from 2 to 3 and to 4 respectively; see
Figure 4.17 for an example. This predicts a potentially substantial performance gain
from rescheduling the register tile code to utilize more IEUs. We suggest, therefore,
that a search for the optimal u should be made part of automatic tuning and code
generation in the future, see Section 3.5.

Figure 4.14: Modeling the tile method on the Opteron architecture.

Figure 4.15: Modeling the tile method on the Pentium 4 architecture.

Figure 4.16: Modeling the tile method on the UltraSPARC III architecture.
4.4.2 Finding optimal register tile size
The optimal tile size depends on several hard-wired features of the processor
architecture, such as the number of registers in the processor's register file, the word
size, and the superscalar capabilities. The optimal tile size also depends on the
register usage conventions and the compiler.
For each sweep within the register tile, at least b + u registers are required. Several
additional registers may be required by the compiler for housekeeping or due to the
Figure 4.17: Impact of changing the number of IEUs on the lower bound for the computing time of the tile method for the AMD Opteron architecture.
register usage conventions.
For example, the UltraSPARC III processor is a RISC general-purpose
register architecture, which by convention reserves several registers for OS
kernel use. On the UltraSPARC III, we had initially derived the optimal tile size
manually [59] and later confirmed it experimentally [58] using our automatic code
generation and tuning technique, see Section 3.5 below.
For the Intel processors and the AMD Opteron, the GCC compiler appears to need
only 2 registers for housekeeping. Therefore, the optimal register tile size was
predicted to be b = p − u − 2, where p is the number of general-purpose registers in the
processor's register file. For the Pentium EE, Opteron, and Pentium 4 processors,
p is 16, 16, and 8 registers respectively, and the optimal b is predicted to be 12, 12,
and 4 respectively for u = 2. Our experiments verified that the optimal tile size for
the Pentium EE and Opteron processors is b = 12; for the Pentium 4, b = 6 gives
slightly better results.
5. Asymptotically fast methods
5.1 Introduction
There are 4 known asymptotically fast methods for computing Taylor shift by 1:
Paterson and Stockmeyer's method [89, 41, 75], the divide and conquer method
[89, 41, 88, 11], the convolution method [89, 41, 5, 80], and the modular convolution
method [97].
This chapter follows the notation used by von zur Gathen and Gerhard in their
1997 work [89, 41] and by Gerhard in his 2004 work [41]. All experimental data in
this chapter were produced with polynomials with initial coefficients set to 2^20 − 1. In
our experiments, we used the Taylor shift code kindly made available to us by
Jürgen Gerhard [89, 41] and Paul Zimmermann (modular convolution method) [97].
Let f(x) be an integer polynomial and g(x) = f(x + 1); then the 4 asymptotically
fast methods of Taylor shift by 1 are as follows:
Paterson and Stockmeyer's method
Assume (n + 1) = m^2 is a square (padding f with leading zeroes if necessary),
and write f = Σ_{0≤i<m} f^(i) x^{mi}, with polynomials f^(i) ∈ R[x] of degree less than m
for 0 ≤ i < m. Compute (x + 1)^i for 0 ≤ i < m. For 0 ≤ i < m, compute f^(i)(x + 1)
as a linear combination of 1, (x + 1), (x + 1)^2, ..., (x + 1)^{m−1}. Finally compute
g(x) = Σ_{0≤i<m} (x + 1)^{mi} f^(i)(x + 1) in a Horner-like fashion.
Divide and conquer method
Assume that (n + 1) = 2^m is a power of 2. Precompute (x + 1)^{2^i} for 0 ≤ i < m.
Divide f(x) as f(x) = f^(0)(x) + x^{(n+1)/2} f^(1)(x), where the polynomials f^(0)(x),
f^(1)(x) ∈ R[x] have degree less than (n + 1)/2. Then g(x) = f^(0)(x + 1) + (x +
1)^{(n+1)/2} f^(1)(x + 1), where f^(0)(x + 1) and f^(1)(x + 1) are computed recursively.
Convolution method
This method works if n! is not a zero divisor in R. By Theorem 1.3.1, g_k =
Σ_{k≤i≤n} (i choose k) f_i. If we multiply both sides by n! k!, we obtain n! k! g_k =
Σ_{k≤i≤n} (i! f_i) (n!/(i − k)!) in R. Let u = Σ_{0≤i≤n} i! f_i x^{n−i} and v = n! Σ_{0≤j≤n} x^j/j!
in R[x]; then n! k! g_k is the coefficient of x^{n−k} in the product polynomial uv.
Modular convolution method
We can apply the convolution method modulo an integer m > 2^{2n}, where n is
the degree, as long as m is relatively prime to n!. This is based on the observation
that for ‖f‖_∞ < n the coefficients of g(x) have at most 2n bits. This method works
only with non-negative coefficients but can be extended to work with all integer
coefficients, for example, by separating negative and positive coefficients, writing
f(x) = f⁺(x) − f⁻(x), and computing the corresponding g(x) separately [97]. We
did not implement the method capable of handling negative coefficients.
The code used in experiments for the divide and conquer, convolution, and
Paterson-Stockmeyer methods was NTL-based. The code used for the modular
convolution method was GNU-MP-based.
Figure 5.1: The tile method is faster than the asymptotically superior divide and conquer method for a wide range of degrees.
5.2 Performance of the fast methods
Initially, we compared the performance of von zur Gathen and Gerhard's NTL-based
divide and conquer Taylor shift by 1 [89, 41] with our tile method [59]. The
tile method is faster up to degree 6000 on all 4 target platforms, see Figure 5.1.
This result was published in 2006 [58].
We have compared the 4 asymptotically fast methods on our 4 target architectures.
The performance of the methods relative to the tile method is reported in
Figures 5.2, 5.3, 5.4, and 5.5. The divide and conquer method outperforms the other
theoretically fast methods for degrees 100 < n < 7500 on all 64-bit architectures. The
data were collected for polynomials B_{n,d} with d = 2^20 − 1. We did not measure for
degrees n > 7500 because the experiments would take very long to run.
Figure 5.2: All asymptotically fast methods are slower than the divide and conquer method on the Pentium EE. The convolution method is over 80× slower than the tile method and is not shown.
Figure 5.3: All asymptotically fast methods are slower than the divide and conquer method on the Opteron. The convolution method is over 110× slower than the tile method and is not shown.
Figure 5.4: All asymptotically fast methods are slower than the divide and conquer method on the Pentium 4. The convolution method is over 50× slower than the tile method and is not shown.
Figure 5.5: All asymptotically fast methods are slower than the divide and conquer method on the UltraSPARC III. The convolution method is over 200× slower than the tile method and is not shown.
5.3 Computing times in the literature
Von zur Gathen and Gerhard [89, 41] have published computing times for their
NTL-based implementations of the asymptotically fast methods. Tables 5.1, 5.2, 5.3,
5.4, 5.5, and 5.6 quote those computing times and list our experimental measurements
for the three methods on the UltraSPARC III, Pentium 4, Opteron, and
Pentium EE processors. The computing times we quote were obtained on an
UltraSPARC workstation rated at 167 MHz [89] and on a Pentium III 800 MHz Linux
PC [41]; our experiments were performed using compiled libraries as described in
Section 1.5.3. Von zur Gathen and Gerhard ran their program letting k = 7, ..., 13
and n = 2^k − 1 for input polynomials of degree n. Tables 5.1, 5.2, and 5.3 quote and
report computing times using "small" coefficients, i.e., max-norm < n. Tables 5.4, 5.5,
and 5.6 quote and report computing times using "large" coefficients, i.e., max-norm
< 2^{n+1}. The integer coefficients were pseudo-randomly generated; we used the same
input polynomials in our experiments. Our data corroborate von zur Gathen and
Gerhard's conclusion that the divide and conquer method is the fastest of the three
methods at high degrees.
5.4 Improving performance of the fast methods
Broadly, there are two ways to approach redesigning the fast methods: a top-down
and a bottom-up approach. The former is time consuming but in the long
run may lead to better performance results. The tile method for the classical Taylor
shift by 1 [59] is an example of a complete algorithm redesign for high performance.
The bottom-up hierarchical approach in the case of the asymptotically fast Taylor
Degree UltraSPARC [89] Pentium III [41] UltraSPARC III Pentium 4 Opteron Pentium EE
31 — — 0.00065 0.00841 0.00018 0.00021
63 — — 0.00180 0.00050 0.00049 0.00047
127 0.260 0.010 0.00452 0.00142 0.00123 0.00113
255 1.640 0.072 0.01226 0.00430 0.00331 0.00307
511 11.450 0.432 0.04065 0.01630 0.01012 0.00939
1023 86.090 2.989 0.15353 0.07650 0.03795 0.03665
2047 713.200 16.892 0.68125 0.36726 0.18785 0.17505
4095 — 125.716 3.16158 1.84255 0.91648 0.86930
8191 — — 17.34230 9.76371 4.55053 4.37315
Table 5.1: Computing times (s.) for the divide and conquer method of Taylor shift by 1 —"small" coefficients.
Degree UltraSPARC [89] Pentium III [41] UltraSPARC III Pentium 4 Opteron Pentium EE
31 — — 0.00058 0.00028 0.00017 0.00022
63 — — 0.00222 0.00115 0.00064 0.00064
127 0.060 0.044 0.01515 0.00973 0.00418 0.00407
255 0.640 0.290 0.10956 0.06471 0.02940 0.02820
511 7.430 2.007 0.67894 0.44935 0.19416 0.18580
1023 87.570 13.958 4.60220 3.09768 1.25495 1.19873
2047 1387.390 98.807 33.93170 22.13006 8.53665 8.15568
4095 — 787.817 240.62589 160.39211 59.54717 56.98022
8191 — — 1721.13593 1666.83422 423.89578 2509.76500
Table 5.2: Computing times (s.) for the convolution method of Taylor shift by 1 —"small" coefficients.
Degree UltraSPARC [89] Pentium III [41] UltraSPARC III Pentium 4 Opteron Pentium EE
31 — — 0.00033 0.00008 0.00009 0.00019
63 — — 0.00088 0.00026 0.00027 0.00026
127 0.080 0.010 0.00301 0.00115 0.00089 0.00083
255 0.440 0.072 0.01318 0.00511 0.00376 0.00364
511 2.480 0.602 0.06557 0.02785 0.01925 0.01820
1023 15.530 6.364 0.36632 0.17988 0.11426 0.10303
2047 102.640 57.744 1.34248 0.83561 0.41286 0.41554
4095 — 722.757 10.29823 5.78704 2.93546 3.18046
8191 — — 71.02069 42.28142 19.20488 22.63864
Table 5.3: Computing times (s.) for the Paterson & Stockmeyer method of Taylor shift by 1 —"small" coefficients.
Degree UltraSPARC [89] Pentium III [41] UltraSPARC III Pentium 4 Opteron Pentium EE
31 — — 0.000695 0.000192 0.000178 0.000213
63 — — 0.001978 0.000600 0.000514 0.000536
127 6.340 0.489 0.005770 0.002020 0.001446 0.001503
255 107.000 9.566 0.019537 0.008620 0.005227 0.005123
511 — 166.138 0.089770 0.050037 0.024322 0.023670
1023 — — 0.422919 0.187685 0.113050 0.116488
2047 — — 1.715237 0.983371 0.481346 0.472606
4095 — — 8.759840 5.525364 2.488822 2.449783
8191 — — 51.315470 32.580679 13.996941 13.693067
Table 5.4: Computing times (s.) for the divide and conquer method of Taylor shift by 1 —"large" coefficients.
Degree UltraSPARC [89] Pentium III [41] UltraSPARC III Pentium 4 Opteron Pentium EE
31 — — 0.000621 0.000257 0.000187 0.000176
63 — — 0.002456 0.002083 0.000721 0.000685
127 7.880 0.524 0.017887 0.013282 0.004989 0.004713
255 241.540 13.262 0.122438 0.287136 0.036668 0.032581
511 7453.690 234.087 0.762593 0.674811 0.278306 0.270762
1023 — — 5.237463 3.619739 1.476873 1.424302
2047 — — 38.150037 25.160389 9.522594 9.090904
4095 — — 268.787407 182.303800 66.651315 63.672271
8191 — — 1928.247407 2672.559111 475.816648 1874.960729
Table 5.5: Computing times (s.) for the convolution method of Taylor shift by 1 —"large" coefficients.
Degree UltraSPARC [89] Pentium III [41] UltraSPARC III Pentium 4 Opteron Pentium EE
31 — — 0.000336 0.002618 0.002409 0.083244
63 — — 0.001028 0.345857 0.000289 0.015961
127 4.810 0.700 0.004395 0.078607 0.001279 0.266423
255 76.210 14.894 0.022186 2.403883 0.006682 0.795823
511 1289.730 420.562 0.128907 8.865631 0.040849 0.676700
1023 — — 0.830711 2.888625 0.300753 2.038819
2047 — — 3.478230 2.746924 1.002791 1.055621
4095 — — 29.064893 17.772409 8.059511 9.584729
8191 — — 218.780667 151.508244 58.104000 72.311823
Table 5.6: Computing times (s.) for the Paterson & Stockmeyer method of Taylor shift by 1 —"large" coefficients.
Figure 5.6: Using 64-bit arithmetic on the Pentium EE improves the crossover point. The convolution method is over 30× slower than the tile method and is not shown.
shifts by 1 would involve improving integer arithmetic, polynomial arithmetic, and
high-level coding. Only the first has been explored experimentally.
5.4.1 Replacing native NTL arithmetic with GNU-MP arithmetic
If we replace the native NTL arithmetic, which uses a radix of 2^30, with GNU-MP
arithmetic, which uses a radix of 2^64 (except on the Pentium 4 architecture,
where the radix is 2^32), we get a speedup that moves the crossover points in the
desirable direction, see Figures 5.6, 5.7, 5.8, and 5.9. However, the improvement in
performance is not sufficient to make the fast methods superior for any practical
degree range.
Replacing the native GNU-MP arithmetic with Gaudry's patch on the Opteron
Figure 5.7: Using 64-bit arithmetic on the Opteron improves the crossover point. The convolution method is over 25× slower than the tile method and is not shown.
Figure 5.8: Using 64-bit arithmetic on the Pentium 4 improves the crossover point. The convolution method is over 8× slower than the tile method and is not shown.
Figure 5.9: Using 64-bit arithmetic on the UltraSPARC III improves the crossover point. The convolution method is over 62× slower than the tile method and is not shown.
processor causes a minor improvement for the divide and conquer method but a
significant improvement for the convolution method, which is the slowest method,
see Figure 5.10.
Thus, improving the integer arithmetic leads to a significant performance advantage.
However, this advantage does not change the crossover points enough to make
any of the fast methods a viable replacement for the tile method.
Nonetheless, studying how the computing times change when switching from the
native NTL arithmetic to GNU-MP-based arithmetic gives us an estimate of how much
improvement we can expect if we recode the algorithms in GNU-MP. This would not
be a trivial task, however, as we would need to implement the necessary polynomial
arithmetic using the GNU-MP integer arithmetic. Polynomial arithmetic is
Figure 5.10: Gaudry's patch improves the crossover point on the Opteron. The convolution method is over 19× slower than the tile method and is not shown.
efficiently implemented in the NTL package but is not provided by the GNU-MP
library.
5.4.2 Observations on coding
The fast methods (courtesy of Jürgen Gerhard) are skillfully coded using calls
to the NTL library. However, NTL is written in portable C++ and treats both
polynomials and integers as ADT classes with constructors and destructors. All
NTL-based fast methods for Taylor shift by 1 involve repeated copying of the
intermediate polynomials. Such construction and destruction is probably costly. Using
in-place computation techniques may be warranted.
5.5 Conclusions
None of our approaches for improving the performance of the fast Taylor shift by 1
algorithms has yielded a method that outperforms our tile method for any currently
useful degree. However, we have demonstrated that effective utilization of computer
architecture features significantly affects the crossover points.
6. Applications
In this chapter, we discuss applications of the tile method, both as a technique
for improving analogous computer algebra algorithms and as a high-performance
algorithm that can be used in important applications. Specifically, de Casteljau's
algorithm has a similar pattern of additions and can be redesigned for performance
using the register tiling technique from the tile method for Taylor shift by 1. Both
algorithms play an important role in the two variants of the Descartes method for
real root isolation.
6.1 High-performance de Casteljau's algorithm
The Taylor shift by 1 and de Casteljau's algorithms share a similar computational
structure. The algorithms have a similar dependency pattern, but the order of
computation is reversed. Figures 6.1 and 6.2 present diagrams of addition direction
within the Pascal triangle for the two algorithms.
The technique of register tiling that makes the classical Taylor shift by 1 effi-
Figure 6.1: The computation structure of (a) Taylor shift by 1 and (b) de Casteljau's algorithms.
Figure 6.2: (a) The pattern of integer additions in Pascal's triangle, a_{i,j} = a_{i,j−1} + a_{i−1,j}, can be used to perform Taylor shift by 1. (b) In de Casteljau's algorithm all dependencies are reversed; the intermediate results are computed according to the recursion b_{j,i} = b_{j−1,i} + b_{j−1,i+1}.
Figure 6.3: Register tiling can be applied to (a) Taylor shift by 1 and (b) de Casteljau's algorithm. Arrows show direction of addition.
cient [59] can be used to redesign de Casteljau's algorithm as well, see Figure 6.3.
Tiling the classical Taylor shift by 1 was discussed in Chapter 3.
In order to allow our tiled implementations to be used on multiple architectures,
we have implemented an automatic code generator in Perl [91] for de Casteljau's
algorithm. The generator is similar to the one used for generating the tile method
for Taylor shift by 1, see Section 3.5. It produces portable ANSI C++ [52, 13] code
for tiles of different sizes, compiles and executes the code, and searches through
successively larger tile sizes until a tile size with peak performance is discovered.
The limitations of the tile method for Taylor shift by 1 carry over to the tile method
for de Casteljau's algorithm: a single addition schedule for a CPU with two addition
units.
6.2 High-performance Descartes method
In this section we describe our implementation of the Descartes method using
the tiled version of de Casteljau's algorithm, see Section 6.1. We then compare the
performance of our implementation with the method of Hanrot et al., the SYNAPS
method, and two architecture-unaware implementations from the SACLIB library [22].
6.2.1 Monomial vs. Bernstein bases
The Descartes method, independent of the basis used to represent polynomials,
uses binary search to find isolating intervals and relies on the Descartes rule of signs
to determine when an isolating interval has been found or when the search can stop
since there are no roots in the given interval. Let A(x) = amxm + • • • + (i\X + a0.
The Descartes rule states that the number of coefficient sign variations, var(A), is
121
greater than or equal to the number of positive roots of A, and that the difference
is even. This provides an exact test when var(A) ∈ {0, 1}.
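As an illustration of the rule, a sign-variation count can be sketched as follows. The function name is ours, and machine integers are used for brevity; the implementations discussed below work with arbitrary-precision integers.

```cpp
#include <vector>

// Count coefficient sign variations var(A); zero coefficients are
// skipped. By the Descartes rule of signs, var(A) bounds the number
// of positive roots from above and agrees with it modulo 2, so the
// count is exact when it is 0 or 1.
int sign_variations(const std::vector<long>& coeffs) {
    int count = 0;
    int prev = 0;                        // sign of last nonzero coefficient
    for (long c : coeffs) {
        int s = (c > 0) - (c < 0);
        if (s != 0) {
            if (prev != 0 && s != prev) ++count;
            prev = s;
        }
    }
    return count;
}
```

For example, A(x) = x^2 - 3x + 2 = (x - 1)(x - 2) has coefficient list (2, -3, 1) with two sign variations, matching its two positive roots.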
The following polynomial transformations are needed for the method and the
mapping between the monomial basis and the Bernstein basis:
1. Translation: T_c(A(x)) = A(x - c),
2. Reciprocal transformation: R(A(x)) = x^m A(1/x), and
3. Homothetic transformation: H_a(A(x)) = A(x/a).
The method proceeds by using a root bound and a homothetic transformation to
transform the input polynomial to a polynomial, A, whose roots in the interval (0, 1)
correspond to the positive roots of the input polynomial. It can be advantageous
to compute the negative roots separately, using a separate root bound for them.
When using the monomial basis, the Descartes rule is applied to the transformed
polynomial A* = T_{-1}R(A) to determine whether A has zero or one real root in
the interval (0, 1). Bisection is performed by computing the transformed polynomials
A_1 = H_2(A) and A_2 = T_{-1}H_2(A), whose roots in the interval (0, 1) correspond
to the roots of A in the intervals (0, 1/2) and (1/2, 1), respectively. The Descartes
rule is then applied to A*_1 = T_{-1}R(A_1) and A*_2 = T_{-1}R(A_2); if more than
one coefficient sign variation is obtained, the algorithm proceeds recursively with the
bisected polynomials.
Associated with this bisection process is a binary tree, where each node in the
tree has an associated subinterval and polynomial. Each internal node requires the
computation of three polynomial translations T_{-1}, i.e., Taylor shifts by 1, to compute
the bisection polynomial and the two applications of the Descartes rule, while leaf
nodes only require the polynomial translations for the application of the Descartes
rule. The bulk of the computing time for the method is devoted to Taylor shift by
1. Figure 6.2(a) shows the classical computation of A(x+1) = Σ_{h=0}^{m} a_{m-h,h} x^h, see
also Theorem 1.3.3. Note that it is possible to avoid the complete computation of
the Taylor shift in the application of the Descartes rule by stopping as soon as more
than one sign variation is detected.
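The Pascal-triangle pattern of additions corresponds to the following in-place computation, sketched here with machine integers rather than the arbitrary-precision integers used in the actual implementations.

```cpp
#include <cstddef>
#include <vector>

// Classical Taylor shift by 1: given the coefficients of A(x)
// (a[i] is the coefficient of x^i), compute A(x+1) in place via
// repeated Pascal-triangle additions. With arbitrary-precision
// integers each addition costs time proportional to the operand
// lengths; here machine longs keep the sketch short.
void taylor_shift_by_1(std::vector<long>& a) {
    const std::size_t m = a.size() - 1;      // degree
    for (std::size_t k = 0; k < m; ++k)
        for (std::size_t j = m; j-- > k; )   // j = m-1 down to k
            a[j] += a[j + 1];
}
```

For instance, shifting x^2 (coefficients 0, 0, 1) yields (x+1)^2 with coefficients 1, 2, 1.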
Let B_{m,i}(x) = (m choose i) x^i (1-x)^{m-i}, i = 0, ..., m, be the Bernstein basis, and write
A(x) = Σ_{i=0}^{m} b_i B_{m,i}(x). Since T_{-1}R(B_{m,i}(x)) = (m choose i) x^{m-i}, we have
T_{-1}R(A(x)) = Σ_{i=0}^{m} (m choose i) b_i x^{m-i}
and var(A*(x)) = var(b_0, ..., b_m). The Bernstein representation of the bisection poly-
nomials, A_1(x) and A_2(x), can be obtained from the coefficients of the Bernstein
representation of A(x) using de Casteljau's algorithm. In order to preserve integer
coefficients a fraction-free variant is used. For 0 ≤ i ≤ m set b_{0,i} = b_i, and for
1 ≤ j ≤ m and 0 ≤ i ≤ m - j set b_{j,i} = b_{j-1,i} + b_{j-1,i+1}. As Figure 6.2(b) shows, this
computation is similar to the computation of the Taylor shift by 1, except that the
computation proceeds in the reverse direction. Eigenwillig et al. [31] remark that if
b'_i = 2^{m-i} b_{i,0} and b''_i = 2^i b_{m-i,i}, then A_1(x) = Σ_{i=0}^{m} b'_i B_{m,i}(x) and A_2(x) = Σ_{i=0}^{m} b''_i B_{m,i}(x).
This establishes a one-to-one mapping between the nodes and the associated polyno-
mials in the search trees for the monomial and Bernstein variants of the algorithm.
Moreover, assuming classical algorithms for de Casteljau's transformation and Taylor
shift, the cost of the computation at each node is codominant for both variants, and
hence the total computing times are codominant. In contrast to the monomial basis,
each internal node requires one application of de Casteljau's algorithm instead of three
Taylor shifts by 1, and no transformations are required for leaf nodes. A similar approach,
called the dual algorithm, which also reduces the number of Taylor shifts by computing
A*_1(x) and A*_2(x) directly from A*(x) using monomial bases, was suggested
by Johnson [56].
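A minimal sketch of the fraction-free de Casteljau subdivision follows; the function name and the machine-integer representation are illustrative assumptions, not SACLIB's interface.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Fraction-free de Casteljau subdivision at 1/2 following the
// recursion b_{j,i} = b_{j-1,i} + b_{j-1,i+1}. Returns the Bernstein
// coefficients b'_i = 2^(m-i) * b_{i,0} and b''_i = 2^i * b_{m-i,i}
// of the two bisection polynomials, as in Eigenwillig et al. [31].
std::pair<std::vector<long>, std::vector<long>>
de_casteljau_halves(std::vector<long> b) {        // b holds b_{0,0..m}
    const std::size_t m = b.size() - 1;
    std::vector<long> left(m + 1), right(m + 1);
    left[0] = b[0];                               // b_{0,0}
    right[m] = b[m];                              // b_{0,m}
    for (std::size_t j = 1; j <= m; ++j) {
        for (std::size_t i = 0; i + j <= m; ++i)
            b[i] = b[i] + b[i + 1];               // row j overwrites row j-1
        left[j] = b[0];                           // b_{j,0}
        right[m - j] = b[m - j];                  // b_{j,m-j}
    }
    for (std::size_t i = 0; i <= m; ++i) {        // apply the powers of 2
        left[i]  <<= (m - i);                     // b'_i  = 2^(m-i) b_{i,0}
        right[i] <<= i;                           // b''_i = 2^i b_{m-i,i}
    }
    return {left, right};
}
```

As a sanity check, subdividing A(x) = x (Bernstein coefficients 0, 1 for m = 1) yields the halves 0, 1 and 1, 2, which are the Bernstein coefficients of 2·(x/2) and 2·((x+1)/2).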
6.2.2 The Descartes methods we compare
The descriptions of the Descartes methods we compared follow.
The monomial SACLIB method IPRRID
The program IPRRID in the SACLIB library [22] processes the bisection tree
in breadth-first order [62, 78]. The IPRRID program calls the IUPTR1 routine,
i.e., the classical Taylor shift by 1 algorithm included in the SACLIB library, see
Section 2.4. The IPRRID routine also includes calls to a partial Taylor shift by 1
and will avoid the complete computation of the Taylor shift in the application of
the Descartes rule by stopping as soon as more than one sign variation is detected.
Also, IPRRID checks whether var(A) ≠ 0 before computing T_{-1}R(A).
The Bernstein SACLIB method IPRRIDB
The program IPRRIDB (courtesy of G. E. Collins) is based on the SACLIB
library [22]. The program converts the input polynomial from its monomial
representation into a fraction-free Bernstein-basis representation. IPRRIDB processes
the bisection tree in the same way as the program IPRRID of Section 6.2.2 above.
IPRRIDB uses a fraction-free version of de Casteljau's algorithm that avoids the
overhead of calling integer addition routines and of normalizing after each integer
addition—in the same way as the SACLIB Taylor shift by 1 program IUPTR1, see
Sections 6.2.2 and 2.4.
The method by Hanrot et al.
Hanrot et al. [45] provide an efficient implementation of the monomial version of
the Descartes method that incorporates the memory-saving technique of Rouillier
and Zimmermann [78]. Their implementation uses GNU-MP [44] for the integer
additions required by Taylor shift operations.
Additional algorithmic optimizations are included to reduce the time spent on
Taylor shift. The complete execution of the Taylor shift used to compute T_{-1}R prior
to the application of the Descartes rule is not needed in many situations. If all of
the input coefficients are of the same sign, then the transformed polynomial will
have zero coefficient sign variations and the Taylor shift can be avoided. If all of
the intermediate coefficients in a column of the Taylor shift computation have the
same sign, then the remaining result coefficients will have the same sign, and the
computation can be aborted. If exactly two sign variations are reported, then there
are either zero or two roots in the interval. If the signs of the polynomial evaluated
at 0 and 1 are equal but different from the sign at 1/2, then two roots have been
found and the algorithm can terminate avoiding the additional Taylor shifts needed
for the Descartes test to report the termination. This test is efficient to apply since
the polynomial evaluated at 1 is equal to the sum of the coefficients and the sum
is known after computing the first column of intermediate coefficients in the Taylor
shift computation. In practice, computation of a partial rather than a complete
Taylor shift along with the early termination tests can save a substantial amount of
time.
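The column-wise early termination can be sketched as follows. This is a simplified model of the test, not the code by Hanrot et al., and it again uses machine integers; as soon as all intermediate coefficients still being updated share one nonzero sign, the remaining result coefficients must share it too, so the sign-variation count can be completed without further additions.

```cpp
#include <cstddef>
#include <vector>

// Returns the number of sign variations of A(x+1), possibly without
// computing the complete Taylor shift.
int var_of_shifted(std::vector<long> a) {
    const std::size_t m = a.size() - 1;
    for (std::size_t k = 0; k < m; ++k) {
        for (std::size_t j = m; j-- > k; )        // one pass of the shift
            a[j] += a[j + 1];
        // a[k] is now final; a[k..m] is the current column
        bool uniform = (a[k] != 0);
        for (std::size_t j = k + 1; uniform && j <= m; ++j)
            uniform = (a[j] != 0) && ((a[j] > 0) == (a[k] > 0));
        if (uniform) break;    // remaining coefficients keep this sign
    }
    int count = 0, prev = 0;   // count sign variations of the result
    for (long c : a) {
        int s = (c > 0) - (c < 0);
        if (s != 0) {
            if (prev != 0 && s != prev) ++count;
            prev = s;
        }
    }
    return count;
}
```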
In a pre-processing step the method determines the greatest k such that the
input polynomial A(x) is a polynomial in x^k, and replaces A(x) by A(x^{1/k}). If k is
even, the method isolates only the positive roots.
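The greatest such k is the gcd of the exponents of the nonzero terms (the constant term does not constrain k), which can be computed in a single pass. The helper below is an illustrative sketch, not code from the method by Hanrot et al.

```cpp
#include <cstddef>
#include <numeric>   // std::gcd (C++17)
#include <vector>

// Greatest k such that A(x), given as a dense coefficient vector
// (a[e] is the coefficient of x^e), is a polynomial in x^k.
std::size_t greatest_power(const std::vector<long>& a) {
    std::size_t k = 0;                     // gcd(0, e) = e
    for (std::size_t e = 1; e < a.size(); ++e)
        if (a[e] != 0) k = std::gcd(k, e);
    return k == 0 ? 1 : k;                 // constant polynomial: k = 1
}
```

For example, x^6 + x^2 + 1 is a polynomial in x^2 since gcd(2, 6) = 2.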
The SYNAPS method
The SYNAPS [71] implementation IslBzInteger<QQ> [32] of the Descartes method
uses GNU-MP [44] for the integer additions required by de Casteljau's algorithm.
Otherwise, the method is a straightforward implementation of the Bernstein-bases
variant. A hard-coded limit of 96 on the recursion depth prevents the method from
isolating the roots of Mignotte polynomials of degrees greater than 80.
6.2.3 Performance results
We measured the execution times of the five implementations of the Descartes
method on the four processor architectures for input polynomials of various degrees
from the three classes of polynomials, see Section 1.5.4. The data are given in
Tables 6.1, 6.2, 6.3, and 6.4. Figures 6.4, 6.5, 6.6, and 6.7 plot the performance
gain obtained by the methods with respect to the SACLIB routine IPRRID for
input polynomials of various degrees. Gains by an order of magnitude are typical.
The largest speedup is by a factor of 24, and it is obtained by the Bernstein-based
variant of the Descartes method with register tiling on the Opteron processor for
the Chebyshev polynomial of degree 1000.
The data show that high performance can be achieved using a number of algo
rithmic devices:
[Plots omitted: one panel per architecture, speedup vs. degree.]
Figure 6.4: Speedup with respect to the monomial SACLIB implementation for random polynomials on four architectures.
[Plots omitted: panels for Intel Pentium EE, AMD Opteron, Sun UltraSPARC III, and Intel Pentium 4, each showing speedup vs. degree.]
Figure 6.5: Speedup with respect to the monomial SACLIB implementation for Chebyshev polynomials on four architectures.
[Plots omitted: panels for Intel Pentium EE, AMD Opteron, Sun UltraSPARC III, and Intel Pentium 4, each showing speedup vs. degree.]
Figure 6.6: Speedup with respect to the monomial SACLIB implementation for reduced Chebyshev polynomials on four architectures.
[Plots omitted: panels for Intel Pentium EE, AMD Opteron, Sun UltraSPARC III, and Intel Pentium 4, each showing speedup vs. degree.]
Figure 6.7: Speedup with respect to the monomial SACLIB implementation for Mignotte polynomials on four architectures.
1. The use of Bernstein bases can be viewed as a way to reduce the number of
n^3-operations per internal node of the recursion tree in the monomial variant
from 3 Taylor shifts to 1 de Casteljau's transformation. A comparison
between the SACLIB methods IPRRID and IPRRIDB shows that this approach
is successful—despite the fact that the initial transformation of the
input polynomial into the Bernstein-basis representation can increase the
coefficient length.
2. Hanrot et al. achieve a similar reduction in the number of n^3-operations by
partial execution of certain Taylor shifts. In addition, their early termination
test avoids all n^3-operations at certain leaf nodes. For reduced Chebyshev
polynomials this device reduces the number of complete Taylor shifts by 40%.
3. The use of the assembly-language integer addition routines of GNU-MP makes
the SYNAPS method faster than the SACLIB method IPRRIDB for polynomials
with long coefficients, and it contributes to making the method by Hanrot
et al. faster than the SACLIB method IPRRID.
4. The use of register tiling is orthogonal to devices (1) and (2). In fact, in an
additional experiment we replaced the complete Taylor shift in the method by
Hanrot et al. with our tiled Taylor shift and obtained an additional speed-up
by a factor of about 1.33.
The three implementations of the Bernstein-bases variant might be further improved
by incorporating the early termination test by Hanrot et al.
The data show that—with minor exceptions—for all classes of polynomials, the
best absolute computing times are achieved on the Opteron processor using the
Bernstein-bases variant of the Descartes method with register tiling.
Deg       SACLIB mon.  SACLIB Bern.  Tiled Bern.  Hanrot et al.  SYNAPS
Ran 100   8  5  4  4  11
Ran 200   81  44  20  25  44
Ran 300   311  148  57  103  115
Ran 400   787  360  128  194  232
Ran 500   1708  733  252  430  417
Ran 600   2257  1071  376  687  617
Ran 700   4416  1884  641  1058  982
Ran 800   8706  3309  1143  2361  1496
Ran 900   11679  4832  1610  2090  2132
Ran 1000  18155  6761  2274  3417  2741
Che 100   344  108  60  4  55
Che 200   6316  1712  608  52  764
Che 300   34606  8980  2604  240  3728
Che 400   115675  29409  8284  708  11664
Che 500   296434  74340  23748  1860  28625
Che 600   638131  156701  48267  3720  59842
Che 700   1207523  285578  90662  6540  116087
Che 800   2106963  492447  156935  10688  209550
Che 900   3455690  811711  261720  18033  351207
Che 1000  5388421  1320835  418654  27326  553752
Red 100   24  8  8  4  8
Red 200   348  108  64  52  60
Red 300   1924  544  236  240  268
Red 400   6372  1756  624  708  840
Red 500   16845  4372  1372  1860  2044
Red 600   35046  9092  2612  3720  4068
Red 700   67820  17237  4940  6540  7568
Red 800   116783  30049  8380  10688  12729
Red 900   193304  48583  15049  18033  20333
Red 1000  302510  75012  23685  27326  31286
Mig 100   4736  1288  760  728  N/A
Mig 200   143673  38642  17932  20721  N/A
Mig 300   1083263  290385  132196  167387  N/A
Mig 400   4536036  1213095  660173  721453  N/A
Mig 500   13777640  3682927  2470037  2204722  N/A
Mig 600   34211121  9110764  7533232  5468834  N/A
Table 6.1: Root isolation timings in milliseconds for Intel Pentium EE.
Deg       SACLIB mon.  SACLIB Bern.  Tiled Bern.  Hanrot et al.  SYNAPS
Ran 100   8  5  4  3  12
Ran 200   87  40  20  20  48
Ran 300   334  136  55  67  118
Ran 400   856  339  120  155  233
Ran 500   1885  716  230  331  410
Ran 600   2526  1055  347  479  605
Ran 700   4981  1864  563  868  942
Ran 800   9870  3258  943  1540  1402
Ran 900   13190  4754  1333  2247  1949
Ran 1000  20536  6645  1810  3048  2486
Che 100   364  96  60  8  50
Che 200   6700  1532  584  48  600
Che 300   37134  8268  2316  216  2995
Che 400   128004  27777  6792  584  9535
Che 500   330172  70956  15737  1524  23610
Che 600   704420  149721  29806  3172  48929
Che 700   1342976  285493  55752  5844  92164
Che 800   2341022  492447  95889  9973  161582
Che 900   3839292  811711  163023  17093  277293
Che 1000  5988138  1268569  245834  26258  458819
Red 100   20  8  8  8  8
Red 200   372  100  60  48  60
Red 300   2040  488  228  216  228
Red 400   6760  1572  588  584  664
Red 500   17545  3920  1296  1524  1620
Red 600   37574  8380  2312  3172  3276
Red 700   73672  16105  4224  5844  6184
Red 800   129424  27993  6676  9973  10453
Red 900   214697  46211  10456  17093  16885
Red 1000  334341  71400  15377  26258  25766
Mig 100   5092  1184  712  644  N/A
Mig 200   159362  36122  16069  20729  N/A
Mig 300   1209179  270337  128412  171159  N/A
Mig 400   5081274  1127943  617024  740666  N/A
Mig 500   15445227  3488480  2179611  2270234  N/A
Mig 600   38326948  8799792  5526025  5678407  N/A
Table 6.2: Root isolation timings in milliseconds for AMD Opteron.
Deg       SACLIB mon.  SACLIB Bern.  Tiled Bern.  Hanrot et al.  SYNAPS
Ran 100   17  15  13  11  69
Ran 200   143  107  66  51  247
Ran 300   531  352  184  139  560
Ran 400   1347  869  404  292  1027
Ran 500   2976  1824  768  600  1705
Ran 600   4020  2732  1130  873  2509
Ran 700   7922  4829  1892  1579  2695
Ran 800   15546  8482  3168  2815  5203
Ran 900   21250  12483  4526  4115  6986
Ran 1000  32647  17380  6096  5570  8750
Che 100   510  270  180  20  152
Che 200   8820  3940  1930  120  1168
Che 300   48470  20740  8100  450  5477
Che 400   168280  75840  23620  1220  18924
Che 500   439010  190800  55880  2630  49640
Che 600   940020  423620  110160  5260  105222
Che 700   1804440  741890  201320  9710  198604
Che 800   3149370  1278380  331230  16590  343092
Che 900   5172400  2076420  539700  29300  559625
Che 1000  8282860  3215740  800600  45970  870306
Red 100   40  30  20  20  40
Red 200   570  270  190  120  160
Red 300   2990  1280  750  450  520
Red 400   9710  4000  1990  1220  1290
Red 500   24980  10000  4320  2630  2950
Red 600   52990  20970  8150  5260  6130
Red 700   106540  40390  14800  9710  12030
Red 800   181930  70670  23700  16590  21150
Red 900   303660  117100  38490  29300  35110
Red 1000  473670  181870  57990  45970  54780
Mig 100   6790  3140  2540  1010  N/A
Mig 200   214470  95480  51320  36170  N/A
Mig 300   1619100  710360  339900  272690  N/A
Mig 400   7208290  3037420  1641610  1333880  N/A
Mig 500   24018450  10311010  6627000  6335480  N/A
Mig 600   67752340  28252170  23996430  21738150  N/A
Table 6.3: Root isolation timings in milliseconds for UltraSPARC III.
Deg       SACLIB mon.  SACLIB Bern.  Tiled Bern.  Hanrot et al.  SYNAPS
Ran 100   10  7  6  6  18
Ran 200   94  50  38  38  70
Ran 300   405  187  128  142  174
Ran 400   996  417  283  262  343
Ran 500   2468  945  687  562  602
Ran 600   2875  1334  1050  889  902
Ran 700   6108  2401  2108  1370  1414
Ran 800   14193  4814  4836  3037  2124
Ran 900   10946  4690  4967  2712  3014
Ran 1000  22100  7712  8871  4402  3874
Che 100   344  118  115  8  80
Che 200   6488  1829  1294  80  957
Che 300   35613  9455  6495  343  4675
Che 400   119135  30989  23292  942  14895
Che 500   305045  78069  68264  2365  37335
Che 600   650096  164264  160848  4691  78986
Che 700   1240013  310566  326062  8334  153477
Che 800   2167662  538754  594845  13734  277534
Che 900   3556937  885691  1015461  23189  462300
Che 1000  5550107  1372243  1633486  35406  729782
Red 100   20  10  12  8  13
Red 200   360  123  115  80  88
Red 300   1969  590  465  343  361
Red 400   6619  1874  1329  942  1060
Red 500   17166  4679  3156  2365  2565
Red 600   36299  9642  6594  4691  5175
Red 700   69815  18231  12932  8334  9730
Red 800   120908  31333  23556  13734  16459
Red 900   200134  51231  42960  23189  26815
Red 1000  311011  78886  68456  35406  41241
Mig 100   4851  1402  1627  985  N/A
Mig 200   147705  41103  47363  27469  N/A
Mig 300   1118406  307373  648885  212914  N/A
Mig 400   4691741  1282456  4383761  892932  N/A
Mig 500   14239431  3881311  16920331  2700368  N/A
Mig 600   35312569  9598795  51724931  6666565  N/A
Table 6.4: Root isolation timings in milliseconds for Intel Pentium 4.
7. Future research
Several questions remain:
1. Our modeling predicts a substantial performance gain from rescheduling the
register tile code to utilize more IEUs, see Section 4.4.1. We suggest, therefore,
that a search for the optimal u should be made part of automatic tuning and
code generation. For example, the Opteron should be able to perform 3 additions
per cycle since the processor has 3 IEUs.
2. We have experimented only with square register tiles. The effect of register
tile shape on performance should be analyzed, since another shape may yield
better performance. A search for the optimal tile shape should then be
incorporated into the automatic code generation and tuning process.
3. Applying Hanrot et al.'s early-termination tests to the Bernstein bases variant
of the Descartes method with register tiled de Casteljau's algorithm will likely
yield the fastest known implementation of the Descartes method.
4. Whether the sparse interlaced array representation for polynomials is a more
efficient data structure for the tile method than a noninterlaced representation
should be verified experimentally.
5. A larger register file in the CPU (i.e., with more available registers) would allow
larger register tiles. This would further reduce the cost of carry propagation.
6. Using a processor with wider registers (e.g., 128-bit instead of 64-bit) would
improve performance, since wider registers allow a larger radix. A larger radix
shortens the integer coefficients and further reduces the number of additions.
Bibliography
[1] Automatically Tuned Linear Algebra Software (ATLAS), http://math-atlas.sourceforge.net/.
[2] Advanced Micro Devices, Inc., AMD Eighth-Generation Processor Architecture, http://www.amd.com/, October 2001.
[3] ———, Processor Reference, http://www.amd.com/, June 2004.
[4] ———, Software Optimization Guide for AMD64 Processors, http://www.amd.com/, September 2005.
[5] A. V. Aho, K. Steiglitz, and J. D. Ullman, Evaluating polynomials at fixed set of points, SIAM Journal on Computing 4 (1975), no. 4, 533-539.
[6] Alfred V. Aho, John E. Hopcroft, and Jeffrey D. Ullman, The design and analysis of computer algorithms, Addison-Wesley Publishing Company, 1974.
[7] Randy Allen and Ken Kennedy, Optimizing compilers for modern architecture: A dependence-based approach, Morgan Kaufmann Publishers, New York, 2002.
[8] David H. Bailey, King Lee, and Horst D. Simon, Using Strassen's algorithm to accelerate the solution of linear systems, Journal of Supercomputing 4 (1990), no. 4, 357-371.
[9] Saugata Basu, Richard Pollack, and Marie-Françoise Roy, Algorithms in real algebraic geometry, Springer-Verlag, 2003.
[10] ———, Algorithms in real algebraic geometry, second ed., Springer-Verlag, 2006.
[11] D. Bini and V. Y. Pan, Polynomial and matrix computations, vol. 1, Birkhäuser, 1994.
[12] Jacques Borowczyk, Sur la vie et l'oeuvre de François Budan (1761-1840), Historia Mathematica 18 (1991), 129-157.
[13] British Standards Institute, The C++ Standard: Incorporating technical corrigendum no. 1, John Wiley and Sons, 2003.
139
[14] Randal E. Bryant and David R. O'Hallaron, Computer systems: A programmer's perspective, Prentice Hall, 2003.
[15] David Callahan, Steve Carr, and Ken Kennedy, Improving register allocation for subscripted variables, ACM SIGPLAN Conference on Programming Language Design and Implementation, ACM Press, 1990, pp. 53-65.
[16] J. Cohen and M. Roth, On the implementation of Strassen's fast multiplication algorithm, Acta Informatica 6 (1976), 341-355.
[17] G. E. Collins and J. R. Johnson, Quantifier elimination and the sign variation method for real root isolation, International Symposium on Symbolic and Algebraic Computation, ACM Press, 1989, pp. 264-271.
[18] G. E. Collins and R. Loos, Real zeros of polynomials, Computer Algebra: Symbolic and Algebraic Computation (B. Buchberger, G. E. Collins, and R. Loos, eds.), Springer-Verlag, 2nd ed., 1982, pp. 83-94.
[19] G. E. Collins and R. G. K. Loos, Specifications and index of SAC-2 algorithms, Tech. Report WSI-90-4, Wilhelm-Schickard-Institut für Informatik, Universität Tübingen, 1990.
[20] George E. Collins, The computing time of the Euclidean algorithm, SIAM Journal on Computing 3 (1974), no. 1, 1-10.
[21] George E. Collins and Alkiviadis G. Akritas, Polynomial real root isolation using Descartes' rule of signs, Proceedings of the 1976 ACM Symposium on Symbolic and Algebraic Computation (R. D. Jenks, ed.), ACM Press, 1976, pp. 272-275.
[22] George E. Collins et al., SACLIB User's Guide, Tech. Report 93-19, Research Institute for Symbolic Computation, RISC-Linz, Johannes Kepler University, A-4040 Linz, Austria, 1993.
[23] J. W. Cooley and J. W. Tukey, An algorithm for the machine calculation of complex Fourier series, Mathematics of Computation 19 (1965), 297-301.
[24] Intel Corporation, The Intel C++ Compiler, http://www.intel.com/.
[25] ———, A detailed look inside the Intel NetBurst micro-architecture of the Intel Pentium 4 processor, http://www.intel.com/, November 2000.
[26] ———, The IA-32 Intel Architecture Optimization: Reference Manual, http://www.intel.com/, 2004.
140
[27] ———, Intel Pentium D Processor 800 Sequence: Datasheet, 2006.
[28] D. Curry, Using C on the UNIX System, 1st ed., O'Reilly and Associates, Inc., 1989.
[29] K. Dowd and C. R. Severance, High performance computing, 2nd ed., O'Reilly and Associates, Inc., Sebastopol, CA, 1998.
[30] Jean-Guillaume Dumas, Pascal Giorgi, and Clement Pernet, FFPACK: Finite field linear algebra package, International Symposium on Symbolic and Algebraic Computation, ACM Press, 2004, pp. 119-126.
[31] Arno Eigenwillig, Vikram Sharma, and Chee K. Yap, Almost tight recursion tree bounds for the Descartes method, International Symposium on Symbolic and Algebraic Computation, ACM Press, 2006, pp. 71-78.
[32] I. Z. Emiris, B. Mourrain, and E. Tsigaridas, Real algebraic numbers: Complexity analysis and experimentations, Research Report 5897, INRIA, 2006.
[33] Gerald Farin, Curves and surfaces for computer aided geometric design, Academic Press, 1988.
[34] Richard Fateman, Comparing the speed of programs for sparse polynomial multiplication, ACM SIGSAM Bulletin 37 (2003), no. 1, 4-15.
[35] , Memory cache and Lisp: Faster list processing via automatically rearranging memory, ACM SIGSAM Bulletin 37 (2003), no. 4, 109-116.
[36] Akpodigha Filatei, Xin Li, Marc Moreno Maza, and Eric Schost, Implementation techniques for fast polynomial arithmetic in a high-level programming environment, International Symposium on Symbolic and Algebraic Computation, ACM Press, 2006, pp. 93-100.
[37] M. Fowler, Yet another optimization article, IEEE Software 19 (2002), no. 3, 20-21.
[38] M. Frigo and S. G. Johnson, The design and implementation of FFTW3, Proceedings of the IEEE 93 (2005), no. 2, 216-231.
[39] Pierrick Gaudry, Assembly support for gmp on amd64, http://www.loria.fr/~gaudry/mpn_AMD64/.
[40] GNU Compiler Collection, http://gcc.gnu.org/.
[41] Jürgen Gerhard, Modular algorithms in symbolic summation and symbolic integration, Lecture Notes in Computer Science, vol. 3218, Springer-Verlag, 2004.
[42] ———, Personal communication, 2005.
[43] Torbjörn Granlund, GNU MP: The GNU Multiple Precision Arithmetic Library, Swox AB, September 2004, Edition 4.1.4.
[44] ———, GNU MP: The GNU Multiple Precision Arithmetic Library, Swox AB, March 2006, Edition 4.2.
[45] Guillaume Hanrot, Fabrice Rouillier, Paul Zimmermann, and Sylvain Petitjean, Uspensky's algorithm, http://www.loria.fr/equipes/vegas/qi/usp/usp.c, 2004.
[46] John L. Hennessy, David A. Patterson, and David Goldberg, Computer architecture: A quantitative approach, 3rd ed., Morgan Kaufmann, 2002.
[47] Karin Högstedt, Larry Carter, and Jeanne Ferrante, Determining the idle time of a tiling, ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, ACM Press, 1997, pp. 160-173.
[48] ———, Selecting tile shape for minimal execution time, ACM Symposium on Parallel Algorithms and Architectures, ACM Press, 1999, pp. 201-211.
[49] Tim Horel and Gary Lauterbach, UltraSPARC-III: Designing third-generation 64-bit performance, IEEE Micro 19 (1999), no. 3, 73-85.
[50] Steven Huss-Lederman, Elaine M. Jacobson, Anna Tsao, Thomas Turnbull, and Jeremy R. Johnson, Implementation of Strassen's algorithm for matrix multiplication, Supercomputing '96: Proceedings of the 1996 ACM/IEEE conference on Supercomputing, IEEE Computer Society, 1996, p. 32.
[51] Innovative Computing Laboratory, PAPI: Performance Application Programming Interface, http://icl.cs.utk.edu/PAPI.
[52] International Standards Organization, http://www.iso.org, ISO/IEC 14882:2003: Programming languages—C++, 2003.
[53] M. Jimenez, J. M. Llaberia, A. Fernandez, and E. Morancho, A general algorithm for tiling the register level, International Conference on Supercomputing, ACM Press, 1998, pp. 133-140.
142
[54] Marta Jimenez, Jose M. Llaberia, and Agustin Fernandez, Register tiling in nonrectangular iteration spaces, ACM Transactions on Programming Languages and Systems 24 (2002), no. 4, 409-453.
[55] ———, A cost-effective implementation of multilevel tiling, IEEE Transactions on Parallel and Distributed Systems 14 (2003), no. 10, 1006-1020.
[56] J. R. Johnson, Algorithms for polynomial real root isolation, Technical research report OSU-CISRC-8/91-TR21, The Ohio State University, Department of Computer and Information Science, 1991.
[57] ———, Algorithms for polynomial real root isolation, Quantifier Elimination and Cylindrical Algebraic Decomposition (B. F. Caviness and J. R. Johnson, eds.), Springer-Verlag, 1998, pp. 269-299.
[58] Jeremy R. Johnson, Werner Krandick, Kevin Lynch, David G. Richardson, and Anatole D. Ruslanov, High-performance implementations of the Descartes method, International Symposium on Symbolic and Algebraic Computation (J.-G. Dumas, ed.), ACM Press, 2006, pp. 154-161.
[59] Jeremy R. Johnson, Werner Krandick, and Anatole D. Ruslanov, Architecture-aware classical Taylor shift by 1, International Symposium on Symbolic and Algebraic Computation (M. Kauers, ed.), ACM Press, 2005, pp. 200-207.
[60] A. Karatsuba and Yu Ofman, Multiplication of multidigit numbers on automata, Sov. Phys. Dokl. 7 (1962), 595-596.
[61] I. Kodukula, N. Ahmed, and K. Pingali, Data-centric multi-level blocking, ACM SIGPLAN Conference on Programming Language Design and Implementation, ACM Press, 1997, pp. 346-357.
[62] Werner Krandick, Isolierung reeller Nullstellen von Polynomen, Wissenschaftliches Rechnen (J. Herzberger, ed.), Akademie Verlag, Berlin, 1995, pp. 105-154.
[63] Werner Krandick and Kurt Mehlhorn, New bounds for the Descartes method, Journal of Symbolic Computation 41 (2006), no. 1, 49-66.
[64] Jeffrey M. Lane and R. F. Riesenfeld, Bounds on a polynomial, BIT 21 (1981), no. 1, 112-117.
[65] Maplesoft, Maple 9: Learning guide, 2003.
[66] Robert T. Moenck, Practical fast polynomial multiplication, Proceedings of the 1976 ACM Symposium on Symbolic and Algebraic Computation, ACM Press, 1976, pp. 136-148.
[67] M. B. Monagan, K. O. Geddes, K. M. Heal, G. Labahn, S. M. Vorkoetter, J. Mc-Carron, and P. DeMarco, Maple 9: Advanced programming guide, Maplesoft, 2003.
[68] ———, Maple 9: Introductory programming guide, Maplesoft, 2003.
[69] Peter L. Montgomery, Five, six, and seven-term Karatsuba-like formulae, IEEE Transactions on Computers 54 (2005), no. 3, 899-908.
[70] J. Moura, M. Püschel, J. Dongarra, and D. Padua (eds.), Special issue on program generation, optimization, and adaptation, Proceedings of the IEEE, vol. 93, February 2005.
[71] B. Mourrain, J. P. Pavone, P. Trebuchet, and E. Tsigaridas, SYNAPS: A library for symbolic-numeric computation, Software presentation, MEGA 2005, Sardinia, Italy, May 2005, http://www-sop.inria.fr/galaad/logiciels/synaps/.
[72] B. Mourrain, M. N. Vrahatis, and J. C. Yakoubsohn, On the complexity of isolating real roots and computing with certainty the topological degree, Journal of Complexity 18 (2002), no. 2, 612-640.
[73] Bernard Mourrain, Fabrice Rouillier, and Marie-Françoise Roy, The Bernstein basis and real root isolation, Combinatorial and Computational Geometry (J. E. Goodman, J. Pach, and E. Welzl, eds.), Mathematical Sciences Research Institute Publications, vol. 52, Cambridge University Press, 2005, pp. 459-478.
[74] A. M. Ostrowski, Note on Vincent's theorem, Annals of Mathematics, Second Series 52 (1950), no. 3, 702-707, Reprinted in: Alexander Ostrowski: Collected Mathematical Papers, vol. 1, Birkhauser Verlag, 1983, pages 728-733.
[75] M. S. Paterson and L. Stockmeyer, On the number of nonscalar multiplications necessary to evaluate polynomials, SIAM Journal on Computing 2 (1973), 60-66.
[76] M. Püschel, J. M. F. Moura, J. R. Johnson, D. Padua, M. Veloso, B. W. Singer, J. Xiong, F. Franchetti, A. Gacic, Y. Voronenko, K. Chen, R. W. Johnson, and N. Rizzolo, SPIRAL: Code generation for DSP transforms, Proceedings of the IEEE, special issue on "Program Generation, Optimization, and Adaptation" 93 (2005), no. 2, 232-275.
[77] David G. Richardson and Werner Krandick, Compiler-enforced memory semantics in the SACLIB computer algebra library, International Workshop on Computer Algebra in Scientific Computing (V. G. Ganzha, E. W. Mayr, and E. V. Vorozhtsov, eds.), Lecture Notes in Computer Science, vol. 3718, Springer-Verlag, 2005, pp. 330-343.
[78] Fabrice Rouillier and Paul Zimmermann, Efficient isolation of a polynomial's real roots, Journal of Computational and Applied Mathematics 162 (2004), 33-50.
[79] David Saunders and Zhendong Wan, Smith normal form of dense integer matrices fast algorithms into practice, International Symposium on Symbolic and Algebraic Computation, ACM Press, 2004, pp. 274-281.
[80] A. Schönhage, A. F. W. Grotefeld, and E. Vetter, Fast algorithms, B.I. Wissenschaftsverlag, Mannheim, 1994.
[81] A. Schönhage and V. Strassen, Schnelle Multiplikation großer Zahlen, Computing 7 (1971), 281-292.
[82] Victor Shoup, NTL: A Library for doing Number Theory, http://www.shoup.net/ntl.
[83] ———, A new polynomial factorization algorithm and its implementation, Journal of Symbolic Computation 20 (1995), no. 4, 363-397.
[84] V. Strassen, Gaussian elimination is not optimal, Numer. Math. 13 (1969), 354-356.
[85] Sun Microsystems, Sun Studio Collection, http://www.sun.com/.
[86] ———, UltraSPARC III Cu: User's manual, Ver. 2.2.1, http://www.sun.com/, 2004.
[87] J. V. Uspensky, Theory of equations, McGraw-Hill Book Company, Inc., 1948.
[88] Joachim von zur Gathen, Functional decomposition of polynomials: the tame case, Journal of Symbolic Computation 9 (1990), 281-299.
[89] Joachim von zur Gathen and Jürgen Gerhard, Fast algorithms for Taylor shifts and certain difference equations, International Symposium on Symbolic and Algebraic Computation (W. W. Küchlin, ed.), ACM Press, 1997, pp. 40-47.
[90] ———, Modern computer algebra, 2nd ed., Cambridge University Press, 2003.
[91] Larry Wall, Tom Christiansen, and Jon Orwant, Programming Perl, 3rd ed., O'Reilly, 2000.
[92] R. C. Whaley and A. Petitet, Minimizing development and maintenance costs in supporting persistently optimized BLAS, Software: Practice and Experience 35 (2005), no. 2, 101-121.
[93] R. C. Whaley, A. Petitet, and J. J. Dongarra, Automated empirical optimization of software and the ATLAS project, Parallel Computing 27 (2001), no. 1-2, 3-35.
[94] K. Yotov, X. Li, G. Ren, M. Garzaran, D. Padua, K. Pingali, and P. Stodghill, Is search really necessary to generate high-performance BLAS?, Proceedings of the IEEE 93 (2005), no. 2, 358-386.
[95] K. Yotov, K. Pingali, and P. Stodghill, Automatic measurement of memory hierarchy parameters, SIGMETRICS '05: Proceedings of the 2005 ACM SIGMETRICS international conference on Measurement and modeling of computer systems (New York, NY, USA), ACM Press, 2005, pp. 181-192.
[96] ———, X-ray: A tool for automatic measurement of hardware parameters, Second International Conference on the Quantitative Evaluation of Systems 2005, IEEE Computer Society, 2005, pp. 168-177.
[97] Paul Zimmermann, Personal communication, 2006.
[98] Dan Zuras, More on multiplying and squaring large integers, IEEE Transactions on Computers 43 (1994), no. 8, 899-908.
Vita
Anatole D. Ruslanov was born in St. Petersburg, Russia. He emigrated to the
United States in 1979 and became a US citizen in 1986. He attended the University
of Pennsylvania (B.A. in Mathematics) and Drexel University (M.S. and Ph.D. in
Computer Science). Dr. Ruslanov is currently an assistant professor of computer
science at SUNY Fredonia's Department of Computer and Information Sciences.
His research interests include algorithm engineering, high-performance computing,
automated performance tuning, computer architecture, performance analysis and
benchmarking, symbolic computation, and algorithms for VLSI design automation.