
Fundamenta Informaticae 125 (2013) 71–94 71

DOI 10.3233/FI-2013-853

IOS Press

A Complete Generalization of Atkin’s Square Root Algorithm

Armand Stefan Rotaru∗

Institute of Computer Science, Romanian Academy

Carol I no. 8, 700505 Iasi, Romania

[email protected]

Sorin Iftene

Department of Computer Science, Alexandru Ioan Cuza University

General Berthelot no. 16, 700483 Iasi, Romania

[email protected]

Abstract. Atkin’s algorithm [2] for computing square roots in $\mathbb{Z}_p^*$, where $p$ is a prime such that $p \equiv 5 \pmod 8$, has been extended by Muller [15] for the case $p \equiv 9 \pmod{16}$. In this paper we extend Atkin’s algorithm to the general case $p \equiv 2^s + 1 \pmod{2^{s+1}}$, for any $s \ge 2$, thus providing a complete solution for the case $p \equiv 1 \pmod 4$. Complexity analysis and comparisons with other methods are also provided.

Keywords: Square Roots, Efficient Computation, Complexity

1. Introduction

Computing square roots in finite fields is a fundamental problem in number theory, with major applications related to primality testing [3], factorization [17] or elliptic point compression [10]. In this paper we consider the problem of finding square roots in $\mathbb{Z}_p^*$, where $p$ is an odd prime. We remark that, using Hensel’s lemma and the Chinese remainder theorem, the problem of finding square roots modulo any composite number can be reduced to the case of a prime modulus by considering its prime factorization (for more details, see [4]).

∗Address for correspondence: Institute of Computer Science, Romanian Academy, Carol I no. 8, 700505 Iasi, Romania

72 A.S. Rotaru and S. Iftene / A Complete Generalization of Atkin’s Square Root Algorithm

According to Bach and Shallit [4, Notes on Chapter 7, page 194] and Lemmermeyer [13, Exercise 1.16, page 29], Lagrange was the first to derive an explicit formula for the case $p \equiv 3 \pmod 4$, in 1769. According to the same sources ([4, Exercise 1, page 188] and [13, Exercise 1.17, page 29]), the case $p \equiv 5 \pmod 8$ was solved by Legendre in 1785. Atkin [2] also found a simple solution for the case $p \equiv 5 \pmod 8$ in 1992. In 2004, Muller [15] extended Atkin’s algorithm to the case $p \equiv 9 \pmod{16}$ and left the further development of Atkin’s algorithm as an open problem. In this paper we extend Atkin’s algorithm to the case $p \equiv 2^s + 1 \pmod{2^{s+1}}$, for any $s \ge 2$, thus providing a complete solution for the case $p \equiv 1 \pmod 4$. Muller’s algorithm and our generalization use quadratic non-residues and are therefore probabilistic algorithms. We remark that several deterministic approaches for computing square roots modulo a prime $p$ have also been presented in the literature. Schoof [19] proposed an impractical deterministic algorithm of complexity $O((\log_2 p)^9)$. Sze [21] has recently developed a deterministic algorithm for computing square roots which is efficient (its complexity is $O((\log_2 p)^2)$) only for certain primes $p$.

The paper is structured as follows. Section 2 is dedicated to some mathematical preliminaries on quadratic residues and square roots. Section 3 presents Atkin’s algorithm and its extension (Muller’s algorithm), both based on computing square roots of $-1$ modulo $p$. We present our generalization in Section 4. Its performance, efficient implementation and comparisons with other methods are presented in Section 5. In the last section we briefly discuss the conclusions of our paper and the possibility of adapting our algorithm for other finite fields.

2. Mathematical Background

In this section we present some basic facts on quadratic residues and square roots. For simplicity of notation, from this point forward we omit the modular reduction, but the reader must be aware that all computations are performed modulo $p$ unless explicitly stated otherwise.

Let $p$ be a prime and $a \in \mathbb{Z}_p^*$. We say that $a$ is a quadratic residue modulo $p$ if there exists $b \in \mathbb{Z}_p^*$ with the property $a = b^2$. Otherwise, $a$ is a quadratic non-residue modulo $p$. It is easy to see that the product of two residues is a residue and that the product of a residue with a non-residue is a non-residue.

If $b^2 = a$ then $b$ will be referred to as a square root of $a$ (modulo $p$) and we will simply denote this fact by $b = \sqrt{a}$. We remark that if $a$ is a quadratic residue modulo $p$, $p$ prime, then $a$ has exactly two square roots: if $b$ is a square root of $a$, then $p - b$ is the other one. In particular, $1$ has the square roots $1$ and $-1$ (in this case, $-1$ will be regarded as being $p - 1$) or, equivalently, $a^2 = 1 \Leftrightarrow (a = 1 \lor a = -1)$.

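The Legendre symbol of $a$ modulo $p$, denoted as $\left(\frac{a}{p}\right)$, is defined to be equal to $\pm 1$ depending on whether $a$ is a quadratic residue modulo $p$. More exactly,

\[
\left(\frac{a}{p}\right) =
\begin{cases}
1, & \text{if } a \text{ is a quadratic residue modulo } p; \\
-1, & \text{otherwise.}
\end{cases}
\]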

Euler’s criterion states that, for any prime $p$ and $a \in \mathbb{Z}_p^*$, the following relation holds:

\[
a^{\frac{p-1}{2}} = \left(\frac{a}{p}\right).
\]


Euler’s criterion provides a method of computing the Legendre symbol of $a$ modulo $p$ using an exponentiation modulo $p$, whose complexity is $O((\log_2 p)^3)$. There are faster methods for evaluating the Legendre symbol; see, for example, [8], which presents algorithms of complexity $O\left(\frac{(\log_2 p)^2}{\log_2 \log_2 p}\right)$ for computing the Jacobi symbol (the Jacobi symbol is a generalization of the Legendre symbol to arbitrary moduli).
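As a small illustration of Euler’s criterion, the following Python sketch evaluates the Legendre symbol with a single modular exponentiation. The function name `legendre_symbol` is our own choice for this example, and the code assumes that $p$ is an odd prime and that $a$ is not divisible by $p$.

```python
def legendre_symbol(a: int, p: int) -> int:
    """Legendre symbol (a/p), assuming p is an odd prime and p does not divide a."""
    # Euler's criterion: a^((p-1)/2) is congruent to 1 or to -1 (that is, p - 1).
    r = pow(a, (p - 1) // 2, p)
    return -1 if r == p - 1 else r


if __name__ == "__main__":
    # 2 is a square modulo 7 (3^2 = 9 = 2), while 3 is not.
    print(legendre_symbol(2, 7), legendre_symbol(3, 7))
```

The three-argument form of `pow` performs the modular exponentiation, so the cost is dominated by that single call.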

Another useful property is that $\left(\frac{2}{p}\right) = (-1)^{\frac{p^2-1}{8}}$, which implies that $2$ is a quadratic residue modulo $p$ if and only if $p \equiv \pm 1 \pmod 8$.

If $p$ is prime, $p \equiv 3 \pmod 4$, and $a \in \mathbb{Z}_p^*$ is a quadratic residue modulo $p$, then $b = a^{\frac{p+1}{4}}$ is a square root of $a$ modulo $p$. Indeed, in this case,

\[
b^2 = a^{\frac{p+1}{2}} = a \cdot a^{\frac{p-1}{2}} = a \cdot \left(\frac{a}{p}\right) = a \cdot 1 = a.
\]

Thus, in this case, finding square roots modulo $p$ requires only a single exponentiation modulo $p$. In the next sections we will focus on the case $p$ prime, $p \equiv 1 \pmod 4$.
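The formula for the case $p \equiv 3 \pmod 4$ can be sketched in a few lines of Python. The function name `sqrt_mod_p3mod4` and the explicit residue check are assumptions made for this illustration; they are not part of the original presentation.

```python
def sqrt_mod_p3mod4(a: int, p: int) -> int:
    """Return b with b*b = a (mod p), assuming p is a prime with p = 3 (mod 4)."""
    if p % 4 != 3:
        raise ValueError("this formula requires p = 3 (mod 4)")
    # Single exponentiation: b = a^((p+1)/4) mod p.
    b = pow(a, (p + 1) // 4, p)
    # If a is a non-residue the formula yields a root of -a, so we check the result.
    if b * b % p != a % p:
        raise ValueError("a is not a quadratic residue modulo p")
    return b


if __name__ == "__main__":
    b = sqrt_mod_p3mod4(2, 23)
    print(b, b * b % 23)
```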

3. Square Root Algorithms based on Computing $\sqrt{-1}$

In this section we present two methods for computing square roots for the cases $p \equiv 5 \pmod 8$ and $p \equiv 9 \pmod{16}$, both based on computing square roots of $-1$ modulo $p$.

3.1. Atkin’s Algorithm

Let $p$ be a prime such that $p \equiv 5 \pmod 8$ and $a$ a quadratic residue modulo $p$. Atkin’s idea [2] is to express $\sqrt{a}$ as $\sqrt{a} = \alpha a(\beta - 1)$, where $\beta^2 = -1$ and $2a\alpha^2 = \beta$. Indeed, in this case, $(\alpha a(\beta - 1))^2 = a(-2a\alpha^2\beta) = a(-\beta^2) = a$. Moreover, in order to easily determine $\alpha$, it will be convenient that $\beta$ has the form $\beta = (2a)^k$, with $k$ odd. Thus, the major challenge is to find $\sqrt{-1}$ of the mentioned form.

By Euler’s criterion, the relation $(2a)^{\frac{p-1}{2}} = -1$ holds ($a$ is a quadratic residue, but $2$ is a quadratic non-residue, therefore $2a$ is a quadratic non-residue), so we can choose $\beta$ as $\beta = (2a)^{\frac{p-1}{4}}$ and $\alpha$ as

\[
\alpha = (2a)^{\frac{\frac{p-1}{4} - 1}{2}} = (2a)^{\frac{p-5}{8}}.
\]

The resulting algorithm is presented in Figure 1.

Atkin’s Algorithm($p$, $a$)
input: $p$ prime such that $p \equiv 5 \pmod 8$, $a \in \mathbb{Z}_p^*$ a quadratic residue;
output: $b$, a square root of $a$ modulo $p$;
begin
  1. $\alpha := (2a)^{\frac{p-5}{8}}$;
  2. $\beta := 2a\alpha^2$;
  3. $b := \alpha a(\beta - 1)$;
  4. return $b$
end.

Figure 1: Atkin’s algorithm


Atkin’s algorithm requires one exponentiation (in Step 1) and four multiplications (two multiplications in Step 2 and two multiplications in Step 3).
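The steps of Figure 1 translate almost line by line into Python. The sketch below is only illustrative: the function name `atkin_sqrt` is ours, and the code assumes that the caller supplies a prime $p \equiv 5 \pmod 8$ and a quadratic residue $a$.

```python
def atkin_sqrt(a: int, p: int) -> int:
    """Square root of a quadratic residue a modulo a prime p = 5 (mod 8)."""
    if p % 8 != 5:
        raise ValueError("Atkin's algorithm requires p = 5 (mod 8)")
    alpha = pow(2 * a, (p - 5) // 8, p)   # Step 1: the only exponentiation
    beta = 2 * a * alpha * alpha % p      # Step 2: beta is a square root of -1
    return alpha * a * (beta - 1) % p     # Step 3


if __name__ == "__main__":
    b = atkin_sqrt(3, 13)
    print(b, b * b % 13)
```

For $p = 13$ and $a = 3$ the intermediate values are $\alpha = 6$ and $\beta = 8$ (indeed $8^2 = 64 \equiv -1$), which gives $b = 9$ and $9^2 = 81 \equiv 3$.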

3.2. Muller’s Algorithm

Let $p$ be a prime such that $p \equiv 9 \pmod{16}$ and $a$ a quadratic residue modulo $p$. Muller [15] has extended Atkin’s algorithm by expressing $\sqrt{a}$ as $\sqrt{a} = \alpha a d(\beta - 1)$, where $\beta^2 = -1$ and $2ad^2\alpha^2 = \beta$. Indeed, in this case, $(\alpha a d(\beta - 1))^2 = a(-2ad^2\alpha^2\beta) = a(-\beta^2) = a$. Moreover, in order to easily determine $\alpha$, it will be convenient that $\beta$ has the form $\beta = (2ad^2)^k$, with $k$ odd.

By Euler’s criterion, the relation $(2a)^{\frac{p-1}{2}} = 1$ holds ($a$ and $2$ are quadratic residues, therefore $2a$ is a quadratic residue). We have two cases:

(I) $(2a)^{\frac{p-1}{4}} = -1$: in this case we can choose $\beta$ as $\beta = (2a)^{\frac{p-1}{8}}$ and $\alpha$ as $\alpha = (2a)^{\frac{\frac{p-1}{8} - 1}{2}} = (2a)^{\frac{p-9}{16}}$ ($d = 1$);

(II) $(2a)^{\frac{p-1}{4}} = 1$: in this case we need a quadratic non-residue $d$. By Euler’s criterion, $d^{\frac{p-1}{2}} = -1$ and, thus, $(2ad^2)^{\frac{p-1}{4}} = -1$, so we can choose $\beta$ as $\beta = (2ad^2)^{\frac{p-1}{8}}$ and $\alpha$ as $\alpha = (2ad^2)^{\frac{\frac{p-1}{8} - 1}{2}} = (2ad^2)^{\frac{p-9}{16}}$.

The above presentation is in fact a slightly modified variant of the original one: for Case (I), Muller used an arbitrary residue $d$. Kong et al. [11] have remarked that using $d = 1$ in this case leads to an important improvement of the performance of Muller’s original algorithm, by requiring only one exponentiation for half of the squares in $\mathbb{Z}_p^*$ (Case (I)) and two for the rest (Case (II)). The resulting algorithm is presented in Figure 2.

In case $(2a)^{\frac{p-1}{4}} = -1$, Muller’s algorithm requires one exponentiation (Step 1) and five multiplications (two multiplications in Step 2, one multiplication in Step 3 and two multiplications in Step 4). In case $(2a)^{\frac{p-1}{4}} = 1$, Muller’s algorithm, besides the operations in Steps 1-3, requires one more exponentiation (Step 8) and eight more multiplications (one multiplication in Step 8, four multiplications in Step 9 and three multiplications in Step 10). Additionally, Step 7 requires, on average, two quadratic character evaluations (generate $d \in \mathbb{Z}_p^*$ at random until $\left(\frac{d}{p}\right) = -1$; because half of the elements are quadratic non-residues, two generations are required on average). It is interesting to remark that Ankeny [1] has proven that, assuming the Extended Riemann Hypothesis (ERH), the least quadratic non-residue modulo $p$ is in $O((\log_2 p)^2)$. As a consequence, in this case, the presented probabilistic algorithm for finding a quadratic non-residue can be transformed into a deterministic polynomial time algorithm of complexity $O((\log_2 p)^4)$.

4. A Complete Generalization of Atkin’s Square Root Algorithm

In this section we extend Atkin’s algorithm to the case $p \equiv 2^s + 1 \pmod{2^{s+1}}$, for any $s \ge 2$, thus providing a complete solution for the case $p \equiv 1 \pmod 4$. For any prime $p$ with $p \equiv 1 \pmod 4$, we can express $p - 1$ as $p - 1 = 2^s t$, where $s \ge 2$ and $t$ is odd. If we write $t$ as $t = 2t' + 1$, we obtain that $p = 2^{s+1}t' + 2^s + 1$, which implies that $p \equiv 2^s + 1 \pmod{2^{s+1}}$.
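The decomposition $p - 1 = 2^s t$ is obtained by repeatedly dividing by two. A minimal Python sketch, with the hypothetical helper name `split_p_minus_one`, also lets us check the congruence $p \equiv 2^s + 1 \pmod{2^{s+1}}$ on concrete primes:

```python
def split_p_minus_one(p: int) -> tuple[int, int]:
    """Return (s, t) with p - 1 = 2**s * t and t odd."""
    s, t = 0, p - 1
    while t % 2 == 0:
        s, t = s + 1, t // 2
    return s, t


if __name__ == "__main__":
    for p in (13, 41, 17, 12289):
        s, t = split_p_minus_one(p)
        # Each prime p = 1 (mod 4) falls into the class 2^s + 1 modulo 2^(s+1).
        print(p, s, t, p % 2 ** (s + 1) == 2 ** s + 1)
```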


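Muller’s Algorithm($p$, $a$)
input: $p$ prime such that $p \equiv 9 \pmod{16}$, $a \in \mathbb{Z}_p^*$ a quadratic residue;
output: $b$, a square root of $a$ modulo $p$;
begin
  1. $\alpha := (2a)^{\frac{p-9}{16}}$;
  2. $\beta := 2a\alpha^2$;
  3. if $\beta^2 = -1$
  4. then $b := \alpha a(\beta - 1)$;
  5. else
  6. begin
  7.   generate $d$, a quadratic non-residue modulo $p$;
  8.   $\alpha := \alpha d^{\frac{p-9}{8}}$;
  9.   $\beta := 2ad^2\alpha^2$;
  10.  $b := \alpha a d(\beta - 1)$;
  11. end
  12. return $b$
end.

Figure 2: Muller’s algorithm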
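Figure 2 can be sketched in Python as follows. This is an illustrative transcription rather than a reference implementation: the names `muller_sqrt` and `find_non_residue` are ours, and for reproducibility the non-residue is found by trying $d = 2, 3, \ldots$ in order instead of sampling at random as in the probabilistic description. The input $a$ is assumed to be a quadratic residue.

```python
def find_non_residue(p: int) -> int:
    """Smallest quadratic non-residue modulo an odd prime p (deterministic search)."""
    d = 2
    while pow(d, (p - 1) // 2, p) != p - 1:
        d += 1
    return d


def muller_sqrt(a: int, p: int) -> int:
    """Square root of a quadratic residue a modulo a prime p = 9 (mod 16)."""
    if p % 16 != 9:
        raise ValueError("Muller's algorithm requires p = 9 (mod 16)")
    alpha = pow(2 * a, (p - 9) // 16, p)             # Step 1
    beta = 2 * a * alpha * alpha % p                 # Step 2
    if beta * beta % p == p - 1:                     # Step 3: Case (I), d = 1
        return alpha * a * (beta - 1) % p            # Step 4
    d = find_non_residue(p)                          # Step 7: Case (II)
    alpha = alpha * pow(d, (p - 9) // 8, p) % p      # Step 8
    beta = 2 * a * d * d * alpha * alpha % p         # Step 9
    return alpha * a * d * (beta - 1) % p            # Step 10


if __name__ == "__main__":
    b = muller_sqrt(2, 41)
    print(b, b * b % 41)
```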

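We will express $\sqrt{a}$ as $\sqrt{a} = \alpha a(\beta - 1)d^{\mathit{norm}}$, where $\beta^2 = -1$, $d$ is a quadratic non-residue modulo $p$, $\mathit{norm} \ge 0$, and $2ad^{2\cdot \mathit{norm}}\alpha^2 = \beta$. Indeed, in this case, $(\alpha a(\beta - 1)d^{\mathit{norm}})^2 = a(-2ad^{2\cdot \mathit{norm}}\alpha^2\beta) = a(-\beta^2) = a$. Moreover, in order to easily determine $\alpha$, it will be convenient that $\beta$ has the form $\beta = (2ad^{2\cdot \mathit{norm}})^k$, with $k$ odd.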

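The key point of our generalization is

Base Case: $(2ad^{2\cdot \mathit{norm}})^{\frac{p-1}{2^{s-1}}} = -1$, for some $\mathit{norm} \ge 0$.

In this case, because $\frac{p-1}{2^s}$ is odd, we can choose $\beta$ as $\beta = (2ad^{2\cdot \mathit{norm}})^{\frac{p-1}{2^s}}$ and $\alpha$ as

\[
\alpha = (2ad^{2\cdot \mathit{norm}})^{\frac{\frac{p-1}{2^s} - 1}{2}} = (2ad^{2\cdot \mathit{norm}})^{\frac{p-(2^s+1)}{2^{s+1}}} = (2ad^{2\cdot \mathit{norm}})^{\frac{t-1}{2}}.
\]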

In contrast to Muller’s impractical attempt at further generalizing Atkin’s approach ([15, Remark 2]), we focus on finding an adequate value for $\mathit{norm}$, the exponent of $d$, such that the Base Case is satisfied.

In order to derive the value of $\mathit{norm}$, we use the following results:

Theorem 4.1. Let $p$ be an odd prime, $p - 1 = 2^s t$ ($s \ge 3$, $t$ odd), $a$ a quadratic residue modulo $p$, and $d$ a quadratic non-residue modulo $p$. Then, for all $1 \le i \le s - 1$, the following statement holds:

\[
(\exists\, \mathit{norm}' \in \mathbb{N})\left((2ad^{2\cdot \mathit{norm}'})^{\frac{p-1}{2^i}} = 1\right) \Rightarrow (\exists\, \mathit{norm} \in \mathbb{N})\left((2ad^{2\cdot \mathit{norm}})^{\frac{p-1}{2^{s-1}}} = -1\right)
\]


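Proof: We use induction on $i$.

Initial Case - For $i = s - 1$ the reasoning is very simple. If there is a natural number $\mathit{norm}'$ such that $(2ad^{2\cdot \mathit{norm}'})^{\frac{p-1}{2^{s-1}}} = 1$ then, using that $d^{\frac{p-1}{2}} = -1$ (or, equivalently, $(d^{2^{s-2}})^{\frac{p-1}{2^{s-1}}} = -1$), we obtain that $(2ad^{2\cdot \mathit{norm}'}d^{2^{s-2}})^{\frac{p-1}{2^{s-1}}} = -1$ and, furthermore, $(2ad^{2\cdot(\mathit{norm}' + 2^{s-3})})^{\frac{p-1}{2^{s-1}}} = -1$. Thus, we may choose $\mathit{norm} = \mathit{norm}' + 2^{s-3}$.

Inductive Case - Let us consider an arbitrary number $i$, $1 \le i < s - 1$. We assume that the statement holds for the case $i + 1$ and we will prove it for the case $i$.

If there is a natural number $\mathit{norm}'$ such that $(2ad^{2\cdot \mathit{norm}'})^{\frac{p-1}{2^i}} = 1$ or, equivalently, $\left((2ad^{2\cdot \mathit{norm}'})^{\frac{p-1}{2^{i+1}}}\right)^2 = 1$, then $(2ad^{2\cdot \mathit{norm}'})^{\frac{p-1}{2^{i+1}}} = \pm 1$. We have two cases:

– If $(2ad^{2\cdot \mathit{norm}'})^{\frac{p-1}{2^{i+1}}} = 1$ then, using the inductive hypothesis, we directly obtain that $(\exists\, \mathit{norm} \in \mathbb{N})\left((2ad^{2\cdot \mathit{norm}})^{\frac{p-1}{2^{s-1}}} = -1\right)$;

– If $(2ad^{2\cdot \mathit{norm}'})^{\frac{p-1}{2^{i+1}}} = -1$ then, using that $d^{\frac{p-1}{2}} = -1$ (or, equivalently, $(d^{2^i})^{\frac{p-1}{2^{i+1}}} = -1$), we obtain that $(2ad^{2\cdot \mathit{norm}'}d^{2^i})^{\frac{p-1}{2^{i+1}}} = 1$ and, furthermore, $(2ad^{2\cdot(\mathit{norm}' + 2^{i-1})})^{\frac{p-1}{2^{i+1}}} = 1$. Finally, using the inductive hypothesis, we obtain that the required statement holds. □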

The previous theorem leads to the following:

Corollary 4.2. Let $p$ be an odd prime, $p - 1 = 2^s t$ ($s \ge 2$, $t$ odd), $a$ a quadratic residue modulo $p$, and $d$ a quadratic non-residue modulo $p$. Then there exists $\mathit{norm} \in \mathbb{N}$ such that

\[
(2ad^{2\cdot \mathit{norm}})^{\frac{p-1}{2^{s-1}}} = -1.
\]

Proof: For $s = 2$, we obtain directly $\mathit{norm} = 0$, because in this case $2$ is a quadratic non-residue modulo $p$ and the relation $(2a)^{\frac{p-1}{2}} = -1$ holds.

For $s \ge 3$, $2$ is a quadratic residue and thus we have $(2a)^{\frac{p-1}{2}} = 1$. Using Theorem 4.1 for $i = 1$ ($\mathit{norm}' = 0$), we obtain that there is $\mathit{norm} \in \mathbb{N}$ such that

\[
(2ad^{2\cdot \mathit{norm}})^{\frac{p-1}{2^{s-1}}} = -1. \qquad \square
\]

Therefore, all other possible cases can be recursively reduced to the Base Case as presented above. In order to further clarify the points made so far, we will now give an algorithmic description of our generalization. We will use a special subroutine named FindPlace (presented in Figure 3), in which, starting with certain values for $a$ and $\mathit{norm}$ that satisfy $(2ad^{2\cdot \mathit{norm}})^{\frac{p-1}{2^i}} = 1$, for some $i$, we will search for a place $j$ as close as possible to $s - 1$ such that $\mathit{temp} = (2ad^{2\cdot \mathit{norm}})^{\frac{p-1}{2^j}} = \pm 1$.

Furthermore, we will also formulate the Base Case as a subroutine in Figure 4. Finally, the main part of our algorithm is presented in Figure 5.


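FindPlace($a$, $\mathit{norm}$)
begin
  1. if $\mathit{norm} = 0$ then $\mathit{temp} := (2a)^t$
  2. else $\mathit{temp} := (2ad^{2\cdot \mathit{norm}})^t$;
  3. $j := s$;
  4. repeat
  5.   $j := j - 1$;
  6.   $\mathit{temp} := \mathit{temp}^2$;
  7. until ($\mathit{temp} = 1 \lor \mathit{temp} = -1$)
  8. return ($j$, $\mathit{temp}$)
end.

Figure 3: FindPlace Subroutine

BaseCase($a$, $\mathit{norm}$)
begin
  1. $\alpha := (2ad^{2\cdot \mathit{norm}})^{\frac{t-1}{2}}$;
  2. $\beta := (2ad^{2\cdot \mathit{norm}})\alpha^2$;
  3. $b := \alpha a(\beta - 1)d^{\mathit{norm}}$;
  4. return $b$
end.

Figure 4: BaseCase Subroutine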

Remark 4.3. For the clarity of the presentation, we believe it is also necessary to make some comments and prove some statements on the Generalized Atkin Algorithm and its subroutines:

1. The variable $\mathit{norm}$ contains the current value of the normalization exponent.

2. Some useful properties of the subroutine FindPlace are presented next:

(a) If the output value $j$ of the subroutine FindPlace is not equal to $s - 1$, then the corresponding value $\mathit{temp}$ will be $-1$.

Proof: Because $j < s - 1$, at least two iterations of repeat-until have been performed (because initially $j = s$ and then $j$ is decremented in each iteration). If we assume by contradiction that the final value of $\mathit{temp}$ is $1$, then the previous value of $\mathit{temp}$ satisfies $\mathit{temp} = \pm 1$ (because $\mathit{temp} := \mathit{temp}^2$ in Step 6) and, thus, the algorithm had to terminate at the previous iteration. □

(b) Let $(j, \mathit{temp})$ and $(j', \mathit{temp}')$ be the outputs of two consecutive calls of the subroutine FindPlace. Then $j < j'$.


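Generalized Atkin Algorithm($p$, $a$)
input: $p$ prime such that $p \equiv 1 \pmod 4$, $a \in \mathbb{Z}_p^*$ a quadratic residue;
output: $b$, a square root of $a$ modulo $p$;
begin
  1. determine $s \ge 2$ and $t$ odd such that $p - 1 = 2^s t$;
  2. generate $d$, a quadratic non-residue modulo $p$;
  3. $\mathit{norm} := 0$;
  4. $(j, \mathit{temp}) :=$ FindPlace($a$, $\mathit{norm}$);
  5. while ($j < s - 1$)
  6. begin
  7.   $\mathit{norm} := \mathit{norm} + 2^{j-2}$;
  8.   $(j, \mathit{temp}) :=$ FindPlace($a$, $\mathit{norm}$);
  9. end
  10. if ($\mathit{temp} = -1$) then BaseCase($a$, $\mathit{norm}$)
  11. if ($\mathit{temp} = 1$) then
  12. begin
  13.   $\mathit{norm} := \mathit{norm} + 2^{s-3}$;
  14.   BaseCase($a$, $\mathit{norm}$);
  15. end
end.

Figure 5: Generalized Atkin Algorithm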

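Proof: Let us first point out that $j < s - 1$ (otherwise, if $j = s - 1$, there will not be another call of FindPlace, since the algorithm will end with a call of BaseCase), which implies that $j + 1 \le s - 1$. Therefore, by (a), we obtain $\mathit{temp} = (2ad^{2\cdot \mathit{norm}})^{\frac{p-1}{2^j}} = -1$. Furthermore, for the updated exponent $\mathit{norm}' = \mathit{norm} + 2^{j-2}$ we have $(2ad^{2\cdot \mathit{norm}'})^{\frac{p-1}{2^j}} = 1$, which implies that $(2ad^{2\cdot \mathit{norm}'})^{\frac{p-1}{2^{j+1}}} = \pm 1$, leading to $j + 1 \le j'$ (because $j'$ is the greatest element at most $s - 1$ such that $(2ad^{2\cdot \mathit{norm}'})^{\frac{p-1}{2^{j'}}} = \pm 1$). □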

3. If $p \equiv 5 \pmod 8$, i.e., $s = 2$, then FindPlace will be called exactly once (with $a$ and $\mathit{norm} = 0$) and it will output $j = s - 1 = 1$ and $\mathit{temp} = -1$. In this case, the subroutine BaseCase will directly lead to the final result (no normalization is required). Thus, we have obtained Atkin’s algorithm as a particular case of our algorithm.

4. If $p \equiv 9 \pmod{16}$, i.e., $s = 3$, then FindPlace will be called exactly once (with $a$ and $\mathit{norm} = 0$) and it will output $j = s - 1 = 2$ and $\mathit{temp} = \pm 1$. Two subcases are possible:

• In case $\mathit{temp} = -1$, the subroutine BaseCase will lead directly to the final result (no normalization is required);

• In case $\mathit{temp} = 1$, the normalization exponent will be updated as $\mathit{norm} = 0 + 2^{3-3} = 1$ and the subroutine BaseCase will be called. Consequently, the final result will be computed as $b := \alpha a(\beta - 1)d^1$ (Step 3 of BaseCase).

Thus, we have obtained Muller’s algorithm as a particular case of our algorithm.

5. Efficient Implementation and Performance Analysis

We start with the average-case and worst-case complexity analysis of our initial algorithm and then we discuss several improvements for efficient implementation. Finally, we present several comparisons with the most important generic square root computing methods, namely Tonelli-Shanks and Cipolla-Lehmer.

5.1. Average-Case and Worst-Case Complexity Analysis

We will consider the cases $s \ge 4$ (for $s = 2$ and $s = 3$ we obtain Atkin’s algorithm and Muller’s algorithm, respectively, whose complexities have been discussed in Section 3). Our algorithm determines the value of $\mathit{norm}$ by calling the subroutine FindPlace for each $1$ digit in the binary expression of $\mathit{norm}$. Therefore, the algorithm makes $H_w(\mathit{norm})$ calls to FindPlace, where $H_w(x)$ denotes the Hamming weight of $x$ (i.e., the number of $1$’s in $x$).

Let E denote one exponentiation, M one multiplication, and S one squaring (all these operations are performed modulo $p$). Our subroutines will involve:

• FindPlace - if the output is $(j, \mathit{temp})$ then at most $2\mathrm{E} + 1\mathrm{M} + (s - j)\,\mathrm{S}$;

• BaseCase - at most $3\mathrm{E} + 6\mathrm{M} + 1\mathrm{S}$.

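We exclude the complexity of generating a quadratic non-residue $d$. All the other computations can be considered negligible (if $\mathit{norm}$ is represented in base $2$ then the step $\mathit{norm} := \mathit{norm} + 2^{j-2}$ implies only setting a certain bit to $1$). In the average case, we have $H_w(\mathit{norm}) = \frac{s-2}{2}$, which means that our algorithm will include $\frac{s-2}{2}$ calls to FindPlace and a call to BaseCase. Thus, the total number of operations is, on average, the following:

\[
\begin{aligned}
& \frac{s-2}{2}(2\mathrm{E} + 1\mathrm{M}) + \sum_{j=2}^{s-1}\frac{s-j}{2}\,\mathrm{S} + 3\mathrm{E} + 6\mathrm{M} + 1\mathrm{S} \\
={}& (s-2)\mathrm{E} + \frac{s-2}{2}\mathrm{M} + \frac{(s-1)(s-2)}{4}\mathrm{S} + 3\mathrm{E} + 6\mathrm{M} + 1\mathrm{S} \\
={}& (s+1)\mathrm{E} + \frac{s+10}{2}\mathrm{M} + \frac{s^2-3s+6}{4}\mathrm{S}
\end{aligned}
\]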

In contrast, in the worst case, $\mathit{norm}$ will have $H_w(\mathit{norm}) = s - 2$ (i.e., all the bits of $\mathit{norm}$ will be equal to $1$), resulting in $(s - 2)$ calls to FindPlace and a call to BaseCase. The total number of operations now becomes the following:

\[
\begin{aligned}
& (s-2)(2\mathrm{E} + 1\mathrm{M}) + \sum_{j=2}^{s-1}(s-j)\,\mathrm{S} + 3\mathrm{E} + 6\mathrm{M} + 1\mathrm{S} \\
={}& 2(s-2)\mathrm{E} + (s-2)\mathrm{M} + \frac{(s-1)(s-2)}{2}\mathrm{S} + 3\mathrm{E} + 6\mathrm{M} + 1\mathrm{S} \\
={}& (2s-1)\mathrm{E} + (s+4)\mathrm{M} + \frac{s^2-3s+4}{2}\mathrm{S}
\end{aligned}
\]

Once more, we do not count the generation of a quadratic non-residue $d$. Consequently, both the average-case and the worst-case complexity of our initial algorithm are in $O((\log_2 p)^4)$.
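The two operation counts can be tabulated directly from the per-subroutine bounds. The Python sketch below (the function name `initial_algorithm_counts` and the dictionary layout are our own choices) sums the contributions term by term, so its output can be compared with the closed forms derived above.

```python
from fractions import Fraction


def initial_algorithm_counts(s: int) -> tuple[dict, dict]:
    """Average and worst-case counts of E, M, S for the initial algorithm (s >= 4)."""
    if s < 4:
        raise ValueError("the analysis above covers s >= 4")
    results = []
    # weight = expected fraction of set bits in norm: 1/2 on average, 1 in the worst case.
    for weight in (Fraction(1, 2), Fraction(1)):
        calls = weight * (s - 2)                        # calls to FindPlace
        squarings = weight * sum(s - j for j in range(2, s))
        results.append({
            "E": 2 * calls + 3,                         # 2E per FindPlace + 3E in BaseCase
            "M": calls + 6,                             # 1M per FindPlace + 6M in BaseCase
            "S": squarings + 1,                         # (s - j) S per call + 1S in BaseCase
        })
    return results[0], results[1]


if __name__ == "__main__":
    average, worst = initial_algorithm_counts(12)
    print(average)
    print(worst)
```

For $s = 12$ this gives $13\mathrm{E} + 11\mathrm{M} + \frac{57}{2}\mathrm{S}$ on average and $23\mathrm{E} + 16\mathrm{M} + 56\mathrm{S}$ in the worst case, in agreement with the closed forms.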


5.2. More Efficient Implementation

Several steps of our algorithm (especially the steps that involve exponentiations) can be performed much more efficiently than in their raw implementation. A first solution is to precompute several powers of $d$, to keep track of $d^{\mathit{norm}}$ and $(d^{\mathit{norm}})^{\frac{t-1}{2}}$ as $\mathit{norm}$ is updated, and to efficiently recompute the value of $\mathit{temp}$ from Step 2 of FindPlace by using the previous values. Moreover, in this case, the final exponentiations (from BaseCase) can also be performed efficiently.

We will now examine the computations behind our algorithm more closely, in order to point out possible improvements, if precomputations can be afforded. We begin by defining the elements $D_j$, $0 \le j \le s - 1$, $D_j = D^{2^j}$, $D = d^t$, and $A_j$, $0 \le j \le s - 2$, $A_j = A^{2^j}$, $A = (2a)^t$. Let us denote $A_j \cdot D^{\mathit{temp\_norm}}$ by $\langle j, \mathit{temp\_norm} \rangle$, where $\mathit{temp\_norm}$ is in binary form. In our algorithm, we determine the value of $\mathit{norm} = (f_{s-3} \ldots f_0)_2$ by successively computing the digits $f_0, f_1, \ldots, f_{s-3}$, so that:

\[
\begin{aligned}
T_{s-2} &= \langle s-2,\ f_0 \underbrace{00\ldots00}_{s-1\ \text{zeros}} \rangle = 1 \\
T_{s-3} &= \langle s-3,\ f_1 f_0 \underbrace{00\ldots00}_{s-2\ \text{zeros}} \rangle = 1 \\
&\ \ \vdots \\
T_2 &= \langle 2,\ f_{s-4} \ldots f_1 f_0 000 \rangle = 1 \\
T_1 &= \langle 1,\ f_{s-3} \ldots f_1 f_0 00 \rangle = -1
\end{aligned}
\]

We remark that the last element $T_1$ is exactly the element from the Base Case: $(2ad^{2\cdot \mathit{norm}})^{\frac{p-1}{2^{s-1}}}$. The reader will notice that for any $0 \le j \le s - 2$, we would compute $T_j = A_j \cdot D^{(f_{s-2-j} \ldots f_1 f_0 0 \ldots 0)_2}$ in a naive manner, by multiplying $A_j$ with all the $(2^i)$th powers of $D$ corresponding to the $1$ bits from $f_{s-2-j} \ldots f_1 f_0 0 \ldots 0$. In order to reduce the number of modular multiplications, let us choose a fixed, small integer $k$, with $k \ge 1$, and consider the $k$ terms $T_j, T_{j-1}, \ldots, T_{j-k+1}$, where $j - k + 1 > 1$. We obtain the following sequence:

\[
\begin{aligned}
T_j &= \langle j,\ f_{s-2-j} f_{s-3-j} \ldots f_1 f_0 \underbrace{0\ldots0}_{j+1\ \text{zeros}} \rangle = 1 \\
T_{j-1} &= \langle j-1,\ f_{s-2-j+1} f_{s-2-j} f_{s-3-j} \ldots f_1 f_0 \underbrace{0\ldots0}_{j\ \text{zeros}} \rangle = 1 \\
&\ \ \vdots \\
T_{j-k+2} &= \langle j-k+2,\ f_{s-2-j+k-2} \ldots f_{s-2-j} f_{s-3-j} \ldots f_1 f_0 \underbrace{0\ldots0}_{j-k+3\ \text{zeros}} \rangle = 1 \\
T_{j-k+1} &= \langle j-k+1,\ f_{s-2-j+k-1} \ldots f_{s-2-j} f_{s-3-j} \ldots f_1 f_0 \underbrace{0\ldots0}_{j-k+2\ \text{zeros}} \rangle = 1
\end{aligned}
\]


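Importantly, the sequence $f_{s-3-j} \ldots f_1 f_0$ appears in all of the above terms. Let us denote the term $D^{(f_{s-3-j} \ldots f_1 f_0 \overbrace{00\ldots00}^{j-k+2\ \text{zeros}})_2}$ by $\mathit{aux}$.

We notice that $T_j, T_{j-1}, \ldots, T_{j-k+1}$ can be computed in the following way, once we have determined $T_{j+1}$ (and, implicitly, $f_{s-3-j}, \ldots, f_1, f_0$):

\[
\begin{aligned}
T_j &= A_j \cdot \underbrace{D_{s-1}^{f_{s-2-j}}}_{S_0} \cdot \mathit{aux}^{2^{k-1}} \\
T_{j-1} &= A_{j-1} \cdot \underbrace{D_{s-1}^{f_{s-2-j+1}}}_{S_1} \cdot \underbrace{D_{s-2}^{f_{s-2-j}}}_{S_0} \cdot \mathit{aux}^{2^{k-2}} \\
&\ \ \vdots \\
T_{j-k+2} &= A_{j-k+2} \cdot \underbrace{D_{s-1}^{f_{s-2-j+k-2}}}_{S_{k-2}} \cdots \underbrace{D_{s-k+2}^{f_{s-2-j+1}}}_{S_1} \cdot \underbrace{D_{s-k+1}^{f_{s-2-j}}}_{S_0} \cdot \mathit{aux}^{2} \\
T_{j-k+1} &= A_{j-k+1} \cdot \underbrace{D_{s-1}^{f_{s-2-j+k-1}}}_{S_{k-1}} \cdot \underbrace{D_{s-2}^{f_{s-2-j+k-2}}}_{S_{k-2}} \cdots \underbrace{D_{s-k+1}^{f_{s-2-j+1}}}_{S_1} \cdot \underbrace{D_{s-k}^{f_{s-2-j}}}_{S_0} \cdot \mathit{aux}
\end{aligned}
\]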

We added underbraces with subscripts to the terms in order to highlight the fact that it is useful to see terms which have the same exponent as being part of a larger set. We will show how we can efficiently generate the powers of $\mathit{aux}$ and the sets $S_w$, for $0 \le w \le k - 1$.

Firstly, we compute $\mathit{aux}$ in the regular manner, and then $\mathit{aux}^{2^i}$, for $1 \le i \le k - 1$, through $k - 1$ modular squarings. This way, we use only $k - 1$ modular squarings instead of $(k - 1) \cdot H_w(f_{s-3-j} \ldots f_1 f_0)$ regular modular multiplications.

Secondly, we still have to compute the sets of terms $S_w = \{D_{s-k+w+z}^{f_{s-2-j+w}} \mid 0 \le z \le k - 1 - w\}$, for $0 \le w \le k - 1$. Intuitively, the set $S_w$ contains all the terms located $w$ positions below the main diagonal, for $0 \le w \le k - 1$. For each $w$, if $f_{s-2-j+w} = 0$ (as $T_{s-2-j+w} = 1$), we do not have to compute anything because $S_w = \{1\}$, while if $f_{s-2-j+w} = 1$ (as $T_{s-2-j+w} = -1$), each set can be easily generated by taking $D_{s-k+w}$ and applying $k - w - 1$ modular squarings.

Inner Loop
  1. compute $\mathit{aux}$
  2. set $T_{j-k+1} := A_{j-k+1} \cdot \mathit{aux}$;
  3. for $w = 0$ to $k - 1$ do:
  4.   determine $f_{s-2-j+w}$
  5.   update $T_{j-k+1}$ by setting $T_{j-k+1} := T_{j-k+1} \cdot D_{s-k+w}$
  6.   for $i = 2$ to $k - w - 1$ do:
  7.     update $T_{j-k+i}$ by setting $T_{j-k+i} := T_{j-k+i-1}^2$

Figure 6: Inner Loop


Thirdly, by storing the terms $T_j, T_{j-1}, \ldots, T_{j-k+1}$ we can efficiently combine the two aforementioned improvements as presented in Figure 6.

For each value of $w$ between $0$ and $k - 1$, running the inner loop generates both the necessary powers of $\mathit{aux}$ and the set $S_w$. Once the outer loop is completed, we have determined $k$ new digits from the binary representation of $\mathit{norm}$. We repeat this procedure until we know all the bits of $\mathit{norm}$.

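Finally, we can also simplify the last computations of our initial algorithm. The standard procedure would be to first calculate the terms $\alpha = (2ad^{2\cdot \mathit{norm}})^{\frac{t-1}{2}}$ and $\beta = (2ad^{2\cdot \mathit{norm}})\alpha^2$, and then to generate the square root $b = \alpha a(\beta - 1)d^{\mathit{norm}}$. However, if we expand the expression of $b$ we obtain

\[
\begin{aligned}
b &= a\alpha d^{\mathit{norm}}(\beta - 1) \\
&= a(2a)^{\frac{t-1}{2}} d^{2\cdot \mathit{norm}\cdot\frac{t-1}{2}} d^{\mathit{norm}} \left((2a)^t (d^{2\cdot \mathit{norm}})^t - 1\right) \\
&= a(2a)^{\frac{t-1}{2}} d^{t\cdot \mathit{norm}} \left((2a)^t (d^{t\cdot \mathit{norm}})^2 - 1\right)
\end{aligned}
\]

Once we have computed $(2a)^{\frac{t-1}{2}}$, we can then easily modify the final run of the inner loop in order to generate $D^{\mathit{norm}} = (d^t)^{\mathit{norm}}$ and to compute the value of $b$. When combined, our suggestions lead to a significantly improved version of our initial algorithm. The precomputation stage is as follows: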

Precomputation($p$, $a$, $k$)
input: $p$ prime such that $p \equiv 1 \pmod 4$, $p - 1 = 2^s t$, $t$ odd; $k$, $1 \le k \le s$, a precomputation parameter;
output: $D_j$, $0 \le j \le s - 1$, $D_j = D^{2^j}$, $D = d^t$, $d$ quadratic non-residue,
        $\mathit{aux}_A = (2a)^{\frac{t-1}{2}}$, $A = (2a)^t$ and $A_{s-1-k\cdot i}$, $1 \le i \le q$,
        where $q = \lfloor \frac{s-2}{k} \rfloor$, $A_{s-1-k\cdot i} = (2a)^{t \cdot 2^{s-1-k\cdot i}}$
begin
  1. generate and store $d$ (by any means available);
  2. compute and store $D$ and $D_j$, $0 \le j \le s - 1$ (by square-and-multiply exponentiation);
  3. compute and store $\mathit{aux}_A$, $A$ and $A_{s-1-k\cdot i}$, $1 \le i \le q$ (by square-and-multiply exponentiation);
end

Figure 7: Precomputation Subroutine

We have precomputed all the required powers of $D$, but only certain powers of $A$. It is not necessary to keep all the powers of $A$, since the missing powers can be generated as they are needed. This is because the algorithm behind the third improvement uses only one stored power of $A$ and implicitly employs $k - 1$ other powers of $A$, which are kept only for the duration of the outer loop. We have also computed the term $\mathit{aux}_A = (2a)^{\frac{t-1}{2}}$, which is part of the final improvement. Moreover, we assume that we have enough memory capacity to store $k$ numbers, namely $\mathit{ACC}_h$, where $1 \le h \le k$. These numbers are exactly $T_j, T_{j-1}, \ldots, T_{j-k+1}$, as used in the description of Inner Loop.


The main part of our improved algorithm is presented in Figure 8. Before running the actual algorithm, the Precomputation subroutine must be called. Note, however, that if $p$ is a priori known, Steps 1 and 2 from Precomputation need to be performed only once (and this may be done in advance), while Step 3 must be repeated for each $a$.

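Improved Generalized Atkin Algorithm($p$, $a$, $k$)
input: $p$ prime such that $p \equiv 1 \pmod 4$, $p - 1 = 2^s t$, $t$ odd; $a \in \mathbb{Z}_p^*$ a quadratic residue;
output: $b$, a square root of $a$ modulo $p$;
begin
  1. $\mathit{step} := s - 2$;
  2. $\mathit{auxnorm} := 0 = (e_{s-1} \ldots e_0)_2$; (the final $\mathit{auxnorm}$ is $4 \cdot \mathit{norm}$)
  3. $q := \lfloor \frac{s-2}{k} \rfloor$;
  4. $\mathit{rem} := (s - 2) \bmod k + 1$;
  5. for $i = 1$ to $q$ do
  6. begin
  7.   Complete Accumulator Update($\mathit{step}$, $k$);
  8.   Complete Inner Loop($\mathit{step}$, $k$);
  9. end
  11. Final Accumulator Update and Inner Loop($\mathit{step}$, $k$, $\mathit{rem}$);
  12. $b := a \cdot \mathit{aux}_A \cdot \mathit{aux}_{ACC} \cdot (A \cdot \mathit{aux}_{ACC}^2 - 1)$;
  13. return $b$
end.

Figure 8: Improved Generalized Atkin Algorithm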

The first two subroutines correspond to the Inner Loop (described in Figure 6) in the following manner:

• Complete Accumulator Update (presented in Figure 9) implements Steps 1 and 2, computing $\mathit{aux}$ and $T_{j-k+1}$.

• Complete Inner Loop (presented in Figure 10) implements the loop in Steps 3 through 7, computing the bits $f_{s-2-j}, \ldots, f_{s-3-j+k}$.

Final Accumulator Update and Inner Loop (presented in Figures 11 and 12) is an incomplete combination of a Complete Accumulator Update and a Complete Inner Loop, for determining the remaining bits of $\mathit{norm}$, since $s - 2$ may not be an exact multiple of $k$. Moreover, a slight adjustment is made in order to obtain the term $\mathit{aux}_{ACC} = D^{\mathit{norm}} = (d^t)^{\mathit{norm}}$. We consider the case $s = 2$ separately and set $\mathit{aux}_{ACC} := 1$, since this case does not fit in the general framework. For $s > 2$, the last part of this subroutine (Steps 15-34) computes the term $T_1$, which must be treated individually, as $T_1 = -1$ while all other $T_i$’s are equal to $1$, for $2 \le i \le s - 2$.


Complete Accumulator Update($\mathit{step}$, $k$)
begin
  1. $\mathit{ACC}_1 := A_{\mathit{step}-k+1} \prod_{j=\mathit{step}+2}^{s-1} (D_{j-k})^{e_j}$;
  2. for $j = 2$ to $k$ do $\mathit{ACC}_j := \mathit{ACC}_{j-1}^2$;
end

Figure 9: Complete Accumulator Update Subroutine

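Complete Inner Loop($\mathit{step}$, $k$)
begin
  1. for $j = k$ downto $1$ do
  2. begin
  3.   $\mathit{auxnorm} := \mathit{auxnorm}/2$;
  4.   if $\mathit{ACC}_j = -1$ then
  5.   begin
  6.     $\mathit{ACC}_1 := \mathit{ACC}_1 \cdot D_{s-j}$;
  7.     for $h = 2$ to $j - 1$ do $\mathit{ACC}_h := \mathit{ACC}_{h-1}^2$;
  8.     $e_{s-1} := 1$;
  9.   end
  10.  $\mathit{step} := \mathit{step} - 1$;
  11. end
end

Figure 10: Complete Inner Loop Subroutine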


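Final Accumulator Update and Inner Loop($\mathit{step}$, $k$, $\mathit{rem}$)
begin
  1. $\mathit{aux}_{ACC} := \prod_{j=\mathit{step}+2}^{s-1} (D_{j-\mathit{step}-2})^{e_j}$;
  2. $\mathit{ACC}_1 := A \cdot \mathit{aux}_{ACC}^2$;
  3. for $j = 2$ to $\mathit{rem}$ do $\mathit{ACC}_j := \mathit{ACC}_{j-1}^2$;
  4. for $j = \mathit{rem} - 1$ downto $2$ do
  5. begin
  6.   $\mathit{auxnorm} := \mathit{auxnorm}/2$;
  7.   if $\mathit{ACC}_j = -1$ then
  8.   begin
  9.     $\mathit{aux}_{ACC} := \mathit{aux}_{ACC} \cdot D_{s-1-j}$;
  10.    $\mathit{ACC}_1 := A \cdot \mathit{aux}_{ACC}^2$;
  11.    for $h = 2$ to $j - 1$ do $\mathit{ACC}_h := \mathit{ACC}_{h-1}^2$;
  12.    $e_{s-1} := 1$;
  13.  end
  14. end
  15. if $\mathit{rem} = 1$ then
  16.   if $s = 2$ then $\mathit{aux}_{ACC} := 1$;
  17.   else
  18.     if $e_{s-1} = 0$ then
  19.     begin
  20.       $\mathit{aux}_{ACC} := \mathit{aux}_{ACC} \cdot D_{s-3}$;
  21.       $e_{s-1} := 1$;
  22.     end
  23.     else begin
  24.       $\mathit{aux}_{ACC} := \mathit{aux}_{ACC} \cdot D_{s-3} \cdot D_{s-2} \cdot D_{s-1}$;
  25.       $e_{s-1} := 0$;
  26.     end

Figure 11: Final Accumulator Update and Inner Loop Subroutine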


  27. else begin
  28.   $\mathit{auxnorm} := \mathit{auxnorm}/2$;
  29.   if $\mathit{ACC}_2 = 1$ then
  30.   begin
  31.     $\mathit{aux}_{ACC} := \mathit{aux}_{ACC} \cdot D_{s-3}$;
  32.     $e_{s-1} := 1$;
  33.   end
  34. end
end

Figure 12: Final Accumulator Update and Inner Loop Subroutine (continued)

Example 5.1 illustrates the application of our improved algorithm.

Example 5.1. Let us consider $p = 12289$ ($s = 12$, $t = 3$) and $a = 2564$ ($2564$ is a quadratic residue modulo $12289$). We choose $d = 19$, $k = 3$ (therefore, $q = 3$), and obtain the following values:

i      0     1     2     3     4      5     6     7      8     9      10     11
D_i    6859  3589  2049  7852  12280  81    6561  10643  5736  4043   1479   12288
A_i    8835  9786  9908  3932  1062   9545  8668  11567  5146  10810  12288  -

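However, we will only store the $D_i$’s, for $0 \le i \le 11$, as well as $A_0$, $A_2$, $A_5$, $A_8$ and $\mathit{aux}_A = 5128$. We obtain $\mathit{step} = 10$, $\mathit{auxnorm} = 0$ and $\mathit{rem} = 2$.

For $i = 1$, we update the accumulators so that $\mathit{ACC}_1 = A_8 = 5146$, $\mathit{ACC}_2 = A_9 = 10810$ and $\mathit{ACC}_3 = A_{10} = -1$.

Entering the Complete Inner Loop, we have:

• since $\mathit{ACC}_3 = -1$, we have $\mathit{ACC}_1 = \mathit{ACC}_1 \cdot D_9 = 1$ and $\mathit{ACC}_2 = \mathit{ACC}_1^2 = 1$. Moreover, $e_{11} = 1$, $\mathit{auxnorm} = 0/2 + 2048 = 2048$ and $\mathit{step} = 9$;

• since $\mathit{ACC}_2 = 1$, we get $\mathit{auxnorm} = 2048/2 = 1024$ and $\mathit{step} = 8$;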


For $i = 2$, we update the accumulators so that $\mathit{ACC}_1 = 1$, $\mathit{ACC}_2 = 1$ and $\mathit{ACC}_3 = 1$.

Entering the Complete Inner Loop, we have:





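For $i = 3$, we update the accumulators so that $\mathit{ACC}_1 = 8246$, $\mathit{ACC}_2 = 1479$ and $\mathit{ACC}_3 = -1$.

Entering the Inner Loop, we have:

• since $\mathit{ACC}_3 = -1$, we have $\mathit{ACC}_1 = \mathit{ACC}_1 \cdot D_9 = 10810$ and $\mathit{ACC}_2 = \mathit{ACC}_1^2 = -1$. Furthermore, $e_{11} = 1$, $\mathit{auxnorm} = 64/2 + 2048 = 2080$ and $\mathit{step} = 3$;

• since $\mathit{ACC}_2 = -1$, we have $\mathit{ACC}_1 = \mathit{ACC}_1 \cdot D_{10} = 1$. Furthermore, $e_{11} = 1$, $\mathit{auxnorm} = 2080/2 + 2048 = 3088$ and $\mathit{step} = 2$;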


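We now perform the Final Accumulator Update and Inner Loop. Thus, we obtain $\mathit{aux}_{ACC} = D_0 \cdot D_6 \cdot D_7 = 1490$, $\mathit{ACC}_1 = A \cdot \mathit{aux}_{ACC}^2 = -1$ and $\mathit{ACC}_2 = \mathit{ACC}_1^2 = 1$. Since $\mathit{ACC}_2 = 1$, we obtain $e_{11} = 1$, $\mathit{auxnorm} = 1544/2 + 2048 = 2820$ (thus, $\mathit{norm} = 2820/4 = 705$) and $\mathit{aux}_{ACC} = \mathit{aux}_{ACC} \cdot D_9 = 2460$. The final computation gives us $b = a \cdot \mathit{aux}_A \cdot \mathit{aux}_{ACC} \cdot (A \cdot \mathit{aux}_{ACC}^2 - 1) = 2564 \cdot 5128 \cdot 2460 \cdot (10810 - 1) = 253$.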
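The final values of Example 5.1 can be checked with a few lines of Python. The sketch below does not run the improved algorithm itself; it takes $\mathit{norm} = 705$ from the example as given and only evaluates the closing formula for $b$. The variable names mirror the notation of the example and are otherwise our own.

```python
p, a, d = 12289, 2564, 19
s, t = 12, 3            # p - 1 = 2^12 * 3
norm = 705              # normalization exponent reported in the example

D = pow(d, t, p)                        # D = d^t = D_0
A = pow(2 * a, t, p)                    # A = (2a)^t = A_0
aux_A = pow(2 * a, (t - 1) // 2, p)     # (2a)^((t-1)/2)
aux_ACC = pow(D, norm, p)               # D^norm = (d^t)^norm
beta = A * aux_ACC * aux_ACC % p        # must be a square root of -1
b = a * aux_A * aux_ACC * (beta - 1) % p

if __name__ == "__main__":
    print(D, A, aux_A, aux_ACC, beta, b)
    print(beta * beta % p == p - 1, b * b % p == a)
```

The computed quantities agree with the example: $D = 6859$, $A = 8835$, $\mathit{aux}_A = 5128$, $\mathit{aux}_{ACC} = 2460$, $A \cdot \mathit{aux}_{ACC}^2 = 10810$ and $b = 253$, with $253^2 \equiv 2564 \pmod{12289}$.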

5.3. Average-Case and Worst-Case Complexity Analysis for the Improved Algorithm

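In the average case, we obtain the following complexities, based on the fact that $\mathit{norm}$ has around $s/2$ bits equal to $1$ in its representation:

• If $p$ is a priori known, Precomputation takes $1\mathrm{E}$ for the terms involving $A$ (the computation of the terms involving $D$ can be performed in advance). If $p$ is not a priori known, Precomputation takes $2\mathrm{E}$, which means $1\mathrm{E}$ for the terms involving $D$ and $1\mathrm{E}$ for the terms involving $A$.

• Complete Accumulator Update takes $\frac{s}{4}\mathrm{M}$ and $(k-1)\mathrm{S}$ (since, on average, we use $s/2$ bits from $\mathit{norm}$, each of which can be $0$ or $1$ with equal probability).

• Complete Inner Loop takes $\frac{k}{2}\mathrm{M} + \frac{k(k-1)}{4}\mathrm{S}$ (since we use $k$ bits from $\mathit{norm}$, each of which can be $0$ or $1$ with equal probability).

• Final Accumulator Update and Inner Loop takes $2 \cdot \frac{k}{2}\mathrm{M} + \frac{s}{4}\mathrm{M} + (k-1)\mathrm{S} + \frac{k}{2}\mathrm{M} + \frac{k(k-1)}{4}\mathrm{S}$.

• The final computation of $b$ takes $4\mathrm{M} + 1\mathrm{S}$.

The estimate does not include the generation of a quadratic non-residue $d$. In general, the computation takes about

\[
2\mathrm{E} + \frac{s}{k}\left(\frac{s}{4}\mathrm{M} + k\mathrm{S} + \frac{k}{2}\mathrm{M} + \frac{k(k-1)}{4}\mathrm{S}\right) + (k+4)\mathrm{M} + 1\mathrm{S}.
\]

This value is around

\[
2\mathrm{E} + \frac{3s}{4}\mathrm{S} + \frac{s}{2}\mathrm{M} + \frac{1}{4}\left(\frac{s^2}{k}\right)\mathrm{M} + \frac{1}{4}(sk)\mathrm{S} + k\mathrm{M}.
\]

Taking $k = \lceil\sqrt{s}\,\rceil$ (the optimal choice) leaves us with

\[
2\mathrm{E} + \frac{3s}{4}\mathrm{S} + \frac{s}{2}\mathrm{M} + \frac{1}{4}\left(s\lceil\sqrt{s}\,\rceil\right)\mathrm{M} + \frac{1}{4}\left(s\lceil\sqrt{s}\,\rceil\right)\mathrm{S} + \lceil\sqrt{s}\,\rceil\mathrm{M}.
\]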


If p is a priori known, we obtain1E + 3s4 S + s

2M + 14(s⌈

√s ⌉)M + 1

4(s⌈√s ⌉)S + ⌈√s ⌉M. In this case,

we needs precomputed elements and memory for just2⌈√s ⌉ additional elements.Moving on to the worst case, we consider the fact thatnorm’s binary representation has roughlys

bits which are equal to1. This results in the following complexities:

• Precomputation - same as for the average case.

• Complete Accumulator Update takes $\frac{s}{2}M$ and $(k-1)S$ (since, on average, we use $s/2$ bits from norm, and all of norm's bits are equal to 1).

• Complete Inner Loop takes $kM + \frac{k(k-1)}{2}S$ (since we use $k$ bits from norm, and all of norm's bits are equal to 1).

• Final Accumulator Update and Inner Loop takes $2kM + \frac{s}{2}M + (k-1)S + kM + \frac{k(k-1)}{2}S$.

• The final computation of $b$ takes $4M + 1S$ - same as for the average case.

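Again, we exclude the generation of a quadratic non-residue $d$. The computation takes at most about

$$2E + \frac{s}{k}\left(\frac{s}{2}M + kS + kM + \frac{k(k-1)}{2}S\right) + (2k+4)M + 1S.$$

This value is approximately

$$2E + \frac{s}{2}S + sM + \frac{1}{2}\left(\frac{s^2}{k}\right)M + \frac{1}{2}(sk)S + 2kM.$$

If we set $k = \lceil\sqrt{s}\,\rceil$ (the optimal choice), we have

$$2E + \frac{s}{2}S + sM + \frac{1}{2}(s\lceil\sqrt{s}\,\rceil)M + \frac{1}{2}(s\lceil\sqrt{s}\,\rceil)S + 2\lceil\sqrt{s}\,\rceil M.$$

If $p$ is a priori known, Precomputation costs $1E$ instead of $2E$, and we obtain

$$1E + \frac{s}{2}S + sM + \frac{1}{2}(s\lceil\sqrt{s}\,\rceil)M + \frac{1}{2}(s\lceil\sqrt{s}\,\rceil)S + 2\lceil\sqrt{s}\,\rceil M.$$

As in the average case, we need $s$ precomputed elements and memory for just $2\lceil\sqrt{s}\,\rceil$ additional elements. Consequently, both the average-case and the worst-case complexity of our improved algorithm are in $O((\log_2 p)^{3.5})$.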
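As a rough numerical illustration of the estimates in this subsection, the following Python sketch evaluates the approximate average-case and worst-case operation counts for given $s$ and $k$. The function names and the convention of returning a triple (exponentiations, squarings, multiplications) are assumptions made here for illustration only; the formulas are the approximate ones stated above, with Precomputation counted as $1E$ when $p$ is known a priori and $2E$ otherwise.

```python
import math

def optimal_k(s):
    """Return ceil(sqrt(s)), the choice of k described as optimal in the text."""
    r = math.isqrt(s)
    return r if r * r == s else r + 1

def average_cost(s, k, p_known=False):
    """Approximate average-case cost as (E, S, M) operation counts."""
    e = 1 if p_known else 2                  # Precomputation: 1E or 2E
    squarings = 3 * s / 4 + s * k / 4        # (3s/4)S + (1/4)(sk)S
    mults = s / 2 + s * s / (4 * k) + k      # (s/2)M + (1/4)(s^2/k)M + kM
    return e, squarings, mults

def worst_cost(s, k, p_known=False):
    """Approximate worst-case cost as (E, S, M) operation counts."""
    e = 1 if p_known else 2                  # Precomputation: 1E or 2E
    squarings = s / 2 + s * k / 2            # (s/2)S + (1/2)(sk)S
    mults = s + s * s / (2 * k) + 2 * k      # sM + (1/2)(s^2/k)M + 2kM
    return e, squarings, mults
```

For example, with $s = 16$ and $k = 4$ the average-case estimate is $2E + 28S + 28M$, while the worst-case estimate is $2E + 40S + 56M$.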

5.4. Comparisons with Other Methods

In this section we compare our algorithm with the most important square root algorithms, namely Tonelli-Shanks and Cipolla-Lehmer. After a short overview of these algorithms, we put forward a computational comparison of the three algorithms.

5.4.1. Tonelli-Shanks Algorithm

The Tonelli-Shanks algorithm ([22], [20]) reduces the problem of computing a square root to another famous problem, namely the discrete logarithm problem: given a finite cyclic group $G$, a generator $\alpha$ of it, and an arbitrary element $\beta \in G$, determine the unique $k$, $0 \le k \le |G| - 1$, such that $\beta = \alpha^k$. The element $k$ will be referred to as the discrete logarithm of $\beta$ in base $\alpha$, denoted by $k = \log_\alpha \beta$. Although this problem is considered intractable in general, if the order of the group is smooth, i.e., its prime factors do not exceed a given bound, there is an efficient algorithm due to Pohlig and Hellman [16].

Let us consider an odd prime $p$, $p = 2^s t + 1$, with $s \ge 2$ and $t$ odd, $a$ a quadratic residue and $d$ a quadratic non-residue (modulo $p$). The Tonelli-Shanks algorithm is based on the following simple facts:

1. Let $\alpha = d^t$. Then $|\langle\alpha\rangle| = 2^s$, or, equivalently, $ord(\alpha) = 2^s$, where $\langle\alpha\rangle$ denotes the subgroup induced by $\alpha$, and $ord(\alpha)$ represents the order of $\alpha$ (in $\mathbb{Z}_p^*$).

2. Let $\beta = a^t$. Then $\beta \in \langle\alpha\rangle$ and $\log_\alpha \beta$ is even (this discrete logarithm is considered with respect to the subgroup induced by $\alpha$).


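Thus, if we can determine $k$ such that $\beta = \alpha^k$, then $\sqrt{a}$ can be computed as

$$\sqrt{a} = a^{\frac{t+1}{2}} (d^{-1})^{\frac{k}{2}t}.$$

Indeed, $\left(a^{\frac{t+1}{2}} (d^{-1})^{\frac{k}{2}t}\right)^2 = a^{t+1}(d^{kt})^{-1} = a^{t+1}a^{-t} = a$.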

Thus, the difficult part is finding $k$, the discrete logarithm of $\beta$ in base $\alpha$ (in the subgroup $\langle\alpha\rangle$ of order $2^s$). Tonelli and Shanks compute the element $k$ bit by bit. Lindhurst [14] has proven that the Tonelli-Shanks algorithm requires on average two exponentiations, $\frac{s^2}{4}$ multiplications, and two quadratic character evaluations, with the worst-case complexity $O((\log_2 p)^4)$.

Bernstein [5] has proposed a method of computing $w$ bits of $k$ at a time. His algorithm involves an exponentiation and $\frac{s^2}{2w^2}$ multiplications, with a precomputation phase that additionally requires two quadratic character evaluations on average, an exponentiation, and about $2^w \frac{s}{w}$ multiplications, producing a table with $2^w \frac{s}{w}$ precomputed powers of $\alpha$.
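The basic bit-by-bit approach described above can be sketched in Python as follows. This is a minimal, unoptimized illustration that follows the formula $\sqrt{a} = a^{\frac{t+1}{2}} (d^{-1})^{\frac{k}{2}t}$ directly; it is not Bernstein's windowed variant, and the function name and the choice to pass the non-residue $d$ as an argument are assumptions made for the example.

```python
def tonelli_shanks_sqrt(a, p, d):
    """Square root of a quadratic residue a modulo an odd prime p.

    d must be a quadratic non-residue modulo p (supplied by the caller).
    """
    # Write p - 1 = 2^s * t with t odd.
    s, t = 0, p - 1
    while t % 2 == 0:
        s += 1
        t //= 2

    alpha = pow(d, t, p)              # element of order 2^s
    beta = pow(a, t, p)               # lies in the subgroup generated by alpha
    alpha_inv = pow(alpha, p - 2, p)  # inverse via Fermat's little theorem

    # Recover k with beta = alpha^k, one bit at a time (lowest bit first).
    k = 0
    for i in range(s):
        rem = beta * pow(alpha_inv, k, p) % p   # alpha^(k - known low bits)
        # rem^(2^(s-1-i)) equals 1 exactly when bit i of k is 0.
        if pow(rem, 1 << (s - 1 - i), p) != 1:
            k |= 1 << i

    # k is even, so k // 2 is exact.
    d_inv = pow(d, p - 2, p)
    return pow(a, (t + 1) // 2, p) * pow(d_inv, (k // 2) * t, p) % p
```

For instance, with $p = 41 = 2^3 \cdot 5 + 1$, $d = 3$ and $a = 2$, the sketch finds $k = 6$ and returns a value whose square is congruent to 2 modulo 41.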

5.4.2. Cipolla-Lehmer Algorithm

The following square root algorithm is due to Cipolla [6] and Lehmer [12]. Cipolla's method is based on arithmetic in quadratic extension fields, which is briefly recalled below.

Let us consider an odd prime $p$ and $a$ a quadratic residue modulo $p$. We first generate an element $z \in \mathbb{Z}_p^*$ such that $z^2 - a$ is a quadratic non-residue. The extension field $\mathbb{Z}_p(\sqrt{z^2 - a})$ is constructed as follows:

• its elements are pairs $(x, y) \in \mathbb{Z}_p^2$;

• the addition is defined as $(x, y) + (x', y') = (x + x', y + y')$;

• the multiplication is defined as $(x, y) \cdot (x', y') = (xx' + yy'(z^2 - a), xy' + x'y)$;

• the additive identity is $(0, 0)$, and the multiplicative identity is $(1, 0)$;

• the additive inverse of $(x, y)$ is $(-x, -y)$ and its multiplicative inverse is $(x(x^2 - y^2(z^2 - a))^{-1}, -y(x^2 - y^2(z^2 - a))^{-1})$.

Cipolla has remarked that a square root of $a$ can be computed using the identity

$$(z, 1)^{\frac{p+1}{2}} = (\sqrt{a}, 0),$$

and his method requires two quadratic character evaluations on average and at most $6 \log_2 p$ multiplications ([7, page 96]).
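To make the extension-field arithmetic concrete, here is a small Python sketch of Cipolla's method using the pair representation and multiplication rule given above. The helper `find_z`, which scans $z = 1, 2, \ldots$ and tests $z^2 - a$ with Euler's criterion, is a hypothetical convenience added for this example (the text only says that such a $z$ is generated); the exponentiation is plain square-and-multiply rather than an optimized routine.

```python
def find_z(a, p):
    """Return some z with z^2 - a a quadratic non-residue mod p (linear scan)."""
    for z in range(1, p):
        # Euler's criterion: w is a non-residue iff w^((p-1)/2) = -1 mod p.
        if pow((z * z - a) % p, (p - 1) // 2, p) == p - 1:
            return z
    raise ValueError("no suitable z found")

def cipolla_sqrt(a, p, z):
    """Compute (z, 1)^((p+1)/2) in Z_p(sqrt(z^2 - a)); returns the pair (x, y)."""
    w = (z * z - a) % p

    def mul(u, v):
        # (x, y) * (x', y') = (x x' + y y' (z^2 - a), x y' + x' y)
        (x, y), (x2, y2) = u, v
        return ((x * x2 + y * y2 * w) % p, (x * y2 + x2 * y) % p)

    result, base, e = (1, 0), (z % p, 1), (p + 1) // 2
    while e:
        if e & 1:
            result = mul(result, base)
        base = mul(base, base)
        e >>= 1
    return result   # expected to be (sqrt(a), 0) when a is a quadratic residue
```

A typical call is `x, y = cipolla_sqrt(a, p, find_z(a, p))`; for a quadratic residue `a` the second component `y` is 0 and `x` is a square root of `a`.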

Lehmer's method is based on evaluating Lucas sequences. Let us consider the sequence $(V_k)_{k \ge 0}$ defined by $V_0 = 2$, $V_1 = z$, and $V_k = zV_{k-1} - aV_{k-2}$, for all $k \ge 2$, where $z \in \mathbb{Z}_p^*$ is generated such that $z^2 - 4a$ is a quadratic non-residue. Lehmer has proved that

$$\sqrt{a} = \frac{1}{2} V_{\frac{p+1}{2}},$$

and his method requires two quadratic character evaluations on average and about $4.5 \log_2 p$ multiplications ([18]). Muller [15] has proposed an improved variant that requires only $2 \log_2 p$ multiplications, which will be referred to as the Improved Cipolla-Lehmer.
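Lehmer's formula can be illustrated with the following Python sketch, which evaluates $V_n$ modulo $p$ using a simple binary ladder based on the standard identities $V_{2k} = V_k^2 - 2a^k$ and $V_{2k+1} = V_k V_{k+1} - z a^k$. This is only a straightforward illustration, not the optimized evaluation of [18] nor Muller's improved variant, and the function names are assumptions made for the example.

```python
def lucas_v(z, a, n, p):
    """V_n mod p for V_0 = 2, V_1 = z, V_k = z*V_{k-1} - a*V_{k-2}."""
    # Invariant: (v0, v1, q) = (V_k, V_{k+1}, a^k) mod p, starting at k = 0.
    v0, v1, q = 2 % p, z % p, 1
    for bit in bin(n)[2:]:
        if bit == '1':
            # k -> 2k + 1
            v0, v1 = (v0 * v1 - z * q) % p, (v1 * v1 - 2 * a * q) % p
            q = q * q * a % p
        else:
            # k -> 2k
            v0, v1 = (v0 * v0 - 2 * q) % p, (v0 * v1 - z * q) % p
            q = q * q % p
    return v0

def lehmer_sqrt(a, p, z):
    """sqrt(a) = (1/2) * V_{(p+1)/2} mod p, for z with z^2 - 4a a non-residue."""
    inv2 = (p + 1) // 2              # inverse of 2 modulo the odd prime p
    return lucas_v(z, a, (p + 1) // 2, p) * inv2 % p
```

As a small check, for $p = 7$, $a = 2$ and $z = 2$ (so that $z^2 - 4a \equiv 3$ is a non-residue), the sequence is $2, 2, 0, 3, 6, \ldots$, hence $V_4 = 6$ and the sketch returns $3$, with $3^2 \equiv 2 \pmod 7$.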


5.4.3. Test Results

We have implemented Improved Generalized Atkin (Imp-Gen-Atk) and the fastest known algorithms, namely Tonelli-Shanks-Bernstein (Ton-Sha-Ber) and Improved Cipolla-Lehmer (Imp-Cip-Leh). For all pairs $(\log_2 p, s)$, $\log_2 p \in \{128, 256, 512, 1024\}$, $s \in \{4, 8, 16, \frac{\log_2 p}{2}\}$, we have generated 32 pairs $(p, a)$, where $a$ is a quadratic residue modulo $p$, and we have counted the average number of modular squarings and regular modular multiplications. We have considered two cases, depending on whether $p$ is known a priori or not. We have not included the computation required for finding a quadratic non-residue modulo $p$. For exponentiation we have considered the simplest method, namely square-and-multiply exponentiation. For an exponent $x$, this method requires $\log_2 x$ squarings and $H_w(x)$ regular multiplications, where $H_w(x)$ denotes the Hamming weight of $x$.
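The operation counts for square-and-multiply can be reproduced with a short Python sketch such as the one below. It is an illustrative left-to-right variant written for this comparison, not the implementation used for the tests; because it starts from the base rather than from 1, it performs $\lfloor \log_2 x \rfloor$ squarings and $H_w(x) - 1$ multiplications, which agrees with the figures quoted above up to one operation.

```python
def pow_counting(base, x, p):
    """Left-to-right square-and-multiply for x >= 1.

    Returns (base^x mod p, number of squarings, number of multiplications).
    """
    result, squarings, mults = base % p, 0, 0
    for bit in bin(x)[3:]:           # skip the '0b' prefix and the leading 1 bit
        result = result * result % p
        squarings += 1
        if bit == '1':
            result = result * base % p
            mults += 1
    return result, squarings, mults
```

For example, the exponent $x = 13 = 1101_2$ leads to 3 squarings and 2 regular multiplications.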

For Improved Generalized Atkin we choose the optimal $k = \lceil\sqrt{s}\,\rceil$, requiring $s + 2\lceil\sqrt{s}\,\rceil$ stored values. For Tonelli-Shanks-Bernstein, given that the number of needed precomputed values is $\frac{s 2^w}{w}$, in order to reach a number of elements comparable with ours, we choose the parameter $w = 2$ (that leads to $2s$ elements). We have to remark that the performance of Improved Cipolla-Lehmer does not depend on $s$.

We present the results for the case that $p$ is not known a priori in Tables 1-5. In each column the first value indicates the average number of squarings and the second one denotes the average number of regular multiplications.

| Method \ $\log_2 p$ | 128 | 256 | 512 | 1024 |
|---|---|---|---|---|
| Imp-Gen-Atk | 256 / 138 | 512 / 262 | 1024 / 515 | 2048 / 1036 |
| Ton-Sha-Ber | 255 / 146 | 511 / 300 | 1023 / 562 | 2047 / 1076 |
| Imp-Cip-Leh | 126 / 124 | 254 / 252 | 510 / 508 | 1022 / 1020 |

Table 1. Comparison between methods for $s = 4$, where $p$ is unknown

| Method \ $\log_2 p$ | 128 | 256 | 512 | 1024 |
|---|---|---|---|---|
| Imp-Gen-Atk | 260 / 136 | 515 / 267 | 1027 / 521 | 2052 / 1031 |
| Ton-Sha-Ber | 255 / 179 | 511 / 300 | 1023 / 562 | 2047 / 1076 |
| Imp-Cip-Leh | 126 / 124 | 254 / 252 | 510 / 508 | 1022 / 1020 |

| Method \ $\log_2 p$ | 128 | 256 | 512 | 1024 |
|---|---|---|---|---|
| Imp-Gen-Atk | 270 / 138 | 527 / 272 | 1039 / 527 | 2062 / 1041 |
| Ton-Sha-Ber | 255 / 284 | 511 / 412 | 1023 / 668 | 2047 / 1178 |
| Imp-Cip-Leh | 126 / 124 | 254 / 252 | 510 / 508 | 1022 / 1020 |

| Method \ $\log_2 p$ | 128 | 256 | 512 | 1024 |
|---|---|---|---|---|
| Imp-Gen-Atk | 303 / 171 | 559 / 297 | 1070 / 551 | 2096 / 1059 |
| Ton-Sha-Ber | 253 / 301 | 509 / 426 | 1021 / 686 | 2045 / 1195 |
| Imp-Cip-Leh | 126 / 124 | 254 / 252 | 510 / 508 | 1022 / 1020 |


| Method \ $\log_2 p$ | 128 | 256 | 512 | 1024 |
|---|---|---|---|---|
| Imp-Gen-Atk | 392 / 217 | 904 / 526 | 2084 / 1334 | 5096 / 3721 |
| Ton-Sha-Ber | 253 / 723 | 509 / 2468 | 1021 / 9021 | 2045 / 34431 |
| Imp-Cip-Leh | 126 / 124 | 254 / 252 | 510 / 508 | 1022 / 1020 |

Table 5. Comparison between methods for $s = \frac{\log_2 p}{2}$, where $p$ is unknown

In case that $p$ is not known a priori, Improved Cipolla-Lehmer is clearly the best, while our algorithm is comparable with Tonelli-Shanks-Bernstein.

We are interested in determining the values of $s$ for which our algorithm is more efficient than Improved Cipolla-Lehmer and/or Tonelli-Shanks-Bernstein, considering the case that $p$ is known a priori. We express $1E$ as $(\log_2 p)\,S + \frac{\log_2 p - s}{2}M$. To simplify the comparisons, we no longer distinguish between squarings and regular multiplications.

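More precisely, let us first determine $s$ such that our algorithm is more efficient than Improved Cipolla-Lehmer in terms of total computation:

$$\log_2 p + \frac{\log_2 p - s}{2} + \frac{3s}{4} + \frac{s}{2} + \frac{1}{4}(s\lceil\sqrt{s}\,\rceil) + \frac{1}{4}(s\lceil\sqrt{s}\,\rceil) + \lceil\sqrt{s}\,\rceil < 2\log_2 p$$

We obtain the following sequence of equivalent inequalities:

$$\log_2 p + \frac{\log_2 p - s}{2} + \frac{s\lceil\sqrt{s}\,\rceil}{2} + \frac{5s}{4} + \lceil\sqrt{s}\,\rceil < 2\log_2 p$$

$$\frac{s\lceil\sqrt{s}\,\rceil}{2} + \frac{3s}{4} + \lceil\sqrt{s}\,\rceil < \frac{\log_2 p}{2}$$

$$s\lceil\sqrt{s}\,\rceil + \frac{3s}{2} + 2\lceil\sqrt{s}\,\rceil < \log_2 p$$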

We now turn our attention to Tonelli-Shanks-Bernstein with the parameter $w = 2$. A more thorough analysis of this algorithm gives us $\log_2 p + \frac{\log_2 p - s}{2} + \frac{s^2}{8} + \frac{3s}{2}$ multiplications.

We obtain the following inequality:

$$\frac{s\lceil\sqrt{s}\,\rceil}{2} + \frac{5s}{4} + \lceil\sqrt{s}\,\rceil < \frac{s^2}{8} + \frac{3s}{2}$$

which leads to $s > 20$.
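Both conditions are easy to check numerically. The Python sketch below encodes the final inequality of each derivation above; the function names are chosen here purely for illustration, and the predicates only reflect the simplified operation counts used in this comparison (squarings and multiplications are not distinguished).

```python
import math

def ceil_sqrt(s):
    """Return ceil(sqrt(s)) using exact integer arithmetic."""
    r = math.isqrt(s)
    return r if r * r == s else r + 1

def beats_cipolla_lehmer(s, log2_p):
    """True if s*ceil(sqrt(s)) + 3s/2 + 2*ceil(sqrt(s)) < log2(p)."""
    k = ceil_sqrt(s)
    return s * k + 3 * s / 2 + 2 * k < log2_p

def beats_tonelli_shanks_bernstein(s):
    """True if s*ceil(sqrt(s))/2 + 5s/4 + ceil(sqrt(s)) < s^2/8 + 3s/2 (w = 2)."""
    k = ceil_sqrt(s)
    return s * k / 2 + 5 * s / 4 + k < s * s / 8 + 3 * s / 2
```

Evaluating the second predicate for $s = 1, \ldots, 99$ gives `True` exactly for $s \ge 21$, in agreement with the bound $s > 20$; at $s = 20$ both sides equal 80, so the strict inequality fails.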


We present the results for the case that $p$ is known a priori in Tables 6-10. We remind the reader that in each column the first value indicates the average number of squarings and the second one denotes the average number of regular multiplications.

| Method \ $\log_2 p$ | 128 | 256 | 512 | 1024 |
|---|---|---|---|---|
| Imp-Gen-Atk | 128 / 72 | 256 / 133 | 512 / 269 | 1024 / 520 |
| Ton-Sha-Ber | 126 / 71 | 254 / 136 | 510 / 265 | 1022 / 522 |
| Imp-Cip-Leh | 126 / 124 | 254 / 252 | 510 / 508 | 1022 / 1020 |

Table 6. Comparison between methods for $s = 4$, where $p$ is known a priori

| Method \ $\log_2 p$ | 128 | 256 | 512 | 1024 |
|---|---|---|---|---|
| Imp-Gen-Atk | 131 / 80 | 260 / 136 | 515 / 276 | 1028 / 522 |
| Ton-Sha-Ber | 127 / 112 | 255 / 176 | 511 / 305 | 1023 / 559 |
| Imp-Cip-Leh | 126 / 124 | 254 / 252 | 510 / 508 | 1022 / 1020 |

| Method \ $\log_2 p$ | 128 | 256 | 512 | 1024 |
|---|---|---|---|---|
| Imp-Gen-Atk | 141 / 79 | 270 / 155 | 526 / 262 | 1038 / 524 |
| Ton-Sha-Ber | 126 / 125 | 254 / 187 | 510 / 316 | 1022 / 574 |
| Imp-Cip-Leh | 126 / 124 | 254 / 252 | 510 / 508 | 1022 / 1020 |

| Method \ $\log_2 p$ | 128 | 256 | 512 | 1024 |
|---|---|---|---|---|
| Imp-Gen-Atk | 176 / 116 | 308 / 182 | 560 / 317 | 1072 / 560 |
| Ton-Sha-Ber | 126 / 246 | 254 / 310 | 510 / 441 | 1020 / 697 |
| Imp-Cip-Leh | 126 / 124 | 254 / 252 | 510 / 508 | 1022 / 1020 |



| Method \ $\log_2 p$ | 128 | 256 | 512 | 1024 |
|---|---|---|---|---|
| Imp-Gen-Atk | 268 / 186 | 649 / 453 | 1597 / 1257 | 4043 / 3425 |
| Ton-Sha-Ber | 126 / 686 | 254 / 2399 | 510 / 8895 | 1022 / 34172 |
| Imp-Cip-Leh | 126 / 124 | 254 / 252 | 510 / 508 | 1022 / 1020 |

Table 10. Comparison between methods for $s = \frac{\log_2 p}{2}$, where $p$ is known a priori

6. Conclusions and Future Work

In this paper we have extended Atkin's algorithm to the general case $p \equiv 2^s + 1 \bmod 2^{s+1}$, for any $s \ge 2$, thus providing a complete solution for the case $p \equiv 1 \bmod 4$. Complexity analysis and comparisons with other methods have also been provided.

An interesting problem is extending our algorithm to arbitrary finite fields. In the case of the finite fields $GF(p^k)$, for $k$ odd, the efficient techniques described in [11], [9] can be adapted to our case in a straightforward manner, but, to the best of our knowledge, there are no similar techniques for the case $GF(p^k)$, for $k$ even. We will focus on this topic in our future work.

Acknowledgements

We would like to thank the two anonymous reviewers for their helpful suggestions.

References

[1] Ankeny, N. C.: The Least Quadratic Non Residue, Annals of Mathematics, 55(1), 1952, 65–72.

[2] Atkin, A.: Probabilistic primality testing (summary by F. Morain), Technical Report 1779, INRIA, 1992, URL: http://algo.inria.fr/seminars/sem91-92/atkin.pdf.

[3] Atkin, A., Morain, F.: Elliptic Curves and Primality Proving, Mathematics of Computation, 61(203), 1993, 29–68.

[4] Bach, E., Shallit, J.: Algorithmic Number Theory, Volume I: Efficient Algorithms, MIT Press, 1996.

[5] Bernstein, D. J.: Faster square roots in annoying finite fields (preprint), 2001, URL: http://cr.yp.to/papers/sqroot.pdf.

[6] Cipolla, M.: Un metodo per la risoluzione della congruenza di secondo grado, Rendiconto dell'Accademia delle Scienze Fisiche e Matematiche, Napoli, 9, 1903, 154–163.

[7] Crandall, R., Pomerance, C.: Prime Numbers. A Computational Perspective, Springer-Verlag, 2001.

[8] Eikenberry, S., Sorenson, J.: Efficient Algorithms for Computing the Jacobi Symbol, Journal of Symbolic Computation, 26(4), 1998, 509–523.

[9] Han, D.-G., Choi, D., Kim, H.: Improved Computation of Square Roots in Specific Finite Fields, IEEE Transactions on Computers, 58(2), 2009, 188–196.

[10] IEEE Std 2000-1363. Standard Specifications For Public-Key Cryptography, 2000.

[11] Kong, F., Cai, Z., Yu, J., Li, D.: Improved generalized Atkin algorithm for computing square roots in finite fields, Information Processing Letters, 98(1), 2006, 1–5.

[12] Lehmer, D.: Computer technology applied to the theory of numbers, Studies in number theory (W. Leveque, Ed.), 6, Prentice-Hall, 1969.

[13] Lemmermeyer, F.: Reciprocity Laws. From Euler to Eisenstein, Springer-Verlag, 2000.

[14] Lindhurst, S.: An analysis of Shanks's algorithm for computing square roots in finite fields, in: Number theory (R. Gupta, K. Williams, Eds.), American Mathematical Society, 1999, 231–242.

[15] Muller, S.: On the Computation of Square Roots in Finite Fields, Designs, Codes and Cryptography, 31(3), 2004, 301–312.

[16] Pohlig, S., Hellman, M.: An improved algorithm for computing logarithms over GF(p) and its cryptographic significance, IEEE Transactions on Information Theory, 24, 1978, 106–110.

[17] Pomerance, C.: The Quadratic Sieve Factoring Algorithm, Advances in Cryptology: Proceedings of EUROCRYPT 84 (T. Beth, N. Cot, I. Ingemarsson, Eds.), 209, Springer-Verlag, 1985.

[18] Postl, H.: Fast evaluation of Dickson Polynomials, in: Contributions to General Algebra (D. Dorninger, G. Eigenthaler, H. Kaiser, W. Muller, Eds.), vol. 6, B.G. Teubner, 1988, 223–225.

[19] Schoof, R.: Elliptic Curves Over Finite Fields and the Computation of Square Roots mod p, Mathematics of Computation, 44(170), 1985, 483–494.

[20] Shanks, D.: Five number-theoretic algorithms, Proceedings of the second Manitoba conference on numerical mathematics (R. Thomas, H. Williams, Eds.), 7, Utilitas Mathematica, 1973.

[21] Sze, T.-W.: On taking square roots without quadratic nonresidues over finite fields, Mathematics of Computation, 80(275), 2011, 1797–1811, (a preliminary version of this paper has appeared as arXiv e-print, available at http://arxiv.org/abs/0812.2591v3).

[22] Tonelli, A.: Bemerkung über die Auflösung quadratischer Congruenzen, Göttinger Nachrichten, 1891, 344–346.
