PARALLEL COMPLEXITY OF COMPUTATIONS WITH GENERAL AND TOEPLITZ-LIKE MATRICES FILLED WITH INTEGERS AND EXTENSIONS∗

VICTOR Y. PAN†

SIAM J. COMPUT. © 2000 Society for Industrial and Applied Mathematics
Vol. 30, No. 4, pp. 1080–1125

Abstract. Computations with Toeplitz and Toeplitz-like matrices are fundamental for many areas of algebraic and numerical computing. The list of computational problems reducible to Toeplitz and Toeplitz-like computations includes, in particular, the evaluation of the greatest common divisor (gcd), the least common multiple (lcm), and the resultant of two polynomials, computing Padé approximation and the Berlekamp–Massey recurrence coefficients, as well as numerous problems reducible to these. Transition to Toeplitz and Toeplitz-like computations is currently the basis for the design of the parallel randomized NC (RNC) algorithms for these computational problems.

Our main result is the construction of nearly optimal randomized parallel algorithms for Toeplitz and Toeplitz-like computations and, consequently, for numerous related computational problems (including the computational problems listed above), where all the input values are integers and all the output values are computed exactly. This includes randomized parallel algorithms for computing the rank, the determinant, and a basis for the null-space of an $n \times n$ Toeplitz or Toeplitz-like matrix $A$ filled with integers, as well as a solution $x$ to a linear system $Ax = f$ if the system is consistent. Our algorithms use $O((\log n)\log(n \log \|A\|))$ parallel time and $O(n \log n)$ processors, each capable of performing (in unit time) an arithmetic operation, a comparison, or a rounding of a rational number to a closest integer. The cost bounds cover the cost of the verification of the correctness of the output. The computations by these algorithms can be performed with the precision of $O(n \log \|A\|)$ bits, which matches the precision required in order to represent the output, except for the rank computation, where the precision of the computation decreases. The algorithms involve either a single random parameter or at most $2n - 1$ parameters.

The cited processor bounds are smaller, by roughly a factor of $n$, than those supported by the known algorithms that run in polylogarithmic arithmetic time and do not use rounding to the closest integers.

Technically, we first devise new algorithms supporting our old nearly optimal complexity estimates for parallel computations with general matrices filled with integers. Then we decrease dramatically, by roughly a factor of $n^{1.376}$, the processor bounds required in these algorithms in the case where the input matrix is Toeplitz-like. Our algorithms exploit and combine some new techniques (which may be of independent interest, e.g., in the study of parallel and sequential computation of recursive factorization of integer matrices) as well as our earlier techniques of variable diagonal (relating to each other several known algebraic and numerical methods), stream contraction, and the truncation of displacement generators in Toeplitz-like computations; our development and application of these techniques may be of independent interest.

Key words. parallel algorithms, randomized algorithms, Toeplitz matrix computations, Toeplitz-like matrices, polynomial gcd, displacement rank, computational complexity, block Gauss–Jordan decomposition, p-adic lifting, Newton–Hensel's lifting

AMS subject classifications. 68Q22, 68Q25, 68Q40, 65Y20, 47B35, 65F30

PII. S0097539797349959

1. Introduction.

1.1. Toeplitz and Toeplitz-like matrices and some applications. The fast version of the Euclidean algorithm [AHU74], [BGY80] computes the greatest common divisor (gcd) of two polynomials of degrees at most $n$ by using $O(n(\log n)^2)$ arithmetic or field operations (that is, additions, subtractions, multiplications, and divisions), but to yield substantial parallel acceleration, one has to reduce the problem to the solution of the associated (possibly singular) Toeplitz-like (resultant or subresultant) or Toeplitz linear system of $O(n)$ equations, $Tx = f$ [BGH82], [G84], [BP94]. (The concepts of Toeplitz and Toeplitz-like matrices are well known (see [KKM79], [CKL-A87], [BP94], pp. 47–48, 138–141, 148–151), but for the reader's convenience, we recall their definitions below. Also see Definitions 2.18 and 13.2 in sections 2 and 13.)

∗Received by the editors January 18, 1999; accepted for publication (in revised form) March 20, 2000; published electronically August 29, 2000. The results of this paper have been presented at the ACM–SIAM Workshop on Mathematics of Numerical Analysis: Real Numbers Algorithms, Park City, Utah, 1995.
http://www.siam.org/journals/sicomp/30-4/34995.html

†Department of Mathematics and Computer Science, Lehman College, City University of New York, Bronx, NY 10468 ([email protected]). The work of this author was supported by NSF grants CCR 9020690 and CCR 9625344 and PSC CUNY Award 666327.

The gcd computation is only one (though celebrated) example of various major problems of algebraic and numerical computing whose solution is reduced to solving Toeplitz or Toeplitz-like linear systems of equations. The list of such problems includes the computation of the resultant, the Sturm and subresultant sequences, and the least common multiple (lcm) for a pair of univariate polynomials ([BT71], [BGY80], [BP94], sections 2.8–2.10), as well as the shift register synthesis and linear recurrence computation [Be68], [Ma75], inverse scattering [BK87], adaptive filtering [K74], [H91], modelling of stationary and nonstationary processes [KAGKA89], [KVM78], [K87], [L-AK84], [L-AKC84], numerical computations for Markov chains [Ste94], Padé approximation of an analytic function [BGY80], polynomial interpolation and multipoint evaluation [PSLT93], [PZHY97], solution of partial differential and integral equations [Bun85], [C47/48], [KLM78], [KVM78], parallel computations with general matrices over an arbitrary field of constants [P91], [P92], [KP91], [KP92], approximating polynomial zeros [P95], [P96a], [P97], and the solution of polynomial systems of equations [EP97], [MP98], [BMP98].

Furthermore, the general reduction techniques of [P90] enable us to extend the algorithms available for Toeplitz and Toeplitz-like computations to computations with some other major classes of structured matrices, such as Cauchy-like and Vandermonde-like matrices, also highly important in many areas of computing [PSLT93], [H95], [GKO95], [PZHY97], [OP98], [OP99], [P00], [P00a].

The design of new effective algorithms for parallel solution of Toeplitz and Toeplitz-like linear systems will be our major goal. For the reader's convenience, we will next briefly recall the definitions and some basic properties of Toeplitz and Toeplitz-like matrices. (See section 13 and Definition 2.18 of section 2 for more details.)

$T = (t_{i,j})$ is an $n \times n$ Toeplitz matrix if

$$t_{i,j} = t_{i+1,j+1} \quad \text{for } i, j = 0, 1, \ldots, n-2, \tag{1.1}$$

that is, if the entries of $T$ are invariant under shifts in the diagonal direction. Toeplitz matrices are easy to store, since such an $n \times n$ matrix is fully represented by the $2n - 1$ entries of its first and last columns (or, respectively, of its first and last rows). Multiplication of an $n \times n$ Toeplitz matrix by a vector can be reduced to three fast Fourier transforms (FFTs) (e.g., via its reduction to polynomial multiplication modulo $x^{2n-1}$ (see [BP94], p. 133)) and can be performed by using $O(n \log n)$ arithmetic operations. Hereafter, arithmetic operations, as well as comparisons of pairs of rational numbers and the rounding of a rational number to a closest integer, are referred to as ops.
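The reduction of a Toeplitz matrix-vector product to three FFTs can be sketched as follows. This is only an illustration under stated assumptions, not the procedure of [BP94]: the function names `fft` and `toeplitz_matvec` are hypothetical, the input is assumed to be integer, the transform is a plain recursive radix-2 FFT in floating point, and the final rounding to the closest integer is exact only while the entries are small enough for double precision.

```python
import cmath

def fft(a, invert=False):
    """Recursive radix-2 FFT; len(a) must be a power of two (no scaling)."""
    n = len(a)
    if n == 1:
        return list(a)
    even = fft(a[0::2], invert)
    odd = fft(a[1::2], invert)
    sign = 1 if invert else -1
    out = [0j] * n
    for k in range(n // 2):
        w = cmath.exp(sign * 2j * cmath.pi * k / n)
        out[k] = even[k] + w * odd[k]
        out[k + n // 2] = even[k] - w * odd[k]
    return out

def toeplitz_matvec(col, row, v):
    """Multiply the Toeplitz matrix with first column `col` and first row
    `row` (integer entries, col[0] == row[0]) by the integer vector `v`."""
    n = len(v)
    assert len(col) == n and len(row) == n and col[0] == row[0]
    # t[k] holds the entry on diagonal k - (n - 1), for k = 0, ..., 2n - 2.
    t = list(reversed(row[1:])) + list(col)
    size = 1
    while size < 2 * n - 1:  # cyclic length 2n - 1 keeps the needed coefficients clean
        size *= 2
    ft = fft(t + [0] * (size - len(t)))                       # first FFT
    fv = fft(list(v) + [0] * (size - n))                      # second FFT
    prod = fft([x * y for x, y in zip(ft, fv)], invert=True)  # third (inverse) FFT
    # Entry i of T v is the coefficient of x^(i + n - 1) in t(x) v(x).
    return [round((prod[i + n - 1] / size).real) for i in range(n)]
```

For instance, `toeplitz_matvec([1, 2, 3], [1, 4, 5], [1, 0, -2])` multiplies the matrix with rows (1, 4, 5), (2, 1, 4), (3, 2, 1) by the vector (1, 0, -2).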

Due to the structural properties of Toeplitz matrices, one may solve a nonsingular Toeplitz linear system of $n$ equations by using $O(n(\log n)^2)$ ops [BGY80], [Morf80], [BA80], [Mu81], [dH87], [AGr88], [K95]. (Note that we would need storage space $n^2 + n$ and $2n^2 - n$ ops to multiply a general matrix by a vector and on the order of $n^d$ ops with $d > 2$ to solve a general nonsingular linear system of $n$ equations.)

The above properties are extended to the class of $n \times n$ Toeplitz-like matrices, that is, ones represented in the form

$$T = \sum_{i=1}^{\ell} L_i U_i, \tag{1.2}$$

where $L_i$ and $U_i^T$ are $n \times n$ lower triangular Toeplitz matrices, $U_i^T$ is the transpose of $U_i$, and $\ell$ is bounded by a fixed constant, $\ell = O(1)$. (Note that any $n \times n$ matrix can be represented in the form (1.2) for $\ell \le n$.) It suffices to store the $2\ell n$ entries of the first columns of $L_i$ and $U_i^T$ for $i = 1, \ldots, \ell$ in order to represent $T$. These $2\ell$ columns form a pair of $n \times \ell$ matrices called a displacement generator of $T$ of length $\ell$.

Representation (1.2) for a Toeplitz-like matrix enables us to operate with the $O(n)$ entries of its displacement generator (rather than with its $n^2$ entries). Furthermore, we may immediately multiply a matrix $T$ of (1.2) by a vector by using $O(\ell n \log n)$ ops, and we may also solve a linear system $Tx = f$ in $O(\ell^2 n(\log n)^2)$ ops if $T$ is nonsingular [Morf80], [BA80], [Mu81].
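To make the representation (1.2) concrete, the following sketch multiplies $T$ by a vector using only the generator columns, never forming the $n^2$ entries of $T$. The names `G`, `H`, `generator_matvec`, and `dense_from_generator` are hypothetical: `G[i]` is assumed to hold the first column of $L_i$ and `H[i]` the first column of $U_i^T$ (that is, the first row of $U_i$). For readability the triangular Toeplitz products are written as quadratic loops; each of them is a convolution, which an FFT would perform in $O(n \log n)$ ops.

```python
def lower_toeplitz_matvec(first_col, v):
    """Product of the lower triangular Toeplitz matrix with the given first column by v."""
    n = len(v)
    return [sum(first_col[i - j] * v[j] for j in range(i + 1)) for i in range(n)]

def upper_toeplitz_matvec(first_row, v):
    """Product of the upper triangular Toeplitz matrix with the given first row by v."""
    n = len(v)
    return [sum(first_row[j - i] * v[j] for j in range(i, n)) for i in range(n)]

def generator_matvec(G, H, v):
    """Compute T v for T = sum_i L_i U_i, given only the generator columns."""
    result = [0] * len(v)
    for g, h in zip(G, H):
        w = lower_toeplitz_matvec(g, upper_toeplitz_matvec(h, v))
        result = [r + x for r, x in zip(result, w)]
    return result

def dense_from_generator(G, H):
    """Expand the generator into the dense matrix (only for checking small cases)."""
    n = len(G[0])
    T = [[0] * n for _ in range(n)]
    for g, h in zip(G, H):
        for i in range(n):
            for j in range(n):
                T[i][j] += sum(g[i - k] * h[j - k] for k in range(min(i, j) + 1))
    return T
```

As a usage example, a Toeplitz matrix with first column `col` and first row `row` has a generator of length 2: `G = [col, [1, 0, ..., 0]]` and `H = [[1, 0, ..., 0], [0] + row[1:]]`, since the first term reproduces the lower triangular part and the second term the strictly upper triangular part.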

We have $\ell \le 2$ in (1.2) for Toeplitz matrices, their inverses, and resultant matrices, and $\ell \le g + h$ for $g \times h$ block matrices with Toeplitz blocks. Furthermore, the transposition of a matrix leaves $\ell$ invariant, whereas $\ell$ may grow only slowly in multiplication and addition/subtraction of pairs of matrices and stays unchanged or grows only nominally in the inversion of a nonsingular matrix (see section 13).

1.2. NC and RNC solutions (some background). Due to their reduction to Toeplitz/Toeplitz-like linear systems, several computational problems listed in the previous section, including the gcd, lcm, and resultant computation and Padé approximation, can be solved by using $O(n(\log n)^{d_1})$ ops, where $n$ is the input size and $d_1 \le 3$. (We may need to allow $d_1 = 3$ in order to handle singular Toeplitz/Toeplitz-like linear systems; we reduce their solution to computing the rank of the coefficient matrix and to the subsequent solution of a nonsingular Toeplitz-like linear system.) Like the Euclidean algorithm, however, such solution algorithms require on the order of $n$ parallel steps.

The known alternative algorithms yield NC or randomized NC (RNC) solutions of all the cited computational problems [BGH82], [G84], [BP94], that is, yield their solution by using $t(n) = O((\log n)^c)$ time and $p(n) = O(n^d)$ arithmetic processors, for two fixed constants $c$ and $d$, under the customary exclusive read exclusive write random access machine (EREW PRAM) arithmetic model of parallel computing [KR90], [J92]. (Alternatively, we may define the NC and RNC solutions as the families of arithmetic, Boolean, or arithmetic-Boolean circuits for the above problems having depths $O((\log n)^c)$ and sizes $O(n^d)$ for two fixed constants $c$ and $d$ [G86].) Indeed, the NC/RNC solution of a linear system of $n$ equations can be computed over any field of constants [Cs76], [Be84], [Ch85], [KP91], [KP92]. These algorithms, however, leave open the important problem of processor efficiency of (R)NC Toeplitz and Toeplitz-like computations, that is, of having the ratios $p(n)/T_+(n)$ or even $p(n)/T_-(n)$ at the level $O((\log n)^{c_1})$ for a constant $c_1$, where $T_+(n)$ and $T_-(n)$ denote the record upper and lower bounds on the sequential time of the solution, respectively. Indeed, on the one hand, we have already cited the bound $T_+(n) = O(n(\log n)^3)$, and, clearly, $T_-(n) \ge n$. On the other hand, $p(n)$ has order $n^d$ for $d > 3$ in [Cs76], [Be84], [Ch85] and for $d > 2$ in [KP91], [KP92], for solving general linear systems in NC/RNC, whereas $p(n)$ has order $n^d$ for $d \ge 2$ for the known NC/RNC Toeplitz/Toeplitz-like solvers over any field of constants [P92], [KP94], [P96], [P96b]. To yield NC/RNC and processor efficiency, we must decrease $d$ to the optimal level 1. An approach toward this goal was outlined in [BP94, p. 357], incorporating various nontrivial techniques developed earlier for computations with general and dense structured matrices [Morf80], [BA80], [P85], [P87], [P92], [P92b], [P93], [P93a], and our objective in the present paper is to show in detail how this can be done, under certain assumptions on the model of computing.

1.3. The model of computing. Our main assumption is that the input consists of integers (for the reduction from a real input, one may use binary or decimal chopping followed by scaling) and that rounding a rational number to a closest integer, as well as an arithmetic operation or comparison of two rationals, are allowed as unit cost operations. The bit-precision of these computations will be bounded at the optimal level of the output precision, so that we achieve solution at a low Boolean cost.

Stating our estimates for the computational cost, we will let $O_A(t, p)$ and $O_B(t, p)$ denote the simultaneous bounds $O(t)$ on the parallel time and $O(p)$ on the number of arithmetic or Boolean processors, respectively. We will routinely decrease the processor bounds slightly, by exploiting the B-principle of parallel computing, which is a variant of Brent's principle and according to which $O(s)$ time-steps of a single arithmetic or Boolean processor may simulate a single time-step of $s$ arithmetic or, respectively, Boolean processors [KR90], [PP95]. According to the B-principle, the bound $O_A(t, kp)$ implies the bound $O_A(tk, p)$, and similarly $O_B(t, kp)$ implies the bound $O_B(tk, p)$ for a parameter $k \ge 1$. (For $p = 1$, we arrive at sequential time bounds $O_A(tk, 1)$ and $O_B(tk, 1)$.) By applying the well-known technique based on the B-principle, one may slow down the computations at the stages requiring too many processors. In many cases this increases the time bound only by a constant factor but more substantially decreases the processor bound; a celebrated example is the summation of $n$ values, where application of the B-principle decreases the asymptotic cost bound from $O_A(\log n, n)$ to $O_A(\log n, n/\log n)$ (cf. [Q94, pp. 44–46]; [BP94, pp. 297–298]). (The converse trade-off of time and processor bounds is not generally possible, but for almost all matrix computations that we consider and, more generally, for any task of the evaluation of a set of multivariate polynomials, one may always transform an NC/RNC algorithm into one using $O(\log^2 n)$ time and $O(n^d)$ arithmetic processors for some finite but generally quite large constant $d$ [VSBR83], [MRK88]. We will not use the latter result as we are concerned about processor bounds.)
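The summation example can be simulated sequentially to see where the saving comes from. The sketch below is an illustration only, with a hypothetical function name and a nonempty input assumed: each of about $n/\log_2 n$ simulated processors first adds up its own block of about $\log_2 n$ values one after another, and the partial sums are then combined by a binary tree, so that the number of time steps stays within a constant factor of $\log_2 n$.

```python
import math

def parallel_sum_steps(values):
    """Simulate summation of a nonempty list with about n / log2(n) processors.

    Returns (total, number of processors, number of parallel time steps).
    """
    n = len(values)
    block = max(1, math.ceil(math.log2(n))) if n > 1 else 1
    # Phase 1: every processor sums its own block sequentially.
    partial = [sum(values[i:i + block]) for i in range(0, n, block)]
    processors = len(partial)
    steps = block - 1  # additions performed one after another inside a block
    # Phase 2: binary-tree reduction of the partial sums, one level per step.
    while len(partial) > 1:
        partial = [sum(partial[i:i + 2]) for i in range(0, len(partial), 2)]
        steps += 1
    return partial[0], processors, steps
```

With `values = list(range(1024))`, the simulation uses 103 processors instead of the 512 needed by a plain binary tree over all the values.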

Remark 1.1. Our algorithms for Toeplitz/Toeplitz-like computations are essentially reduced to computing the convolutions (which can be performed via FFT, assuming that the $2^h$th roots of 1 are available) and the inner products of pairs of vectors. These basic operations (for vectors of a dimension $n$) can be performed at the cost $O_A(\log n, n)$ and $O_A(\log n, n/\log n)$, respectively, under both the EREW PRAM model and more realistic models such as hypercube, butterfly and shuffle-exchange processor arrays [Le92], [Q94]. Thus, it is possible to implement our algorithms efficiently assuming the latter models.

1.4. Our main results. The algorithms of this paper extend our previous work on parallel computations with general matrices [P85], [P87], [P93a], [BP94] (cf. also [PR91], [P92a], [P93b], [PR93]) by incorporating some techniques developed in [P90], [P92], [P92b], [P93], [P93a] for computations with Toeplitz and Toeplitz-like matrices. As a result, we arrive at RNC algorithms for the most fundamental computations with the latter classes of matrices filled with integers (such as the computation of their ranks, null-spaces and determinants and solving linear systems of equations). These algorithms yield optimal (up to polylogarithmic factors) time and processor bounds, which improves by a factor of $n$ the processor bounds of the known RNC algorithms. By using the known reduction to Toeplitz and Toeplitz-like computations, we also extend our results to yield similar nearly optimal upper bounds on the time and processor complexity (also achieving an improvement of the order of $n$ versus the known RNC algorithms) for many other related computations (e.g., the computation of polynomial gcd and lcm and Padé approximation), where the input values are integers.

We will follow the historical line, by first treating the case of general matrices and then improving the algorithms in the Toeplitz/Toeplitz-like case. We will start by recalling the record parallel complexity bounds of [P85], [P87] for computations with a general $n \times n$ input matrix; we will give their alternative derivation. Stating these bounds in Theorem 1.1 below, we will use the value $\omega$ satisfying $2 \le \omega < 2.376$ and such that a pair of $n \times n$ matrices can be multiplied at the arithmetic cost $O_A(\log n, n^{\omega})$. We note that the magnitudes of $\det A$ and the integer entries of $\operatorname{adj} A = A^{-1} \det A$ can be as large as $\|A\|^n$ or $\|A\|^{n-1}$, respectively, which means an output precision of the order of $n \log \|A\|$. We ensure that the precision of the computations by our algorithms does not exceed this level. Furthermore, we compute the rank of $A$ by using an even lower precision, which enables some decrease of the Boolean cost of the computation of the rank.

Technically, we will largely follow the cited outline, given by us in [BP94, chapter 4], and combine a variety of the known techniques, in particular ones developed in [P85], [P87], [P92b], [P93], [P93a], and some new ones (such as the combination of primal and dual recursive decompositions of an integer matrix with the objective to bound the magnitude of the intermediate and output values). The required combination of all these techniques is highly nontrivial and has never been presented in either complete or accessible form.

The main purpose of our paper is to give such a presentation or, formally speaking, to give complete and accessible proofs of the two theorems below. The first of them only handles the case of general integer input matrices (in this case our parallel complexity results repeat those of [P85], [P87], except for the presently improved Boolean cost of the rank computation), but we use a distinct alternative proof, which should be technically interesting in its own right and is fully used in our subsequent relatively simpler extension to the Toeplitz/Toeplitz-like case, handled by our second theorem.

Theorem 1.1. Let $A$ be a $k \times h$ matrix and let $f$ be an $h$-dimensional vector, both matrix and vector filled with integers that range from $-2^a$ to $2^a$ for some $a > 1$. Let $k + h = O(n)$. Then, with an error probability of at most $n^{-c}$ for a fixed positive constant $c$, one may compute $r$, the rank of $A$, at a randomized computational cost bounded by $O_A((\log n)\log(n \log a), n^{\omega})$ and by

$$O_B((\log n)(\log(n \log a))^2 \log\log(n \log a),\; n^{\omega+1}\log(na)).$$

Furthermore, one may compute the determinant of $A$ and, if $A$ is nonsingular, then also the inverse of $A$ and the solution to a linear system $Ax = f$, all of them at a randomized computational cost bounded by $O_A((\log n)\log(na), n^{\omega})$ and by

$$O_B((\log n)(\log(na))^2 \log\log(na),\; (\log n)(a + \log n)n^{\omega+1}/\log(na)).$$

If $A$ is an $n \times n$ singular matrix, the latter bounds also apply to the computation of $n - r$ basis vectors of a null-space of $A$ and a solution $x$ to a linear system $Ax = f$ provided that this system is consistent. The same cost bounds apply to testing correctness of the computed value of $r = \operatorname{rank} A$ as well as of all other output values. In these computations, $2n - 1$ random parameters are used for computing $\operatorname{rank} A$, $\det A$, and the null-space of $A$, and a single random parameter is used for all other tasks including the computation of $|\det A|$. The above complexity estimates do not cover the cost of generation of the random parameters.

In the case of a $k \times h$ Toeplitz or Toeplitz-like input matrix (defined in section 1.1 and also in sections 2 (Definition 2.18) and 13 (Definition 13.2)), an extension of our approach yields much smaller (by a factor of $n^{\omega-1}/\log n$) upper bounds on the processor complexity of the same computations (with no increase of the asymptotic time bounds).

Theorem 1.2. Under the assumptions of Theorem 1.1, let the input matrix $A$ be a Toeplitz matrix or a Toeplitz-like matrix. Then all the processor complexity estimates of Theorem 1.1 can be decreased by a factor of $n^{\omega-1}/\log n$, preserving the time bounds, to yield the randomized parallel complexity bounds $O_A((\log n)\log(n \log a), n \log n)$, $O_B((\log n)(\log(n \log a))^2 \log\log(n \log a), (n^2 \log n)\log(na))$, and $O_A((\log n)\log(na), n \log n)$, $O_B((\log n)(\log(na))^2 \log\log(na), (a + \log n)(n \log n)^2/\log(na))$, respectively. Here, the inverse of $A$ and the basis matrix for the null-space of $A$ are assumed to be output in the form of their displacement generators.

We refer the reader to Remark 12.2 on possible minor refinement of the estimates of both theorems.

Due to substantial economization of computational resources in our algorithms for Toeplitz/Toeplitz-like computations, they may become practically efficient provided that they are supported by subroutines for multiprecision parallel computations with integers and polynomials and by the development of the interface between algebraic and numerical computing, both required in our algorithms. Such a development is motivated by various potential benefits; our algorithms are but one of many examples. The practical implementation of our algorithms for general $n \times n$ matrices faces a harder problem of the storage of $n^2$ long integers in the computer memory (versus $2n - 1$ in the Toeplitz case), and this task becomes practically infeasible at some point as $n$ increases.

Our algorithms do not improve the known sequential algorithms for Toeplitz and Toeplitz-like computations [BGY80], [Morf80], [BA80], [Mu81], [dH87], which run in nearly optimal arithmetic time of $O(n \log^2 n)$, but some of our techniques may be of practical and theoretical interest for sequential computations too. In particular, our Toeplitz–Newton iteration techniques are effective for rapid practical improvement of approximate solution of Toeplitz and Toeplitz-like linear systems of equations [PBRZ99], and our study of the integral version of recursive decomposition as well as our bounds on the growth of the auxiliary integers (particularly, of the auxiliary determinants) is a natural but nontrivial extension of the Bareiss version of Gaussian elimination (cf. [B68]). Even our simple idea of the precision decrease in the randomized computation of the rank (by performing the computation modulo a fixed prime) leads to a substantial decrease of the sequential Boolean time bounds (for both general and Toeplitz/Toeplitz-like matrices). The latter trick also applies to the closely related problem of the computation of the degree of the gcd and lcm of two polynomials with integer coefficients.
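The precision decrease in the rank computation can be illustrated by a small sequential sketch. It is not the algorithm of this paper: it is plain Gaussian elimination over the field of integers modulo a prime $p$, and the function name is hypothetical. All intermediate values stay below $p$, so the precision is about $\log p$ bits regardless of the size of the input entries. The rank modulo $p$ never exceeds the rank over the rationals and may be smaller for an unlucky prime, which is why the prime has to be chosen at random. The modular inverse via `pow(x, -1, p)` assumes Python 3.8 or later.

```python
def rank_mod_p(A, p):
    """Rank of the integer matrix A over the integers modulo the prime p."""
    M = [[x % p for x in row] for row in A]
    rows = len(M)
    cols = len(M[0]) if M else 0
    rank = 0
    for c in range(cols):
        # Look for a pivot in column c among the rows not yet used.
        pivot = next((r for r in range(rank, rows) if M[r][c]), None)
        if pivot is None:
            continue
        M[rank], M[pivot] = M[pivot], M[rank]
        inv = pow(M[rank][c], -1, p)  # modular inverse of the pivot
        M[rank] = [(x * inv) % p for x in M[rank]]
        for r in range(rows):
            if r != rank and M[r][c]:
                f = M[r][c]
                M[r] = [(x - f * y) % p for x, y in zip(M[r], M[rank])]
        rank += 1
        if rank == rows:
            break
    return rank
```

For the diagonal matrix with entries 2 and 3, the call with $p = 5$ returns the true rank 2, whereas $p = 3$ is an unlucky prime and gives 1.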

1.5. Extensions. We already cited [BGY80], [G84], [P92], [P96b], and [BP94] on the reduction of the computation of the gcd, the lcm, and the resultant of two polynomials, as well as Padé approximation of a formal power series or of a polynomial, to the computation of the rank of a Toeplitz/Toeplitz-like matrix and solving a nonsingular Toeplitz/Toeplitz-like linear system of equations. The integrality of the input can be preserved in this reduction, and the input size may grow by a factor of at most 2. Therefore, the computational complexity estimates of Theorem 1.2 are immediately extended to the listed problems of the gcd, lcm, resultant and Padé computations (assuming the restrictions on the size and integrality of the input), as well as to various computational problems reducible to the latter ones.
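One classical instance of such a reduction, given here only as an illustration and not necessarily in the form used in the cited papers, expresses the degree of the gcd of two polynomials through the rank of their Sylvester (resultant) matrix: $\deg \gcd(u, v) = \deg u + \deg v - \operatorname{rank} S(u, v)$. The matrix $S(u, v)$ consists of two blocks of shifted coefficient rows and has integer entries whenever $u$ and $v$ do. The function names below are hypothetical, the coefficient lists are assumed to start with a nonzero leading coefficient, and the rank is computed by exact rational elimination purely for checking small cases.

```python
from fractions import Fraction

def sylvester(u, v):
    """Sylvester matrix of u and v (coefficient lists, highest degree first)."""
    m, n = len(u) - 1, len(v) - 1
    rows = [[0] * i + list(u) + [0] * (n - 1 - i) for i in range(n)]
    rows += [[0] * i + list(v) + [0] * (m - 1 - i) for i in range(m)]
    return rows

def rank_rational(A):
    """Rank over the rationals by exact Gaussian elimination."""
    M = [[Fraction(x) for x in row] for row in A]
    cols = len(M[0]) if M else 0
    rank = 0
    for c in range(cols):
        pivot = next((r for r in range(rank, len(M)) if M[r][c] != 0), None)
        if pivot is None:
            continue
        M[rank], M[pivot] = M[pivot], M[rank]
        for r in range(rank + 1, len(M)):
            f = M[r][c] / M[rank][c]
            M[r] = [x - f * y for x, y in zip(M[r], M[rank])]
        rank += 1
    return rank

def gcd_degree(u, v):
    """Degree of gcd(u, v), read off from the rank of the Sylvester matrix."""
    m, n = len(u) - 1, len(v) - 1
    return m + n - rank_rational(sylvester(u, v))
```

For $u = (x-1)(x-2)$ and $v = (x-1)(x+3)$, that is, `gcd_degree([1, -3, 2], [1, 2, -3])`, the Sylvester matrix is $4 \times 4$ of rank 3, and the gcd $x - 1$ has degree 1.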

Furthermore, we refer the reader to [P90] on the general techniques that immediately enable extension of our results of Theorem 1.2 to computations with Cauchy-like and Vandermonde-like input matrices, to [BP93], [BP94] on the extension to the case of matrices represented as the sums of Hankel-like and Toeplitz-like matrices, and to [BGY80], [BP94], [P96b], [PSLT93], [PZHY97], and other references cited in the beginning of this paper on various applications of the computations with Toeplitz-like and other structured matrices (see also Remark 14.1).

Among possible extensions of Theorem 1.1, consider the case where the integer matrix $A$ is symmetric positive definite, sparse, and associated with an $s(n)$-separatable graph given with its $s(n)$-separator family (cf. [LRT79], [P93b], [PR93]). If such a matrix $A$ is well conditioned (even if its entries are not integers but arbitrary real numbers), then, at the arithmetic cost $O_A((\log n)^3, (s(n))^{\omega}/\log n)$, the parallel algorithm of [P93b], [PR93] numerically computes both the recursive factorization of such a matrix and its determinant, as well as a solution $x = A^{-1}f$ to a linear system $Ax = f$ (if $\det A \ne 0$). Numerical approximation is involved in this algorithm at the auxiliary stages of matrix inversions, where a parallel algorithm of [PR89] is applied. If $A$ is filled with integers, then this stage can be performed exactly, by using the algorithms of [P85], [P87]. Then, the exact recursive factorization of $A$, $\det A$, and $A^{-1}f$ can be computed at the arithmetic cost $O_A((\log n)^3, (s(n))^{\omega} + n)$. By employing the algorithm of this paper for recursive decomposition and inversion of a general integer matrix, one may improve the latter bounds a little, to yield $O_A((\log n)^2, (s(n))^{\omega} + n)$.

The results of Theorems 1.1 and 1.2 can be further extended to various other matrix computations by using the known reduction techniques of [BP94], [P96], [P96b]. For demonstration, consider the computation of the characteristic polynomial $c_A(x) = \det(xI - A)$ of the above sparse $n \times n$ matrix $A$. Such a polynomial has degree $n$. We may first concurrently compute $c_A(x)$ at $n + 1$ distinct points $x_0, \ldots, x_n$ and then obtain its coefficients by interpolation. If the chosen values of $x_i$ are larger than $n\|A\|$, then the matrices $x_iI - A$ are positive definite, and we may compute $c_A(x_i)$ for $i = 0, 1, \ldots, n$, at the overall computational cost $O_A((\log n)^2, ((s(n))^{\omega} + n)n)$. These bounds dominate the cost of the subsequent interpolation producing the polynomial $\det(xI - A)$.
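The evaluation-interpolation scheme just described can be sketched sequentially as follows. This is an illustration under assumptions, not the parallel algorithm: the function names are hypothetical, $\|A\|$ is taken to be the largest absolute value of an entry (the norm intended in the text may differ), the determinants are computed by exact rational elimination, and the coefficients are recovered by Lagrange interpolation. The $n + 1$ evaluations are independent of each other, which is what allows them to be performed concurrently.

```python
from fractions import Fraction

def det(M):
    """Determinant by exact rational Gaussian elimination."""
    M = [[Fraction(x) for x in row] for row in M]
    n = len(M)
    d = Fraction(1)
    for c in range(n):
        pivot = next((r for r in range(c, n) if M[r][c] != 0), None)
        if pivot is None:
            return Fraction(0)
        if pivot != c:
            M[c], M[pivot] = M[pivot], M[c]
            d = -d
        d *= M[c][c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            M[r] = [x - f * y for x, y in zip(M[r], M[c])]
    return d

def char_poly(A):
    """Coefficients of det(xI - A), lowest degree first, for a nonempty square A."""
    n = len(A)
    bound = n * max(abs(x) for row in A for x in row)  # assumed reading of n * ||A||
    xs = [bound + 1 + k for k in range(n + 1)]         # n + 1 distinct points above the bound
    ys = [det([[(x if i == j else 0) - A[i][j] for j in range(n)] for i in range(n)])
          for x in xs]
    coeffs = [Fraction(0)] * (n + 1)
    for k, (xk, yk) in enumerate(zip(xs, ys)):
        basis, denom = [Fraction(1)], Fraction(1)
        for m, xm in enumerate(xs):
            if m == k:
                continue
            basis = [Fraction(0)] + basis      # multiply the basis polynomial by x ...
            for i in range(len(basis) - 1):
                basis[i] -= xm * basis[i + 1]  # ... and subtract xm times the old polynomial
            denom *= xk - xm
        for i in range(n + 1):
            coeffs[i] += yk * basis[i] / denom
    return coeffs
```

For the symmetric matrix with rows (2, 1) and (1, 2), the points are 5, 6, 7, the values are 8, 15, 24, and the interpolation returns the coefficients of $x^2 - 4x + 3$.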

As another example, the algorithms of [BP94, p. 357] for Padé approximation and polynomial gcd have been used in [P95] and [P96a] in order to obtain the record parallel arithmetic complexity estimates for approximating polynomial zeros. Our present improvement of these results of [BP94] in Theorem 1.2 immediately implies the respective minor improvement of the results of [P95] and [P96a].

Corollary 1.3. Given a positive $b$ and the coefficients of an $n$th degree monic polynomial with zeros $z_1, \ldots, z_n$ satisfying $\max_i |z_i| \le 1$, one may compute approximations $z_1^*, \ldots, z_n^*$ to $z_1, \ldots, z_n$ satisfying $|z_i^* - z_i| < 2^{-b}$, $i = 1, \ldots, n$; the computation is randomized; its arithmetic cost is bounded by $O_A((\log n)^3((\log n)^2 + \log(b + 2)), n/\log n)$.

1.6. Outline of the method. A major ingredient of our approach is the variable diagonal method of [P85], [P87], which combines several algebraic and numerical techniques to yield effective parallel inversion of a matrix $A$ filled with integers. The method includes Newton's iteration, which effectively solves the latter problem provided that a good initial approximation to $A^{-1}$ is available. Such an approximation is not available, however, for a general integer matrix $A$. The recipe of [P85], [P87] is to invert at first the auxiliary matrix $F = V - apI$, where $V = A \bmod p$, $I$ is the identity matrix of an appropriate size, $p$ is a prime, $p \ge n$, and $a$ is a sufficiently large integer. (We follow this recipe and show that it suffices to choose $a = 10pn^2$ in our case.) Then the matrix $-I/(ap)$ is a good initial approximation to $F^{-1}$, which we rapidly improve by Newton's iteration, until $F^{-1}$ is approximated closely enough. Since $F$ is an integer matrix, $\det F$ and the entries of $\operatorname{adj} F = (\det F)F^{-1}$ are integers, which can be recovered by rounding their approximations within absolute errors less than 1/2. This gives us $(\det A) \bmod p$, $(\operatorname{adj} A) \bmod p$, and $A^{-1} \bmod p$. Then the algebraic technique of $p$-adic (Newton–Hensel's) lifting is applied. In $\ell$ steps, for a sufficiently large $\ell$, the matrix $A^{-1} \bmod p$ is lifted to $A^{-1} \bmod p^L$, $L = 2^{\ell}$. Then the lifting of $A^{-1} \bmod p$ is extended to lifting similarly $(\det A) \bmod p$ and $(\operatorname{adj} A) \bmod p$. Finally, $\det A$ and $\operatorname{adj} A$ are easily recovered from $(\det A) \bmod p^L$ and $(\operatorname{adj} A) \bmod p^L$.
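The numerical stage of this recipe can be tried out on a small example. The sketch below is only an illustration with hypothetical function names: it runs the classical Newton step $X \leftarrow X(2I - FX)$ from the initial approximation $-I/(ap)$ in exact rational arithmetic (the algorithms of the paper work with bounded precision instead), and it takes $\det F$ as given, whereas the paper obtains it from the recursive decomposition. Since $I - FX_0 = V/(ap)$ and each step squares the residual $I - FX$, a few steps already bring the error far below 1/2, so rounding $(\det F)X$ recovers the integer matrix $\operatorname{adj} F$.

```python
from fractions import Fraction

def matmul(X, Y):
    """Product of two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def newton_inverse(F, a, p, steps):
    """Approximate the inverse of F = V - a*p*I, starting from -I/(a*p)."""
    n = len(F)
    X = [[Fraction(-1, a * p) if i == j else Fraction(0) for j in range(n)]
         for i in range(n)]
    for _ in range(steps):
        FX = matmul(F, X)
        correction = [[(2 if i == j else 0) - FX[i][j] for j in range(n)]
                      for i in range(n)]
        X = matmul(X, correction)  # Newton step: X <- X (2I - F X)
    return X

def residual_norm(F, X):
    """Maximum absolute row sum of I - F X."""
    n = len(F)
    FX = matmul(F, X)
    return max(sum(abs((1 if i == j else 0) - FX[i][j]) for j in range(n))
               for i in range(n))
```

For example, for $A$ with rows (7, 3) and (2, 5), $p = 5$, and $a = 10pn^2 = 200$, we have $V$ with rows (2, 3) and (2, 0) and $F$ with rows (-998, 3) and (2, -1000). The residual norm starts at 0.005 and is squared at every step; after three steps, rounding the entries of $997994 \cdot X$ (where $997994 = \det F$) gives the rows (-1000, -3) and (-2, -998) of $\operatorname{adj} F$.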

The remaining ingredient is the approximation of $\det F$. In [P85], [P87], this is achieved as a by-product of solving the more general task of computing $\det(xI - F)$, the characteristic polynomial of $F$. In the present paper, we employ a more routine approach, based on the computation of recursive (block) decomposition (RD) of $F$ (cf. [St69], [Morf74], [Morf80], [BA80], [P87]) or, equivalently, on the computation of nested Schur's complements, also called Gauss transforms (cf. [C74], [F64]). A single recursive step of this approach is the decomposition of the input matrix (represented as a $2 \times 2$ block matrix) into the product of a block diagonal matrix and two block triangular matrices (see (2.3)). Such a decomposition can be obtained by block Gauss–Jordan elimination and can be reduced to a few matrix multiplications (their parallel implementation is simple) and inversions (they are made simple by Newton's iteration, since good initial approximations are given by matrices $-I/(ap)$). As in [P85], [P87], random choice of a large prime $p$ in a fixed large interval enables us to avoid degeneration and singularities (with a high probability).
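For orientation, a single step of such a decomposition can be written in its textbook form as follows. The block names $B$, $C$, $E$, $G$ and the symbol $S$ are our assumptions for this illustration and need not match the notation of (2.3); the leading block $B$ is assumed to be nonsingular.

```latex
\[
F =
\begin{pmatrix} B & C \\ E & G \end{pmatrix}
=
\begin{pmatrix} I & 0 \\ E B^{-1} & I \end{pmatrix}
\begin{pmatrix} B & 0 \\ 0 & S \end{pmatrix}
\begin{pmatrix} I & B^{-1} C \\ 0 & I \end{pmatrix},
\qquad
S = G - E B^{-1} C,
\]
\[
\det F = (\det B)(\det S).
\]
```

Here $S$ is the Schur complement of $B$ in $F$. Applying the same step recursively to $B$ and $S$ expresses $\det F$ as a product of determinants of smaller blocks, and every step requires only matrix multiplications and the inversion of a leading block.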

An important point, as in [P85], [P87], is that in spite of computing all the matrix inverses approximately, we finally recover them exactly (as well as all the other matrices involved in the RDs) by exploiting the representation of their entries as the ratios of integers. To emphasize this point, we call the resulting RDs the integral RDs (IRDs). The only remaining nontrivial problem in the computation of the IRD of $F$ and $\det F$ is to bound the magnitudes of the integers involved. In the present paper, this problem is solved based on the computation of the dual RD, that is, the RD of $F^{-1}$. (The celebrated techniques of [B68] do not suffice, and their extension to our case is nontrivial not just because we deal with recursive block decomposition, rather than with the more customary Gaussian elimination, but also because we need to control the magnitudes of the determinants of the matrices involved in the RD, which is a much harder problem, and we use the dual RD in order to solve it.)

As soon as the IRDs of $F$ and $F^{-1}$ are available, we obtain the RDs modulo $p$ of $A$ and $A^{-1}$. Now, we apply the techniques of $p$-adic (Newton–Hensel's) lifting not only to $A^{-1} \bmod p$ but to the entire RDs modulo $p$ of $A$ and $A^{-1}$, in order to obtain the RDs modulo $p^L$ of $A$ and $A^{-1}$ for $L = 2^{\ell}$ and a sufficiently large $\ell$. Then $(\det A) \bmod p^L$ is recovered from such an RD of $A$. As $\det A$ is an integer, we easily recover $\det A$ and then the matrix $\operatorname{adj} A = A^{-1} \det A$, whose entries are integers.
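The lifting of the inverse alone (not of the entire RD) can be sketched as follows. This is an illustration with hypothetical function names: the starting inverse modulo $p$ is obtained here by Gauss–Jordan elimination rather than by the numerical stage described above, each lifting step is the classical update $X \leftarrow X(2I - AX)$ performed modulo the squared modulus, and $\det A$ is taken as given when the integer matrix $\operatorname{adj} A$ is recovered. The modular inverse via `pow(x, -1, p)` assumes Python 3.8 or later.

```python
def matmul_mod(X, Y, q):
    """Matrix product with all entries reduced modulo q."""
    return [[sum(x * y for x, y in zip(row, col)) % q for col in zip(*Y)] for row in X]

def inverse_mod_p(A, p):
    """Inverse of A modulo the prime p by Gauss-Jordan elimination."""
    n = len(A)
    M = [[x % p for x in row] + [int(i == j) for j in range(n)]
         for i, row in enumerate(A)]
    for c in range(n):
        pivot = next((r for r in range(c, n) if M[r][c]), None)
        if pivot is None:
            raise ValueError("matrix is singular modulo p")
        M[c], M[pivot] = M[pivot], M[c]
        inv = pow(M[c][c], -1, p)
        M[c] = [(x * inv) % p for x in M[c]]
        for r in range(n):
            if r != c and M[r][c]:
                f = M[r][c]
                M[r] = [(x - f * y) % p for x, y in zip(M[r], M[c])]
    return [row[n:] for row in M]

def hensel_lift(A, X, p, steps):
    """Lift X = A^{-1} mod p to A^{-1} mod p^(2^steps); returns (X, modulus)."""
    n = len(A)
    q = p
    for _ in range(steps):
        q *= q  # the modulus is squared at every step
        AX = matmul_mod(A, X, q)
        correction = [[((2 if i == j else 0) - AX[i][j]) % q for j in range(n)]
                      for i in range(n)]
        X = matmul_mod(X, correction, q)
    return X, q

def symmetric_residue(x, q):
    """Representative of x modulo q of the smallest absolute value."""
    r = x % q
    return r - q if r > q // 2 else r
```

For $A$ with rows (7, 3) and (2, 5), $p = 5$, and three lifting steps, the modulus becomes $5^8$. Multiplying the lifted inverse by $\det A = 29$ and taking symmetric residues yields the rows (5, -3) and (-2, 7) of $\operatorname{adj} A$, because these entries are smaller than half the modulus in absolute value.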

This approach only gives us an alternative derivation of the estimates of [P85], [P87] for the parallel complexity of some fundamental computations with general matrices. The new algorithms, however (unlike those of [P85], [P87]), have the advantage of allowing their effective extension to the Toeplitz/Toeplitz-like cases. Indeed, operating with displacement generators, rather than with the matrices themselves, enables the decrease of the processor complexity of the RNC algorithms outlined above to the optimal level, linear in $n$.

The nontrivial problem in such a Toeplitz/Toeplitz-like extension is the control of the length of the displacement generators in the process of Newton's iteration. (Uncontrolled growth of the length would immediately imply the growth of the processor bounds by a factor of $n^{\omega-1}/\log n$ for $\omega > 2.375$.) We solve this problem by applying two techniques of truncation of generators (TG), which we borrow from [P92] and [P92b], [P93], [P93a], respectively.

The above outline was essentially given by us in [BP94, chapter 4]. Presently, we also add the technique of stream contraction specified in section 10 (essentially, the pipelining of the two processes of RD and Newton's iteration), borrowed from [PR91]. Stream contraction enables additional acceleration of our algorithms by factor $\log n$. (Using the technique of stream contraction for the acceleration of Toeplitz-like computations was also proposed in [R95], though the algorithms of [R95] did not give any improvement of the processor bounds in the Toeplitz/Toeplitz-like case versus the much larger bounds known in the case of general input matrices (see our Remarks 6.1, 11.1, and 14.2 and our similar comments in [P96b]).)

1.7. Organization of the paper. The order of our presentation will slightly differ from the one outlined above. After some preliminaries in section 2, we will introduce the RD and extended RD (ERD) of a matrix in section 3. In section 4, we define the IRD and show the transition from RD to IRD. We recall an algorithm for approximate matrix inversion via Newton's iteration in section 5 and apply it in order to approximate the RD and the IRD of an integer matrix in section 6. We estimate the errors and parallel complexity of these computations in sections 7–9. We apply pipelining (stream contraction) to achieve acceleration by factor $\log n$ in section 10, extend the results of sections 6 and 10 to computing the ERD modulo a fixed prime in section 11, and use $p$-adic lifting to recover (from the ERD) the inverse, the determinant, the rank, and the null-space of an integer matrix (thus proving Theorem 1.1) in section 12. In section 13, we recall some known definitions and properties for computations with Toeplitz and Toeplitz-like matrices. In section 14, we apply these properties to improve the results of section 12 in the Toeplitz and Toeplitz-like cases (thus proving Theorem 1.2). Section 15 is left for a brief discussion.

2. Some definitions and auxiliary results for matrix computations. We will next recall some customary definitions and well-known basic properties of general matrices.

Definition 2.1 (matrix notation). $I$ and $0$ denote the identity and null matrices of appropriate sizes. $W^T$ is the transpose of a matrix or vector $W$. $\operatorname{diag}(w_i)_{i=0}^{n-1} = \operatorname{diag}(w_0, \ldots, w_{n-1})$ is the diagonal matrix whose diagonal is filled with $w_0, \ldots, w_{n-1}$; $D(W) = \operatorname{diag}(W) = \operatorname{diag}(w_{i,i})_{i=0}^{n-1}$ for a matrix $W = (w_{i,j})$. $\operatorname{rank} W$ is the rank of $W$. $\det W$ is the determinant of a square matrix $W$. $\operatorname{adj} W$ is the adjoint (adjugate) matrix of $W$, equal to $W^{-1} \det W$ for a nonsingular matrix $W$.

In our error analysis, we will use the customary vector and matrix norms [GL89/96].

Definition 2.2 (vector and matrix norms). $\|v\| = \|v\|_1 = \sum_i |v_i|$, $\|v\|_2 = (\sum_i v_i^2)^{1/2}$ for a real vector $v = (v_i)$. $\|W\|_g = \max_{\|v\|_g = 1} \|Wv\|_g$, $g = 1, 2$; $\|W\| = \|W\|_1$ for a matrix $W$.

Proposition 2.3 (norm bounds). $\|W\| = \|W\|_1 = \max_j \sum_i |w_{i,j}|$ for a matrix $W = (w_{i,j})$. Furthermore, if $W$ is a $k \times k$ matrix and $V$ is its submatrix, then

\[ \|V\|_g \le \|W\|_g, \quad g = 1, 2; \]

\[ \|W\|/k^{1/2} \le \|W\|_2 \le \|W\| k^{1/2}. \]

Proof. To prove the bound $\|V\|_g \le \|W\|_g$, note that $\|Ww\|_g \ge \|Vv\|_g$ if $v$ is a subvector of $w$ and if $w$ has zero components corresponding to the columns of $W$ that are not in $V$. The other claimed relations can be found in [GL89/96].

We will also use the following known fact (cf. [GL89/96] or [BP94]).

Proposition 2.4 (bounds on the determinant and the entries of the adjoint matrix). Let $W$ be a $k \times k$ matrix. Then $|\det W| \le (\|W\|_g)^k$, and furthermore, $|v| \le (\|W\|_g)^{k-1}$ for every entry $v$ of $\operatorname{adj} W$, where $g = 1, 2$.

Definition 2.5 (column-diagonally dominant (c.-d.d.) matrices). $d(W) = \|W D^{-1}(W) - I\|$. A matrix $W$ is column-diagonally dominant (hereafter, we will use the abbreviation c.-d.d.) if $d(W) < 1$.

Definition 2.6 (leading principal submatrix (l.p.s.) and its Schur complement). For a $k \times k$ matrix $W$, let $W^{(q)}$ denote its $q \times q$ northwestern submatrix, or l.p.s., formed by the intersection of the first $q$ rows and the first $q$ columns of $W$, $q = 1, 2, \ldots, k$. If $B$ is a nonsingular l.p.s. of $W$ and if

\[ W = \begin{pmatrix} B & C \\ E & G \end{pmatrix}, \tag{2.1} \]

then the matrix

\[ S = S(W, B) = G - EB^{-1}C \tag{2.2} \]

is called the Schur complement of $B$ in $W$.

The Schur complement $S$ of (2.2) can be obtained by Gaussian or block Gaussian elimination applied to the matrix $W$, provided that the elimination process can be carried out (cf. [GL89/96, P3.2.2, p. 103]). In particular, it is easily verified that for $k > q$ the Schur complement of $B = W^{(q)}$ in a $k \times k$ matrix $W$ is the $(k - q) \times (k - q)$ matrix obtained from $W$ in $q$ steps of Gaussian elimination (without pivoting), provided that these steps can be carried out (with no division by 0). The latter assumption holds, in particular, if $W$ is a c.-d.d. matrix.

By applying block Gauss–Jordan elimination to the $2 \times 2$ block matrix $W$ of (2.1), with a nonsingular block $B$, we obtain the following decomposition, which will be fundamental for our study:

\[ W = \begin{pmatrix} I & 0 \\ EB^{-1} & I \end{pmatrix} \begin{pmatrix} B & 0 \\ 0 & S \end{pmatrix} \begin{pmatrix} I & B^{-1}C \\ 0 & I \end{pmatrix}. \tag{2.3} \]

If a matrix $W$ is nonsingular, then (2.3) implies that the matrix $S$ is also nonsingular. By inverting the matrices on both sides of (2.3), we obtain that

\[ W^{-1} = \begin{pmatrix} I & -B^{-1}C \\ 0 & I \end{pmatrix} \begin{pmatrix} B^{-1} & 0 \\ 0 & S^{-1} \end{pmatrix} \begin{pmatrix} I & 0 \\ -EB^{-1} & I \end{pmatrix}. \tag{2.4} \]

Equation (2.4) immediately implies the following proposition.

Proposition 2.7. Under (2.1)–(2.4), the matrix $S^{-1}$ is the trailing principal (that is, southeastern) submatrix of $W^{-1}$.
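The decomposition (2.3) and Proposition 2.7 can be checked in exact rational arithmetic on a small example. The sketch below is only an illustration: the helper names (`matmul`, `inverse`, `assemble`) and the $4 \times 4$ test matrix are assumptions made for this example, and the inversion routine is plain Gauss–Jordan elimination rather than any algorithm of this paper.

```python
from fractions import Fraction

def matmul(A, B):
    """Product of two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def inverse(A):
    """Exact Gauss-Jordan inverse over the rationals (A must be nonsingular)."""
    n = len(A)
    M = [[Fraction(x) for x in row] + [Fraction(int(i == j)) for j in range(n)]
         for i, row in enumerate(A)]
    for c in range(n):
        piv = next(r for r in range(c, n) if M[r][c] != 0)
        M[c], M[piv] = M[piv], M[c]
        M[c] = [x / M[c][c] for x in M[c]]
        for r in range(n):
            if r != c:
                f = M[r][c]
                M[r] = [x - f * y for x, y in zip(M[r], M[c])]
    return [row[n:] for row in M]

def eye(k):
    return [[Fraction(int(i == j)) for j in range(k)] for i in range(k)]

def zeros(r, c):
    return [[Fraction(0)] * c for _ in range(r)]

def assemble(TL, TR, BL, BR):
    """Build a 2 x 2 block matrix from its four blocks."""
    return [a + b for a, b in zip(TL, TR)] + [a + b for a, b in zip(BL, BR)]

W = [[4, 1, 2, 0], [1, 5, 0, 1], [2, 0, 6, 1], [0, 1, 1, 7]]
q = 2
B = [r[:q] for r in W[:q]]; C = [r[q:] for r in W[:q]]
E = [r[:q] for r in W[q:]]; G = [r[q:] for r in W[q:]]
B_inv = inverse(B)
EB = matmul(E, B_inv)
# Schur complement S = G - E B^{-1} C, as in (2.2)
S = [[g - x for g, x in zip(gr, xr)] for gr, xr in zip(G, matmul(EB, C))]

# The three factors of (2.3) and their product
L = assemble(eye(q), zeros(q, q), EB, eye(q))
D = assemble(B, zeros(q, q), zeros(q, q), S)
U = assemble(eye(q), matmul(B_inv, C), zeros(q, q), eye(q))
product = matmul(matmul(L, D), U)

W_inv = inverse(W)
trailing_block = [r[q:] for r in W_inv[q:]]  # southeastern block of W^{-1}
```

Here `product` reproduces `W`, and `trailing_block` coincides with `inverse(S)`, in agreement with Proposition 2.7.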

Our algorithms will rely on the decompositions of (2.3), (2.4), recursively applied to the matrices $B$, $B^{-1}$, $S$, and $S^{-1}$, which we will call RDs. The next definitions and results will cover the nonsingularity properties required for the existence of such a recursive extension of (2.3), (2.4) and some other relevant properties of l.p.s.'s and Schur complements (the s.p.d. matrices will be used only at the very end of section 12).

Definition 2.8 (strongly nonsingular matrices). A matrix $W$ is strongly nonsingular if all its leading principal submatrices are nonsingular.

Definition 2.9 (symmetric positive definite (s.p.d.) matrices). A real matrix $M$ is s.p.d. if it can be represented as the product $AA^T$ for a nonsingular matrix $A$.

The next proposition (cf. [BP94, exercise 4c, p. 212]) extends strong nonsingularity and the s.p.d. property to an l.p.s. and a Schur complement.

Proposition 2.10. If a matrix $W$ of (2.1) is strongly nonsingular (respectively, if $W$ is s.p.d.), then so are all its l.p.s.'s, including the matrix $B$ of (2.1), and the Schur complement $S$ of $B$, defined by (2.2).

Corollary 2.11. Any s.p.d. matrix is strongly nonsingular.

By recursively applying block Gaussian elimination, first to the block matrix $W$ of (2.1) with $B = W^{(r)}$ and then to $W^{(r)}$, with the l.p. (that is, leading principal or northwestern) block $W^{(q)}$, $q < r$, we obtain the following.

Proposition 2.12 (transitivity of Schur's complementation). If $r > q$ and if $W^{(r)}$ and $W^{(q)}$ are nonsingular matrices, then $S(W, W^{(r)}) = S(S(W, W^{(q)}), S(W, W^{(q)})^{(r-q)})$.

We also easily deduce the following.

Proposition 2.13 (transitivity of the c.-d.d. property). If $d(W) < 1$ for a matrix $W$ of (2.1), then $B$, $S$, and $W$ are nonsingular matrices, $d(B) \le d(W) < 1$, $d(S) \le d(W) < 1$, and (2.3)–(2.4) hold.

The following result, together with Propositions 2.4 and 2.12, will be basic for our bounds on the values involved in recursive decompositions.

Proposition 2.14. Assuming (2.1)–(2.2), every entry of the matrix $S \det B$ is a subdeterminant (that is, the determinant of a submatrix) of the matrix $W$.

Proof. Let $B = W^{(q)}$. Consider the l.p. (that is, northwestern) entry $s_{0,0}$ of $S$. By Proposition 2.12, it is the Schur complement of $B$ in the submatrix $W^{(q+1)}$ of $W$. By Proposition 2.7, $s_{0,0}^{-1} = \det B/\det W^{(q+1)}$. Therefore, $s_{0,0} \det B = \det W^{(q+1)}$, which proves the proposition for $s_{0,0}$. To extend this result to any entry $s_{i,j}$ of $S$, interchange the $i$th and $0$th rows and the $j$th and $0$th columns of $S$ and the respective pairs of rows and columns of $W$.

Clearly, the matrix $\operatorname{adj} B = B^{-1} \det B$ is filled with integers if $W$ is. Proposition 2.14 (or, alternatively, (2.2)) implies the similar property of the matrix $S \det B$. We summarize these observations for future reference.

Proposition 2.15 (integrality of adjoints and scaled Schur complements). If a matrix $W$ of (2.1) is filled with integers, then so are the matrices $\operatorname{adj} B = B^{-1} \det B$ and $S \det B$.
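As a small worked illustration of Propositions 2.14 and 2.15 (the $3 \times 3$ matrix below is chosen here only as an example), let $q = 1$ and

\[ W = \begin{pmatrix} 2 & 1 & 1 \\ 1 & 3 & 1 \\ 1 & 1 & 4 \end{pmatrix}, \quad B = (2), \quad S = \begin{pmatrix} 3 & 1 \\ 1 & 4 \end{pmatrix} - \frac{1}{2} \begin{pmatrix} 1 \\ 1 \end{pmatrix} \begin{pmatrix} 1 & 1 \end{pmatrix} = \begin{pmatrix} 5/2 & 1/2 \\ 1/2 & 7/2 \end{pmatrix}. \]

Then

\[ S \det B = \begin{pmatrix} 5 & 1 \\ 1 & 7 \end{pmatrix}, \]

and every entry is the determinant of a $2 \times 2$ submatrix of $W$ that contains the entry $w_{0,0}$:

\[ 5 = \det \begin{pmatrix} 2 & 1 \\ 1 & 3 \end{pmatrix}, \quad 1 = \det \begin{pmatrix} 2 & 1 \\ 1 & 1 \end{pmatrix}, \quad 7 = \det \begin{pmatrix} 2 & 1 \\ 1 & 4 \end{pmatrix}. \]

Although $S$ itself has fractional entries, the scaled matrix $S \det B$ is filled with integers.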

Recursive application of the next result enables us to recover $\det W$ from a recursive decomposition of $W$.

Proposition 2.16 (factorization of the determinants and submatrices implied by matrix decomposition). Matrix equation (2.3) implies that $\det W = (\det B) \det S$ and, furthermore, that

\[ W^{(t)} = \begin{pmatrix} I & 0 \\ EB^{-1} & I \end{pmatrix}^{(t)} \begin{pmatrix} B & 0 \\ 0 & S \end{pmatrix}^{(t)} \begin{pmatrix} I & B^{-1}C \\ 0 & I \end{pmatrix}^{(t)} \quad \text{for } t = 1, \ldots, k, \]

and $\det W^{(t)} = (\det B) \det S^{(t-q)}$ for $t = q + 1, \ldots, k$, provided that $W$ is a $k \times k$ matrix.

In section 12, we will also use the following definitions and known results.

Definition 2.17 (the null-space of a matrix; cf. [GL89/96] or [BP94]). The null-space $N(A)$ of a matrix $A$ is the linear space formed by all vectors $x$ satisfying the vector equation $Ax = 0$.

Fact 2.1. Two vector equations, $Ax = f$ and $Ay = f$, together imply that $x - y \in N(A)$, or, equivalently, any solution $x$ to a consistent linear system $Ax = f$ can be represented in the form $x = x_0 + z$, where $x_0$ is a fixed specific solution and $z \in N(A)$.

Toeplitz matrix computations will be studied in sections 13 and 14, but the proposition below also involves Toeplitz matrices and is needed in section 12.

Definition 2.18 (Toeplitz matrices). $T = (t_{i,j})$ is a $k \times k$ Toeplitz matrix if $t_{i+1,j+1} = t_{i,j}$ for $i, j = 0, 1, \ldots, k - 2$ (cf. (1.1)), that is, if the entries of $T$ are invariant in their shifts in the diagonal direction. (Such a matrix is defined by its two columns (or rows), the first one and the last one.) A square lower triangular Toeplitz matrix is defined by its first column $u$ and is denoted $L(u)$. $Z = Z_k = (z_{i,j})$ is a $k \times k$ lower triangular Toeplitz matrix with the first column $(0, 1, 0, \ldots, 0)^T$, so that $z_{i+1,i} = 1$ for $i = 0, 1, \ldots, k - 2$, $z_{i,j} = 0$ if $i \ne j + 1$.

Proposition 2.19 ([KS91] (cf. [BP94, Lemmas 1.5.1 and 2.13.1])). Let $S$ be a fixed finite set of cardinality $|S|$. Let $A$, $L$, and $U$ be $n \times n$ matrices, let $\operatorname{rank} A = r$, and let $L$ and $U^T$ be unit lower triangular Toeplitz matrices, each defined by the $n - 1$ entries of its first column. Let these entries be chosen from $S$ at random, independently of each other, under the uniform probability distribution on $S$. Then the matrix $(UAL)^{(r)}$ is strongly nonsingular with a probability of at least $1 - (r + 1)r/|S|$.

3. RD and ERD of a c.-d.d. matrix. With minor deviation from the order of our outline of section 1.6 but in accordance with section 1.7, we will next study the RD, then, in section 4, the integral RD (IRD), and in section 5, matrix inversion.

Hereafter, for convenience, let $\log$ stand for $\log_2$, let $n = 2^h$ for an integer $h = \log n$, and let $V$ be a fixed $n \times n$ c.-d.d. matrix.

We will define an RD of such a matrix $W = V$ based on its representation in the form (2.3) for $q = n/2$. We will first apply (2.3) to $W = V$ and then, recursively, to $W = B$ and $W = S$, and so on, though in fact, we will mostly care about the diagonal blocks $V$, $V_0 = B$, $V_1 = S$, and so on, which we will identify with the nodes of a binary tree, $T$. Similarly, we define the dual RD, that is, the RD of $W^{-1}$, based on the recursive application of (2.4) to $B^{-1}$ and $S^{-1}$. (We will study such a dual RD in section 11 and will use it in section 12.)

The node of $T$ associated with a binary string $\alpha$ of length $|\alpha|$ is a $k \times k$ matrix, denoted by $V_\alpha$, where $k = n/2^{|\alpha|}$. The root of the tree is the $n \times n$ matrix $V = V_\Lambda$, associated with the empty string $\Lambda$. For a string $\beta$ of length less than $h$, we let $\beta 0$ and $\beta 1$ denote the two strings obtained by appending 0 and 1 to $\beta$, respectively. (We use the two characters $\alpha$ and $\beta$ to distinguish between the two classes of binary strings: of length at most $h$ and less than $h$, respectively.) We will assume that the matrix equations (2.1)–(2.4) are satisfied for $W = V_\beta$, $B = V_{\beta 0}$, $S = V_{\beta 1}$, and for any binary string $\beta$ of length $|\beta| < h$. The resulting RD of the matrix $V$ continues up to the level $h$, where it reaches its leaves-matrices $V_\alpha = (v_\alpha)$ of size $1 \times 1$, where $|\alpha| = h$, and where $v_\alpha$ denotes the single entry of $V_\alpha$.

Proposition 3.1. All the nodes $V_\alpha$ of the tree $T$ are c.-d.d. matrices; moreover, $d(V_\alpha) \le d(V) < 1$ for all binary strings $\alpha$.

Proof. Recall that $V$ is a c.-d.d. matrix and recursively apply Proposition 2.13.

Let us formalize the computation of the RD of $V$ by extending the notation (2.1)–(2.2) to $V_\alpha$. We will write

\[ V_\beta = \begin{pmatrix} B_\beta & C_\beta \\ E_\beta & G_\beta \end{pmatrix}, \tag{3.1} \]

\[ V_{\beta 0} = B_\beta, \quad V_{\beta 1} = S_\beta = G_\beta - E_\beta B_\beta^{-1} C_\beta, \tag{3.2} \]

for all binary strings $\beta$ of length less than $h$.

Then, computation of the RD of a c.-d.d. matrix $V$ amounts to recursive computation of $V_{\beta 1}$ of (3.2) for all binary strings $\beta$ of length increasing from 0 to $h - 1$. As a by-product, the computation produces the matrices $V_{\beta 0}^{-1} = B_\beta^{-1}$ for all binary strings $\beta$ of length less than $h$. By appending these inverse matrices $V_{\beta 0}^{-1}$ to the nodes $V_{\beta 0}$ of the tree $T$ (and, consequently, to the RD of $V$), we arrive at the ERD of $V$. The set of the matrices $V_{\beta 0}^{-1}$ will be called the extending set of the RD of $V$.

The RD and the ERD of any matrix $V$ can be defined as long as all the involved nodes-matrices $V_{\beta 0}$ for all the binary strings $\beta$ of length less than $h$ are nonsingular. Recursive application of Proposition 2.10 and Corollary 2.11 yields the following.

Proposition 3.2. There exist the RD and the ERD of any strongly nonsingular (in particular, of any s.p.d.) matrix.

Remark 3.1. As soon as we have the RD of a c.-d.d. matrix $V$, we immediately obtain the RD of $V^{-1}$ based on recursive application of (2.4). Having such an RD available, we may compute the solution $x = V^{-1}f$ of a linear system $Vx = f$ at a lower computational cost $O_A((\log n)^2, n^2/(\log n)^2)$.
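The recursive definition (3.1)–(3.2) translates directly into a short sequential routine. The following is a minimal sketch in exact rational arithmetic, not the parallel algorithm of this paper: the function names, the dictionary keyed by binary strings, and the $4 \times 4$ c.-d.d. test matrix are assumptions made for the illustration.

```python
from fractions import Fraction

def matmul(A, B):
    """Product of two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def inverse(A):
    """Exact Gauss-Jordan inverse over the rationals (A must be nonsingular)."""
    n = len(A)
    M = [[Fraction(x) for x in row] + [Fraction(int(i == j)) for j in range(n)]
         for i, row in enumerate(A)]
    for c in range(n):
        piv = next(r for r in range(c, n) if M[r][c] != 0)
        M[c], M[piv] = M[piv], M[c]
        M[c] = [x / M[c][c] for x in M[c]]
        for r in range(n):
            if r != c:
                f = M[r][c]
                M[r] = [x - f * y for x, y in zip(M[r], M[c])]
    return [row[n:] for row in M]

def det(A):
    """Determinant by cofactor expansion (adequate for tiny examples)."""
    if len(A) == 1:
        return A[0][0]
    return sum((-1) ** j * A[0][j] * det([row[:j] + row[j + 1:] for row in A[1:]])
               for j in range(len(A)))

def recursive_decomposition(V, label="", tree=None):
    """Return {binary string: node matrix} for the RD of V (size a power of 2)."""
    tree = {} if tree is None else tree
    tree[label] = V
    k = len(V)
    if k > 1:
        q = k // 2
        B = [r[:q] for r in V[:q]]; C = [r[q:] for r in V[:q]]
        E = [r[:q] for r in V[q:]]; G = [r[q:] for r in V[q:]]
        EBC = matmul(matmul(E, inverse(B)), C)
        S = [[g - x for g, x in zip(gr, xr)] for gr, xr in zip(G, EBC)]
        recursive_decomposition(B, label + "0", tree)  # V_{beta 0} = B_beta
        recursive_decomposition(S, label + "1", tree)  # V_{beta 1} = S_beta, see (3.2)
    return tree

V = [[10, 1, 2, 0], [1, 12, 0, 1], [2, 0, 9, 1], [0, 1, 1, 11]]
tree = recursive_decomposition(V)
leaves = [tree[a][0][0] for a in ("00", "01", "10", "11")]
leaf_product = 1
for v in leaves:
    leaf_product *= v
```

By Proposition 2.16 applied recursively, the product of the four leaves equals $\det V$, which gives a convenient sanity check on the computed tree.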

4. IRD of a c.-d.d. matrix filled with integers. Suppose that the input matrix $V$ is filled with integers. Then, for all binary strings $\alpha$, the matrices $V_\alpha$ are filled with rationals, and there exist integer multipliers $m_\alpha$ such that $m_\alpha V_\alpha$ are integer matrices. We will next specify a particular choice of such integer multipliers $m_\alpha$.

We will use the notation $W^{(q)}$ of Definition 2.6 and the following definition.

Definition 4.1. $(\alpha)_2$ denotes the binary value represented by a binary string $\alpha$ (of length at most $h$). $\alpha(q)$ denotes the binary string that represents a nonnegative binary value $q$, so that $(\alpha(q))_2 = q$. $H(\alpha)$ denotes $2^{h-|\alpha|} = n/2^{|\alpha|}$. $Q(\alpha)$ denotes $(\alpha)_2 H(\alpha)$.

By applying Proposition 2.12, we obtain the following.

Proposition 4.2. Let a binary string $\alpha$ end with bit 1 and have length at most $h$. Then the matrix $V_\alpha$ is the Schur complement of $V^{(Q(\alpha))}$ in $V^{(Q(\alpha)+H(\alpha))}$.

By combining this proposition with Proposition 2.14, we obtain the following.

Proposition 4.3. For any binary string $\alpha = \beta 1 \gamma$ of length at most $h$ with $\gamma$ being a string of zeros, let

\[ m_\alpha = \det V^{(Q(\beta 1))}. \tag{4.1} \]

Then the entries of the matrix $V_\alpha m_\alpha$ are the subdeterminants (that is, the determinants of some submatrices) of the matrix $V$. In particular, if $V$ is filled with integers, then so are the matrices $V_\alpha m_\alpha$ for all $\alpha$, $|\alpha| \le h$.

Now we replace the matrix $V_{\alpha\gamma}$ by the pair of the scalar $m_\alpha = m_{\alpha\gamma}$ and the matrix $m_\alpha V_{\alpha\gamma}$, in the binary tree $T$, for every pair of binary strings $\alpha$ and $\gamma$ such that $\alpha$ ends with 1, $|\alpha| + |\gamma| \le h$, and $\gamma$ consists only of zeros. This gives us the IRD of a c.-d.d. integer matrix $V$. By definition, we will also include into the IRD of $V$ the two sets, $\{\det V^{(k)}, k = 1, \ldots, n\}$ and $\{\det V_\alpha, |\alpha| \le h\}$, of the determinants associated with the RD of $V$. Clearly, having the IRD of $V$ available, we may immediately compute the RD of $V$. Later in this section, we will specify a simple transition from the RD to the IRD. In section 12, we will also compute a dual IRD by similarly extending the dual RD.

By combining Propositions 2.4 and 4.3, we obtain that for all binary strings $\alpha$, $|\alpha| \le h$, we have

\[ |m_\alpha| \le \|V\|^n, \quad \|m_\alpha V_\alpha\| \le n \|V\|^n. \tag{4.2} \]

For a c.-d.d. integer matrix $V$ given together with its RD, we will seek its IRD. We recall that $v_\alpha$ for $|\alpha| = h$ denotes the single entry of the $1 \times 1$ leaf-matrix $V_\alpha$ of the tree $T$ of the RD. Recursive application of Propositions 2.12 and 2.16 immediately yields the two following results.

Proposition 4.4. For every binary string $\alpha$ of length at most $h$, we have

\[ \det V_\alpha = \prod_\beta v_\beta, \]

where $\prod_\beta$ denotes the product over all binary strings $\beta$ of length $h$ that have $\alpha$ as their prefix; that is, the associated nodes $V_\beta$ are both leaves of the tree $T$ and descendants of the node $V_\alpha$ in the tree $T$.

Proposition 4.5. $\det V^{(q)} = \prod_{(\alpha)_2 < q} v_\alpha$, where $\prod_{(\alpha)_2 < q}$ denotes the product over all binary strings $\alpha$ of length $h$ for which $(\alpha)_2 < q$.

By applying the well-known parallel prefix algorithm [EG88], [KR90], we deduce the following result from Propositions 4.2–4.5 and 2.15.

Corollary 4.6. Given the RD of a c.-d.d. $n \times n$ matrix $V$ filled with integers, one may compute the IRD of $V$ at the cost $O_A(\log n, n/\log n)$.
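By Proposition 4.5, the determinants $\det V^{(q)}$ are the prefix products of the leaves $v_\alpha$ taken in the order of increasing $(\alpha)_2$. The sketch below simulates a simple doubling scan with $\lceil \log n \rceil$ rounds; it is an assumption of this illustration, not the work-optimal scheme behind the processor bound $n/\log n$ of Corollary 4.6, and the name `prefix_products` is hypothetical.

```python
from fractions import Fraction

def prefix_products(v):
    """Inclusive prefix products by a doubling scan.

    The scan takes ceil(log2(len(v))) rounds; within a round every position
    is updated independently, so a round can run in parallel.
    """
    out = list(v)
    step = 1
    while step < len(out):
        out = [out[i] if i < step else out[i - step] * out[i]
               for i in range(len(out))]
        step *= 2
    return out

# Leaves of the RD of V = [[2, 1], [1, 3]]: v_0 = 2 and v_1 = 3 - 1/2 = 5/2.
leaves = [Fraction(2), Fraction(5, 2)]
leading_dets = prefix_products(leaves)  # det V^(1), det V^(2)
```

For this $2 \times 2$ example the scan returns $\det V^{(1)} = 2$ and $\det V^{(2)} = 5$, both integers, although the second leaf is fractional.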

Due to the latter result and to matrix equations (2.1)–(2.3), the computation of the IRD of $V$ can be reduced to a sequence of multiplications, inversions, and subtractions of integer matrices, and we obtain the following parallel arithmetic complexity estimates:

\[ t_{IRD}(n) \le t_I(n/2) + t_{IRD}(n/2) + 2t_M(n/2) + 1, \tag{4.3} \]

\[ p_{IRD}(n) \le \max\{p_I(n/2), 2p_{IRD}(n/2), 2p_M(n/2), n^2\}, \tag{4.4} \]

where $O_A(t_{IRD}(k), p_{IRD}(k))$, $O_A(t_I(k), p_I(k))$, and $O_A(t_M(k), p_M(k))$ denote the time and processor bounds for the computation of the IRD, the inverse, and the product of $k \times k$ matrices, respectively, where

\[ O_A(t_M(k), p_M(k)) = O_A(\log k, k^\omega), \quad 2 \le \omega < 2.376 \tag{4.5} \]

[BP94], and where we also use the obvious complexity bounds $O_A(t_S(k), p_S(k)) = O_A(1, k^2)$ for the subtraction of $k \times k$ matrices.

As the computation of the IRD is our major goal, relations (4.3)–(4.5) motivate the next subject of our study, that is, matrix inversion.

5. Approximate matrix inversion via Newton's iteration.

Algorithm 5.1. Newton's iteration for approximate matrix inversion.

Input: a nonsingular $k \times k$ matrix $B$, two positive scalars $b$ and $c$, $b > c$, and a matrix $X_0$ (a rough initial approximation to $-B^{-1}$) such that

\[ c = -\log \|BX_0 + I\|. \tag{5.1} \]

Output: a matrix $X$ such that

\[ \|BX - I\| \le 2^{-b} \tag{5.2} \]

and, consequently,

\[ \|X - B^{-1}\| \le 2^{-b} \|B^{-1}\|. \tag{5.3} \]

Computations:

1. Compute

\[ g = \lceil \log(b/c) \rceil. \tag{5.4} \]

2. Recursively compute the matrices

\[ X_i = X_{i-1}(2I + BX_{i-1}), \quad i = 1, \ldots, g. \tag{5.5} \]

3. Output the matrix $X = -X_g$.

To prove correctness of Algorithm 5.1, deduce from (5.5) that

\[ I + BX_i = (I + BX_{i-1})^2, \quad i = 1, 2, \ldots, g, \]

and, consequently,

\[ I + BX_i = (I + BX_0)^{2^i}, \quad \|I + BX_i\| \le \|I + BX_0\|^{2^i}, \tag{5.6} \]

for $i = 1, 2, \ldots, g$. In particular, for $i = g$, we have

\[ \|I + BX_g\| \le \|I + BX_0\|^{2^g} = 2^{-c2^g} \le 2^{-b} \]

due to (5.1) and (5.4). This gives us (5.2) and (5.3) for $X = -X_g$.

To estimate the overall computational cost of performing Algorithm 5.1, observe that the $i$th step (5.5) amounts to two matrix multiplications and to adding, at the cost $O_A(1, k)$, the matrix $2I$ to the matrix $BX_{i-1}$. Summarizing, we obtain the following result.

Proposition 5.1. For $g$ of (5.4), $g$ steps (5.5) of Newton's iteration, performed at the overall cost $O_A(g\,t_M(k), p_M(k)) = O_A(g \log k, k^\omega)$ for $\omega$ of (4.5), $2 \le \omega < 2.376$, suffice in order to compute a matrix $X = -X_g$ satisfying (5.2) and (5.3).
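Algorithm 5.1 can be transcribed almost literally. The following is a minimal sequential sketch in exact rational arithmetic; the function names and the $2 \times 2$ test matrix are assumptions of this illustration, and the initial approximation is $X_0 = -D^{-1}(B)$, which is admissible whenever $B$ is c.-d.d.

```python
from fractions import Fraction
from math import ceil, log2

def matmul(A, B):
    """Product of two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def norm1(A):
    """Matrix 1-norm: the largest column sum of absolute values (Proposition 2.3)."""
    return max(sum(abs(x) for x in col) for col in zip(*A))

def add_scaled_identity(A, s):
    """Return A + s*I."""
    return [[x + s * (i == j) for j, x in enumerate(row)] for i, row in enumerate(A)]

def newton_inverse(B, X0, b):
    """Sketch of Algorithm 5.1: return (X, g) with ||B X - I|| <= 2**(-b)."""
    c = -log2(float(norm1(add_scaled_identity(matmul(B, X0), 1))))  # (5.1)
    g = ceil(log2(b / c))                                           # (5.4)
    X = X0
    for _ in range(g):                                              # (5.5)
        X = matmul(X, add_scaled_identity(matmul(B, X), 2))
    return [[-x for x in row] for row in X], g                      # X = -X_g

B = [[4, 1], [1, 5]]                                # a c.-d.d. matrix, d(B) = 1/4
X0 = [[Fraction(-1, 4), 0], [0, Fraction(-1, 5)]]   # X0 = -D^{-1}(B)
X, g = newton_inverse(B, X0, b=16)
residual = norm1(add_scaled_identity(matmul(B, X), -1))  # ||B X - I||
```

Here $c = 2$, so $g = \lceil \log(16/2) \rceil = 3$ steps suffice, and the residual norm equals $\|(I + BX_0)^8\| = 20^{-4}$, which is below $2^{-16}$.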

Remark 5.1. Proposition 5.1 enables us to estimate the complexity of approximate matrix inversion in terms of a scalar $b$ and a matrix $X_0$. Their choice depends on the input matrix $B$ and is critical for estimating the approximation errors and the number of iterations. We will elaborate on this choice in the next sections. Here are some preliminary comments on the choice of $X_0$. If $B$ is a c.-d.d. matrix so that $d(B) < 1$, then (5.1) holds for $X_0 = -D^{-1}(B)$ and $c = -\log d(B)$; that is, $\|BX_0 + I\| \le d(B)$. This gives us a good policy for the choice of $X_0$ and $c$ over the class of c.-d.d. matrices $B$. In this paper, however, we will only need to invert the c.-d.d. matrices $B$ that are close to the scaled identity matrices $-mI$ for a fixed large integer $m$. The inverse of such a matrix $B$ is well approximated by the scaled identity matrix $-I/m$, so that the initial approximation $X_0 = I/m$ to $-B^{-1}$ satisfies

\[ \|BX_0 + I\| < 1/(5n^2), \quad c > 2 \log n. \tag{5.7} \]

In fact, our algorithms and complexity estimates will remain valid under some assumptions that are weaker than (5.7). Say, the bound

\[ c = -\log \|BX_0 + I\| > \theta > 0 \tag{5.8} \]

for a fixed constant $\theta$ would suffice. The choice of $X_0 = -D^{-1}(B)$ also satisfies the error bound of (5.7), but $X_0 = I/m$ is a Toeplitz matrix (see Definition 2.18), which will be a crucially important advantage for the proof of Theorem 1.2 in sections 13 and 14.

6. Approximate RD of an integer matrix and its extension to the exact evaluation of the IRD. Algorithm 5.1 is intended as the matrix inversion block in the algorithms of sections 3–4 for computing the ERD, IRD, and the associated determinants of a c.-d.d. integer matrix $V$. Then, the matrices $V_\alpha$ and the values of $\det V^{(q)}$ and $\det V_\alpha$ for all $q$ and $\alpha$ are computed approximately, even though we still perform all arithmetic operations over the rationals, with infinite precision and no errors. Our next goal is to yield the exact IRD, assuming that we fixed a sufficiently large $b$ and defined $X_0$ and $c$ according to Remark 5.1.

Let us write $\tilde V_\alpha$, $\det \tilde V^{(q)}$, and $\det \tilde V_\alpha$ for the computed approximations to $V_\alpha$, $\det V^{(q)}$, and $\det V_\alpha$, respectively. Then we extend the relations (3.1), (3.2), (5.1), (5.4), and (5.5) by writing

\[ \tilde V_\beta = \begin{pmatrix} \tilde B_\beta & \tilde C_\beta \\ \tilde E_\beta & \tilde G_\beta \end{pmatrix}, \tag{6.1} \]

\[ \tilde V_{\beta 0} = \tilde B_\beta, \quad \tilde V_{\beta 1} = \tilde G_\beta - \tilde E_\beta X_\beta \tilde C_\beta, \tag{6.2} \]

for all binary strings $\beta$ of length less than $h$, where the policy of defining $X_{\beta,0}$ (in accordance with Remark 5.1) will be specified later on, and where

\[ c(\beta) = -\log \|\tilde B_\beta X_{\beta,0} + I\|, \tag{6.3} \]

\[ g(\beta) = \lceil \log(b/c(\beta)) \rceil, \tag{6.4} \]

\[ X_{\beta,i} = X_{\beta,i-1}(2I + \tilde B_\beta X_{\beta,i-1}), \quad i = 1, \ldots, g(\beta), \tag{6.5} \]

\[ X_\beta = -X_{\beta,g(\beta)}, \tag{6.6} \]

and $I$ is the identity matrix of an appropriate size.

Algorithm 6.1. Approximating the RD of a c.-d.d. matrix.

Input: a positive integer $h$, $n = 2^h$, a positive $b$, positive integers $g(\alpha)$ for all binary strings $\alpha$ of length at most $h$, and an $n \times n$ c.-d.d. matrix $V = V_\Lambda$.

Output: a set of matrices $\tilde V_\alpha$ for all binary strings $\alpha$ of length at most $h$, satisfying (6.1)–(6.6), and the values $\det \tilde V_\alpha$ for all $\alpha$ and $\det \tilde V^{(q)}$, $q = 1, \ldots, n$, defined according to Proposition 4.5 with $\det V^{(q)}$ and $v_\alpha$ replaced by $\det \tilde V^{(q)}$ and $\tilde v_\alpha$, respectively.

Computations: Starting with $\tilde V = V_\Lambda$ for the empty string $\Lambda$, recursively apply (6.1)–(6.6) and a fixed policy of defining $X_{\beta,0}$, in order to compute $\tilde V_{\beta 0}$ and $\tilde V_{\beta 1}$ for all binary strings $\beta$ of length less than $h$. (For binary strings $\beta$ of length $h - 1$, the matrices $\tilde V_{\beta 0}$ and $\tilde V_{\beta 1}$ have size $1 \times 1$, so $\tilde V_{\beta 0}$ is inverted immediately, and then the matrix $\tilde V_{\beta 1}$ is computed based on (2.1) and (2.2) for $W = \tilde V_\beta$.) Finally, compute $\det \tilde V_\alpha$ and $\det \tilde V^{(q)}$ for all $\alpha$ and $q$ by applying Propositions 4.4 and 4.5, under the above modification of the notation.

Correctness of the algorithm is immediately verified, provided that the value $b$ is chosen sufficiently large so that all matrices $\tilde B_\beta$ (which approximate the c.-d.d. matrices $B_\beta$) still have the property of being c.-d.d. and that the matrices $X_0 = X_{\beta,0}$ are chosen satisfying (5.8) for $B = \tilde B_\beta$ for all binary strings $\beta$. We note that (5.6) and (6.4) together imply that $\|I - \tilde B_\alpha X_\alpha\| \le 2^{-b}$ and, consequently,

\[ \|X_\alpha - \tilde B_\alpha^{-1}\| \le 2^{-b} \|\tilde B_\alpha^{-1}\|, \tag{6.7} \]

which extends (5.3).

The computational cost is bounded by $O_A((\log n)^2 g, n^\omega)$ for $g = \max_{|\beta| < h} g(\beta)$ and for $\omega$ of (4.5). (Recursively apply (4.3)–(4.5) and Proposition 5.1.) A desired upper bound on $g$ (and, consequently, on the parallel time) will be ensured by (5.8), (6.3), (6.4), and an appropriate choice of $b$. Such a choice and its analysis will be shown in the next sections. $g(\beta)$ will in fact be independent of $\beta$; that is, we will choose $g(\beta) = g$ for all $\beta$.

Next, for a c.-d.d. matrix $V$ filled with integers, we will apply Algorithm 6.1 for a sufficiently large $b$, and then we will apply the techniques of integer rounding (compare [P85], [P87], and [BP94, p. 252]) to extend the resulting approximate RD of $V$ to the evaluation of the IRD of $V$. To yield this extension, we will choose $b$ in Algorithm 6.1 sufficiently large to ensure the following bounds:

\[ |\det \tilde V^{(q)} - \det V^{(q)}| < 1/2, \quad q = 1, \ldots, n, \tag{6.8} \]

\[ \|\tilde V_{\beta 1} - V_{\beta 1}\| < \|V\|^{-n}/2 \quad \text{for all binary strings } \beta, \ |\beta| < h, \tag{6.9} \]

where $\det \tilde V^{(q)}$ and $\tilde V_{\beta 1}$ denote the approximations to $\det V^{(q)}$ and $V_{\beta 1}$, respectively, computed by Algorithm 6.1 for the fixed value of $b$.

Under (6.8) and (6.9), we recover the IRD of $V$ as follows.

Algorithm 6.2.

Input: a set $\{\det \tilde V^{(q)}, q = 1, \ldots, n\}$ of approximations to $\det V^{(q)}$ for all $q$ and an approximate RD of an $n \times n$ matrix $V$ filled with integers, such that (6.8) and (6.9) hold.

Output: the IRD of $V$.

Computations:

1. Round the values $\det \tilde V^{(q)}$ to the closest integers; output the resulting integer values of $\det V^{(q)}$, $q = 1, \ldots, n$.

2. Compute the matrices $\tilde W_{\beta 1} = \tilde V_{\beta 1} \det V^{(Q(\beta 1))}$ and round their entries to the closest integers; output the resulting integer matrices $W_{\beta 1} = V_{\beta 1} \det V^{(Q(\beta 1))}$ for all binary strings $\beta$ of length less than $h$.

3. For all binary strings $\beta$ of length less than $h$ and all binary strings $\gamma$ filled with zero bits and satisfying $|\beta 1 \gamma| \le h$, output the matrices $W_{\beta 1 \gamma} = V_{\beta 1 \gamma} \det V^{(Q(\beta 1))}$ (cf. (4.1) and Definition 4.1).

Correctness of Algorithm 6.2 follows since, clearly, the values $\det V^{(q)}$ are integers for all $q$ and since the matrices $W_{\beta 1} = V_{\beta 1} \det V^{(Q(\beta 1))}$ are filled with integers for all $\beta$ (due to Proposition 4.3). The computational cost of performing the algorithm is bounded by $O_A(1, n^2)$.
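The two rounding steps can be followed on the smallest case $n = 2$, $h = 1$. In the sketch below the inversion error `eps` is a hypothetical value standing in for the error of Newton's iteration, and the $2 \times 2$ integer matrix is chosen only for illustration.

```python
from fractions import Fraction

V = [[4, 1], [2, 5]]                   # integer input matrix, n = 2
eps = Fraction(1, 1000)                # hypothetical error of the approximate inversion
x = Fraction(1, V[0][0]) + eps         # approximation to B^{-1} = 1/4
v0 = Fraction(V[0][0])                 # leaf v_0 = B, known exactly
v1 = V[1][1] - V[1][0] * x * V[0][1]   # approximate Schur complement, close to 9/2

det_approx = [v0, v0 * v1]             # approximations to det V^(1) and det V^(2)
dets = [round(d) for d in det_approx]  # step 1: round to the closest integers
W1 = round(v1 * dets[0])               # step 2: scale by m_1 = det V^(1), then round
```

The approximate determinant $17.992$ is rounded to the exact value $\det V = 18$, and the scaled Schur complement $W_1 = V_1 \det V^{(1)}$ is recovered as the integer $18$, even though $V_1 = 9/2$ is not an integer.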

Remark 6.1. Instead of choosing the multipliers $m_{\beta 1}$ based on Proposition 4.3, one may follow the more straightforward recipe of [R95] and recursively define $m_\beta$ by using induction on $|\beta|$ and by writing $m_{\beta 1} = m_\beta \det V_{\beta 0}$. Then, however, the order of $\log |m_\beta|$ grows from $|\beta|$ (compare our bounds (4.2)) to $|\beta|^2$, and the bit-precision and the bit-complexity of the computations grow by the extra factor $n$. (The statement of Proposition 5.1 of [R95] is false. Its proof relies on an erroneous claim that if a matrix $mA$ is filled with integers, then so is the matrix $m \operatorname{adj} A$; this claim is false, say, for $m = 3$ and the matrix $A = \operatorname{diag}(1/3, \ldots, 1/3)$.)
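The counterexample can be verified directly. For the $n \times n$ matrix $A = \operatorname{diag}(1/3, \ldots, 1/3)$ and $m = 3$, the matrix $mA = I$ is filled with integers, whereas

\[ \operatorname{adj} A = A^{-1} \det A = 3I \cdot 3^{-n} = 3^{1-n} I, \qquad m \operatorname{adj} A = 3^{2-n} I, \]

and the diagonal entries $3^{2-n}$ are not integers for any $n \ge 3$.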

7. Errors of the approximation of the RD and the transition from the RD to the IRD for a c.-d.d. matrix. We are going to implement the next step of the outline of section 1.6 by specifying a c.-d.d. matrix $V$, whose IRD will give us the IRD of $A$ modulo a prime $p$. We recall that, according to Definition 2.1, we write $I$ to denote the identity matrices of appropriate sizes. We will next specify (in terms of $n$ and $\|V\|$) a choice of the input parameter $b$ of Algorithm 6.1 that will enable us to satisfy the relations (6.8) and (6.9), where $V$ is an $n \times n$ matrix of the following class:

\[ V = F - mI. \tag{7.1} \]

Here $F$ is an $n \times n$ matrix filled with nonnegative integers that are less than a fixed prime $p \ge n$ (we will work with $F = A \bmod p$ for an input matrix $A$), and

\[ m = 10p^2n^2. \tag{7.2} \]

Remark 7.1. The choice of a larger $m$ would have made $V$ more strongly diagonally dominant (which is what we would like to have) but would have involved larger integers, which would have increased the Boolean cost of the resulting computations, so we choose only a moderately large $m$. In fact, our construction allows us to choose even a slightly smaller $m$.

Next, let us prove that the entries of the matrices $V$ of this class and of all matrices $V_\alpha$ of their RDs satisfy the following rough estimate, which will suffice for our purpose.

Proposition 7.1. The entries of the matrices $V_\alpha + mI$ lie in the range between $-1/2$ and $p - 1/2$ for all binary strings $\alpha$ of length at most $h = \log n$.

Proof. By the definition of the matrix $V$, the entries of the matrix $F = V + mI$ range from 0 to $p - 1$. By Proposition 4.2, it suffices to prove that the entries of every Schur complement $S$ of an l.p.s. $B = V^{(q)}$ in $V$ range from $-1/2$ to $p - 1/2$ after the addition of $mI$. Since $S = G - EB^{-1}C$ (assuming (2.1) for $W = V$) and since the entries of the submatrix $G + mI$ of $F$ range from 0 to $p - 1$, it suffices to prove that the entries of the matrix $EB^{-1}C$ range between $-1/2$ and $1/2$. Since the matrices $F^{(q)} = B + mI$, $C$, and $E$ are submatrices of $F$, their entries also range from 0 to $p - 1$. Therefore,

\[ \|C\| < (p - 1)n, \quad \|E\| < (p - 1)n, \]

\[ -mB^{-1} = (I - F^{(q)}/m)^{-1} = \sum_{i=0}^{\infty} (F^{(q)}/m)^i, \]

\[ \|B^{-1}\| \le (1/m)(1/(1 - a)), \quad a = \|F^{(q)}\|/m < (p - 1)n/m < 0.03. \]

Consequently, $\|B^{-1}\| < 2/m$, $\|EB^{-1}C\| \le 2(p - 1)^2n^2/m < 1/2$.

Hereafter, we will write

\[ w = m + (p - 1)n. \tag{7.3} \]
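Proposition 7.1 can be checked numerically on a small instance. The sketch below uses exact rational arithmetic; the parameters $n = 4$, $p = 5$ and the matrix $F$ are arbitrary choices made for this illustration, and the helper names are hypothetical. It tests the Schur complements of all leading principal submatrices $V^{(q)}$, which by Proposition 4.2 cover the nodes $V_\alpha$ with $\alpha$ ending in 1.

```python
from fractions import Fraction

def matmul(A, B):
    """Product of two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def inverse(A):
    """Exact Gauss-Jordan inverse over the rationals (A must be nonsingular)."""
    n = len(A)
    M = [[Fraction(x) for x in row] + [Fraction(int(i == j)) for j in range(n)]
         for i, row in enumerate(A)]
    for c in range(n):
        piv = next(r for r in range(c, n) if M[r][c] != 0)
        M[c], M[piv] = M[piv], M[c]
        M[c] = [x / M[c][c] for x in M[c]]
        for r in range(n):
            if r != c:
                f = M[r][c]
                M[r] = [x - f * y for x, y in zip(M[r], M[c])]
    return [row[n:] for row in M]

n, p = 4, 5
m = 10 * p ** 2 * n ** 2                               # (7.2)
F = [[1, 4, 0, 3], [2, 0, 4, 1], [3, 3, 1, 0], [4, 2, 2, 4]]
V = [[F[i][j] - m * (i == j) for j in range(n)] for i in range(n)]  # (7.1)

def shifted_schur(V, q):
    """Schur complement of V^(q) in V, with m*I added back."""
    B = [r[:q] for r in V[:q]]; C = [r[q:] for r in V[:q]]
    E = [r[:q] for r in V[q:]]; G = [r[q:] for r in V[q:]]
    EBC = matmul(matmul(E, inverse(B)), C)
    k = len(G)
    return [[G[i][j] - EBC[i][j] + m * (i == j) for j in range(k)] for i in range(k)]

lo, hi = Fraction(-1, 2), p - Fraction(1, 2)
in_range = all(lo < x < hi
               for q in range(1, n) for row in shifted_schur(V, q) for x in row)
```

For these parameters $m = 4000$, and `in_range` evaluates to `True`: every shifted Schur complement stays within a distance $1/2$ of the corresponding integer block of $F$.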

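Here are three corollaries of Proposition 7.1; the first and the third of them are immediate.

Corollary 7.2. Let $|\alpha| = h$, so that $V_\alpha = (v_\alpha)$ is a $1 \times 1$ matrix. Then we have

\[ |v_\alpha| < m + p. \]

Corollary 7.3. $\|V_{\beta 0}^{-1}\| < 1.1/m$ for all binary strings $\beta$ of length at most $h - 1$.

Proof. Write $F_{\beta 0} = V_{\beta 0} + mI$. Due to Proposition 7.1, we have $\|F_{\beta 0}\| < (p - 1)n < m/(10np)$. On the other hand,

\[ -V_{\beta 0}^{-1} = \frac{1}{m} (I - F_{\beta 0}/m)^{-1} = \frac{1}{m} \sum_{i=0}^{\infty} \left( \frac{F_{\beta 0}}{m} \right)^i. \]

Therefore,

\[ \|V_{\beta 0}^{-1}\| \le \frac{1}{m} \sum_{i=0}^{\infty} \left( \frac{\|F_{\beta 0}\|}{m} \right)^i < \frac{1}{m} \sum_{i=0}^{\infty} \frac{1}{(10np)^i} = \frac{10np}{(10np - 1)m} < \frac{1.1}{m}. \]

Corollary 7.4. $2^{-c_\beta} = \|V_{\beta 0} X_{\beta,0} + I\| < 1/(10pn) \le 1/(10n^2)$ for $X_{\beta,0} = I/m$ and for all binary strings $\beta$ of length at most $h - 1$.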

Hereafter, we will assume that $n > 1$ and that the matrices $X_{\beta,0}$ for all $\beta$ are chosen as in Corollary 7.4, so that the relations (5.8) and even (5.7) hold. Our next task is to estimate the desired range for $b$, which would enable us to recover the IRD. In this section we will prove the following basic proposition.

Proposition 7.5. Under (7.1)–(7.3), both requirements (6.8) and (6.9) are satisfied if the matrices $V_{\beta 1}$ for all binary strings $\beta$ of length less than $h$ are approximated by Algorithm 6.1 within an error norm bound

\[ \sigma = 0.5/w^n. \tag{7.4} \]

Proof. We deduce from (7.1) and (7.3) that

\[ \|V\| \le m + (p - 1)(n - 1) < w. \]

Therefore, we will satisfy (6.9) if we approximate the matrices $V_{\beta 1}$ within the error norm bound (7.4). To prove that (6.8) is satisfied too, we need the next lemma.

Lemma 7.6. The requirement (6.8) is satisfied if the values $v_\alpha$ for all binary strings $\alpha$ of length $h$ are approximated within an error bound

\[ \delta \le \tilde w^{1-n}/(2n + 2), \quad \tilde w = m + p < w. \tag{7.5} \]

Proof of the lemma. By virtue of Proposition 4.5, $\det V^{(q)}$ is the product of exactly $q$ values $v_\alpha$ for $\alpha$ denoting binary strings of length $h$. Under the assumptions of Lemma 7.6, the maximum error of computing $\det V^{(q)}$ may only increase if we assume that $q = n$, that $v_\alpha = \tilde w$, and that the approximations to $v_\alpha$ equal $\tilde w + \delta$ for all $\alpha$. Then, $\det V = \tilde w^n$ is approximated by $(\tilde w + \delta)^n$, with an approximation error

\[ E = (\tilde w + \delta)^n - \tilde w^n = \tilde w^n((1 + (\delta/\tilde w))^n - 1) = \tilde w^{n-1} \delta \sum_{i=1}^{n} (\delta/\tilde w)^{i-1} \binom{n}{i}. \]

We have $\binom{n}{1} = n$, $\binom{n}{i} < 2^n$ for all $i$, and $\delta/\tilde w < 1$. Therefore, $E < (n + (\delta/\tilde w)(n - 1)2^n) \delta \tilde w^{n-1}$.

Equations (7.2) and (7.5) together imply that

\[ \tilde w > (n - 1)2^n \delta, \]

and we may rewrite our bound on $E$ as follows:

\[ E < (n + 1) \delta \tilde w^{n-1}. \]

Substitute (7.5) and obtain that $E < 1/2$.

To complete the proof of Proposition 7.5, we observe that

\[ \sigma = 0.5/w^n < w^{1-n}/(2n + 2) < \tilde w^{1-n}/(2n + 2) \]

(compare (7.5)), and the values $v_\alpha$ for all binary strings $\alpha$ of length $h$ (except for the string $\alpha(0)$ consisting of $h$ zeros) are among the entries of the matrices $V_{\beta 1}$ for $|\beta| \le h - 1$. $v_{\alpha(0)}$ is an entry of $V$ and is known exactly without any computation. Therefore, the assumptions of Lemma 7.6 and, consequently, the requirement (6.8) are satisfied too.
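The worst-case estimate in the proof of Lemma 7.6 is easy to evaluate exactly for concrete parameters. The following sketch computes the error $E$ for the largest $\delta$ allowed by (7.5); the function name `worst_case_det_error` and the variable name `w_leaf` (standing for the bound $m + p$ of Corollary 7.2) are assumptions of this illustration.

```python
from fractions import Fraction

def worst_case_det_error(n, p):
    """Worst-case error E of det V from the proof of Lemma 7.6."""
    m = 10 * p ** 2 * n ** 2                               # (7.2)
    w_leaf = m + p                                         # bound on |v_alpha|, Corollary 7.2
    delta = Fraction(1, (2 * n + 2) * w_leaf ** (n - 1))   # largest error allowed by (7.5)
    return (w_leaf + delta) ** n - w_leaf ** n             # E, computed exactly

E_small = worst_case_det_error(4, 5)
E_larger = worst_case_det_error(8, 11)
```

In both cases the error stays just below $1/2$, so rounding the approximate determinant to the closest integer returns the exact value.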

8. Estimating the error accumulation and the precision of the approx-imation of the matrix inverse. In this section, we will extend Proposition 7.5 byestimating the parameter b of Algorithm 6.1 (which expresses the precision of thematrix inversion) to ensure (7.4) and, consequently, (6.8) and (6.9).

Proposition 8.1. Under some choice of b = O(n log p), the bounds (6.8) and(6.9) can be satisfied in all applications of Newton’s iteration (6.5) within Algorithm6.1, which is in turn applied to approximate the RD of a matrix V satisfying (7.1)and (7.2).

The remainder of this section is devoted to the proof of Proposition 8.1. Due to the bound (5.6), it is actually quite clear that we would ensure (7.4) if we chose sufficiently large values $b = O(n \log p)$, $g(\alpha)$ of (6.4), and $m$ of (7.1), but we will deduce (7.4) already for $m$ of (7.2) and $g(\alpha) = O(\log(n \log p))$ for all $\alpha$. As usual in proofs involving error analysis, some tedious estimates are required. The idea of our proof is to condense the estimation of the error propagation into a single step, which will allow its recursive extension to cover all the nodes $V_\alpha$ of the tree representing the RD. This basic step of our analysis will be given in the form of Proposition 8.2. (We will first give some preliminaries, then state and prove this proposition, and then show that its conclusion enables us to extend its assumptions recursively (and, consequently, its conclusion too) and thus to extend the error estimates recursively to all the descendants of the current node $V_\alpha$ of the tree.)

Proof of Proposition 8.1. Consider a path in the tree $T$ from the root $V$ to a leaf $V_\alpha = (v_\alpha)$, $|\alpha| = h$. Algorithm 6.1 follows such a path by recursively proceeding from matrices $V_\beta$ to $V_{\beta 0}$ and $V_{\beta 1}$. For given $V_\beta$ and $V_{\beta 0}$, the algorithm approximates the matrices $V_{\beta 0}^{-1}$ within the error norm bounds $2^{-b}\|V_{\beta 0}^{-1}\|$ and then extends such an approximation to approximating $V_{\beta 1}$. The errors of the computed approximations to $V_\beta$ are accumulated in the computation of all descendants of $V_\beta$ along the paths in $T$. We need to estimate the resulting overall errors in all the output matrices along all such paths, assuming that $b$ is large though of order $O(n \log p)$.

We will next analyze a single recursive step along such a path; that is, we will first bound the matrix $\Delta(V_\beta)$ of the initial errors of the approximation of $W = V_\beta$, and then we will estimate the propagated errors of the approximation of $S = V_{\beta 1}$ caused by the combined errors due to the initial ones, given by $\Delta(V_\beta)$, and those of Newton's iterates for the inversion of $V_{\beta 0}$.

Hereafter, we will write $\tilde M$ to denote the approximations to matrices $M$ computed by Algorithm 6.1, for $M$ denoting $V_\alpha$ (for any binary string $\alpha$), a submatrix of $V_\alpha$, or any other auxiliary matrix involved. We will also write

\[ \Delta(M) = \tilde M - M. \tag{8.1} \]

We will first estimate $\|\Delta(V_{\alpha 1})\|$ in terms of $\|\Delta(V_\alpha)\|$. For convenience, we write $W = V_\alpha$, $S = V_{\alpha 1}$, recall (2.1), (2.2), and estimate the error propagation in the transition from $W$ to $B$ and $S$. From Proposition 7.1 and Corollary 7.3, we obtain that

\[ \max\{\|B\|, \|C\|, \|E\|, \|G\|\} \le \|W\| \le w, \tag{8.2} \]

\[ \|B^{-1}\| < 1.1/m. \tag{8.3} \]

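We also write (cf. (8.1))

\[ \tilde W = \begin{pmatrix} \tilde B & \tilde C \\ \tilde E & \tilde G \end{pmatrix}, \qquad \Delta(W) = \begin{pmatrix} \Delta(B) & \Delta(C) \\ \Delta(E) & \Delta(G) \end{pmatrix}, \]

\[ \tilde S = \tilde G - \tilde E L \tilde C, \]

where $L$ denotes the computed approximation to $B^{-1}$. Then, clearly,

\[ \max\{\|\tilde B\|, \|\tilde C\|, \|\tilde E\|, \|\tilde G\|\} \le \|\tilde W\|, \tag{8.4} \]

\[ \max\{\|\Delta(B)\|, \|\Delta(C)\|, \|\Delta(E)\|, \|\Delta(G)\|\} \le \|\Delta(W)\|. \tag{8.5} \]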

We will assume that the errors of the approximations obtained via Algorithm 6.1 are sufficiently small so that the following inequalities hold (also cf. the bound $\|B^{-1}\| < 1.03/m$ obtained in the proof of Proposition 7.1):

\[ \|L - \tilde B^{-1}\| \le \nu \|\tilde B^{-1}\|, \qquad \nu \le 1/(4000m^2), \tag{8.6} \]

\[ \|\tilde B^{-1}\| < 1.11/m, \qquad \|L\| \le (1+\nu)\|\tilde B^{-1}\| < 1.11(1+\nu)/m. \tag{8.7} \]

Remark 8.1. The inequalities of (8.6) are reconciled with (5.3) for $\nu \le 2^{-b}$, $L$ replacing $X$, and $\tilde B$ replacing $B$.

In the next proposition we will bound approximation errors for $V_{\beta 0}$ and $V_{\beta 1}$ in terms of a single positive parameter $\Delta = \Delta(\beta)$ defined by the errors of the approximation of $V_\beta$ and by the parameter $\nu$, which is in turn defined by the error exponent $b$ of Newton's iteration for the approximation of $B^{-1}$.

Proposition 8.2. Suppose that the inequalities (8.2)–(8.7) hold and that a positive $\Delta = \Delta(\beta)$ satisfies the following bounds:

\[ \|\Delta(W)\| \le \Delta, \qquad 40\nu m \le \Delta < 0.01\,w, \tag{8.8} \]

where $W = V_\beta$ and $\beta$ is a binary string of length less than $h$. Then, we have
(a) $\|\Delta(B)\| = \|\Delta(V_{\beta 0})\| \le \Delta$,
(b) $\|\Delta(S)\| = \|\Delta(V_{\beta 1})\| \le 5\Delta$.

Proof. Part (a) of the proposition follows immediately since $V_{\beta 0}$ is a submatrix of $V_\beta$. To deduce part (b), we will use the following bound (implied by (8.2) and (8.8)):

\[ \|\tilde W\| < 1.01\,w, \tag{8.9} \]

as well as the next proposition.

Proposition 8.3. For any 4-tuple of $k \times k$ matrices, $X$, $\tilde X$, $Y$, and $\tilde Y$, we have
(a) $\|\Delta(X \pm Y)\| \le \|\Delta(X)\| + \|\Delta(Y)\|$,
(b) $\|\Delta(XY)\| \le \|\Delta(X)\|\,\|Y\| + \|\Delta(Y)\|\,\|X\| + \|\Delta(X)\|\,\|\Delta(Y)\|$,
and if $X$ and $\tilde X$ are nonsingular matrices, then also
(c) $\|\Delta(X^{-1})\| \le \|X^{-1}\|\,\|\tilde X^{-1}\|\,\|\Delta(X)\|$.

Proof. The parts (a)–(c) follow immediately from the next simple equations:
(a) $\Delta(X \pm Y) = \Delta(X) \pm \Delta(Y)$,
(b) $\Delta(XY) = \Delta(X)Y + X\Delta(Y) + \Delta(X)\Delta(Y)$,
(c) $\Delta(X^{-1}) = -X^{-1}\Delta(X)\tilde X^{-1} = -\tilde X^{-1}\Delta(X)X^{-1}$.

Since $S = G - EB^{-1}C$ under (2.2), we will next recursively extend the bound $\|\Delta(W)\|$ on the error norms of $G$, $E$, $B$, and $C$ to yield some bounds on the error norms of $B^{-1}$, $EB^{-1}$, $EB^{-1}C$, and $S$.

We first apply part (c) of the latter proposition for $X = B$ and $\Delta(X^{-1}) = \tilde B^{-1} - B^{-1}$ and obtain that

\[ \|\Delta(B^{-1})\| \le \|B^{-1}\|\,\|\tilde B^{-1}\|\,\|\Delta(B)\|. \]

Substitute (8.3) and (8.7) into the latter bound and obtain that

\[ \|\Delta(B^{-1})\| < 1.221\,\|\Delta(W)\|/m^2. \]

Combine the relations (8.6) and (8.7) to obtain that $\|L - \tilde B^{-1}\| \le 1.11\nu/m \le (1.11)\Delta/(4000m^3)$. Combine the latter bounds on the norms, recall (8.8), and deduce that

\[ \|L - B^{-1}\| \le \|\Delta(B^{-1})\| + \|L - \tilde B^{-1}\| < (1.221 + (1.11)/(4000m))\Delta/m^2 < 1.3\Delta/m^2. \tag{8.10} \]


Apply part (b) of Proposition 8.3 for $X = E$, $Y = L$, and deduce that

\[ \|\Delta(EL)\| \le \|\Delta(E)\|\,\|L\| + \|L - B^{-1}\|\,(\|E\| + \|\Delta(E)\|). \]

Recall from (7.2) and (7.3) that $w/m \le 1.02$. By combining the two latter inequalities with our bounds on $\|L\|$, $\|L - B^{-1}\|$, $\|E\|$, and $\|\Delta(E)\|$ (see (8.4)–(8.10)), obtain that

\[ \|\Delta(EL)\| \le \left(\frac{1.11}{m}(1+\nu) + \left(\frac{1.3}{m^2}\right)1.01\,w\right)\Delta \le \frac{2.5}{m}\Delta. \tag{8.11} \]

Then again, we apply part (b) of Proposition 8.3, this time for $X = EL$, $Y = C$, and obtain that

\[ \|\Delta(ELC)\| \le \|\Delta(EL)\|\,\|C\| + \|\Delta(C)\|\,\|EL\|. \]

Substitute our previous estimates (8.2), (8.4)–(8.9), and (8.11) into the latter inequality and deduce that

\[ \|\Delta(ELC)\| \le \left(\left(\frac{2.5}{m}\right)1.01\,w + \frac{1.11}{m}(1+\nu)\,w\right)\Delta \le 3.7\Delta w/m. \]

Now, since $w/m \le 1.02$, we have

\[ \|\Delta(ELC)\| \le 4\Delta. \]

We obtain $\|\Delta(G)\| \le \Delta$ from (8.5) and (8.8). By applying part (a) of Proposition 8.3 for $X = G$, $Y = ELC$, we deduce that

\[ \|\Delta(S)\| \le \|\Delta(ELC)\| + \|\Delta(G)\| \le 5\Delta, \]

which proves Proposition 8.2.

Now, we observe that the assumptions of Proposition 8.2 are satisfied for $W = V$, $\tilde W = V$, and $\Delta = 40\nu m$, and we extend them to $W = V_\alpha$ for all $\alpha$. The extension from $W = V_\beta$ to $W = V_{\beta 0}$ for any $\beta$ is trivial. We will comment on the extension to $W = V_1$, which will be our sample for the extension from $W = V_\beta$ to $W = V_{\beta 1}$ for any $\beta$. We write $\tilde B = B = V_0$, $L = -X_{1,g}$, $X_{1,0} = -I/m$, $g \ge 4$, and define by (6.5) the matrices $X_{1,i}$ for all $i$. Now, observe that $\|I + X_{1,0}W\| = \|F\|/m$, $\|I + X_{1,0}\tilde W\| \le \|F\|/m + \Delta/m \le (\Delta + (p-1)n)/m < 1/(10pn) < 1/m^{1/2}$.

Therefore, (5.6) implies that

\[ \|X_i + B^{-1}\| \le \|B^{-1}\|/m^{2^{i-1}}, \qquad i = 1, \ldots, g, \]

and, consequently, since $L = -X_g$ for $g \ge 4$, we have

\[ \|L - B^{-1}\| \le \nu\|B^{-1}\| \quad \text{for } \nu \le 1/m^8 < 1/(4000m^2n^{12}), \]

thus satisfying (8.6). The remaining assumption (8.7) of Proposition 8.2 is also easily verified (by using (8.5) and (8.8) and by following the line of the proof of Corollary 7.3).

Now, we are ready to extend Proposition 8.2 recursively, which will give us the desired upper bound on $\|\Delta(V_\alpha)\|$ in terms of $b$. By applying this proposition recursively, we extend its assumptions to $W = V_0$, $W = V_1$, $\Delta = 40\nu m$. (In the subsequent recursive extension from $W = V_{\beta 1}$ to $W = V_\beta$ for any binary string $\beta$ of length at most $h - 1$, we will choose $\nu$ depending on $\alpha$ but satisfying (8.6) for all $\alpha$.)

Apply the bounds of parts (a) and (b) of Proposition 8.2 recursively and obtain that

\[ \|\Delta(W)\| \le \Delta \sum_{i=0}^{|\alpha|} 5^i < n^3\Delta \]

for $W = V_\alpha$ and all $\alpha$ (with $|\alpha| \le h = \log n$). Let us choose a $b$ that enables us to reconcile the initial choice of $\Delta = 40\nu m$ and the latter bound on $\|\Delta(W)\|$ with (8.6)–(8.8). Recall Remark 8.1, recall that the choice of $g$ according to (6.4) implies the bound (6.7) on the output approximation to the inverse, substitute $\nu = 2^{-b}$, and obtain the desired estimate:

\[ \|\Delta(V_\alpha)\| < (40mn^3)2^{-b} < 2^{2-b}w^2n \tag{8.12} \]

for all $\alpha$.

Let us choose $b$ of order $n \log p$ satisfying the bound

\[ b \ge 3 + \log n + (n+2)\log w, \]

which is compatible with the choice of $b = \log(1/\nu)$ and with (8.6). Substitute this bound on $b$ into the preceding upper bound on $\|\Delta(V_\alpha)\|$ and obtain that

\[ \|\Delta(V_\alpha)\| < 0.5\,w^{-n} \]

for all $\alpha$. This satisfies the requirement (7.4) of Proposition 7.5 and completes the proof of Proposition 8.1.
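The choice of $b$ can be illustrated with a small exact computation. The sketch below rests on two assumptions: logarithms are taken to base 2, and the right-hand side of (8.12) is read as $2^{2-b} w^2 n$. It finds the smallest integer $b$ with $2^b \ge 8 n w^{n+2}$, which is equivalent to $b \ge 3 + \log n + (n+2)\log w$, and confirms that $2^{2-b} w^2 n \le 0.5\,w^{-n}$. The sample values and the helper name are hypothetical.

```python
from fractions import Fraction

def min_error_exponent(n, w):
    """Smallest integer b with 2^b >= 8 * n * w^(n + 2) (base-2 logs assumed)."""
    return (8 * n * w ** (n + 2) - 1).bit_length()

# Illustrative sample values (assumptions, not taken from the paper).
for n, w in [(2, 50), (4, 1000), (8, 12345)]:
    b = min_error_exponent(n, w)
    error_bound = Fraction(4, 2 ** b) * w ** 2 * n   # 2^(2-b) w^2 n
    assert error_bound <= Fraction(1, 2 * w ** n)     # 0.5 w^(-n)
```

For fixed $n$, the resulting $b$ grows like $(n+2)\log w$, in line with the order $n \log p$ stated in the text.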

Remark 8.2. By Corollary 7.4, $c(\beta) = O(\log n)$ for all binary strings $\beta$ of length at most $h - 1$. Furthermore, (6.4) and the above choice of $b$ are compatible with the choice of $g = g(\beta)$ of order $\log b = \log(n \log p)$ for all $\beta$.

9. Computations with rounding-off: Estimates for the finite precision and computational cost. So far, we assumed the infinite precision of computing the RD and IRD by means of Algorithms 6.1 and 6.2. Next, we will show that this is not necessary for obtaining the result of Proposition 8.1; that is, we will prove the following.

Proposition 9.1. The estimates of Proposition 8.1 hold even if the computations by Algorithms 6.1 and 6.2 are performed with a precision of $b$ bits, for some $b = O(n \log p)$ provided that a single extra Newton's step (6.5) is performed in each application of Algorithm 6.1.

Proof. Let us first assume that Newton's step (6.5) is computed with infinite precision, but in the transition from $W = V_\alpha$ to $S = V_{\alpha 1}$ for all binary strings $\alpha$, $|\alpha| < h$, let the computations be performed with rounding to the $b$-bit precision. Assuming $B^{-1}$ available, the latter transition involves two multiplications and a subtraction of $k \times k$ matrices for $k = 2^{h-|\alpha|}$ (compare (2.2)). By applying the techniques of backward error analysis [W65], [BL80], we bound the norm of the matrix $\varepsilon(S)$ of the errors of the approximation to $S$ caused by rounding:

\[ \|\varepsilon(S)\| \le n^{O(1)}2^{-b}\,\|W\|\,(1 + \|L\|\,\|W\|) \]

for $L$ denoting the computed approximation to $B^{-1}$ (cf. (8.6), (8.7)). By applying the relations (8.2), (8.7), (7.2), (7.3), and Proposition 7.1, we obtain that

\[ \|\varepsilon(S)\| \le m^{O(1)}2^{-b}. \]

By choosing $b$ of order $n \log p$, we make $\|\varepsilon(S)\|$ less than $2\|\Delta(W)\|$ for $W = V_\alpha$ and for all $\alpha$. This is less than 40% of the upper bound that we have in part (b) of Proposition 8.2. Combining both of these bounds gives us the cumulative upper bound $7\|\Delta(W)\|$, which shows the overall impact of the above rounding errors. This enables us to preserve the validity of the bound (8.12) (since $\sum_{i=0}^{h} 7^i \le n^3$ for $h = \log n$) and, consequently, of the entire proof of Proposition 8.1.
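The two geometric sums used in the error accumulation, with ratio 5 for exact arithmetic and ratio 7 once rounding is included, can be compared with $n^3$ directly. This is a minimal check, assuming $n = 2^h$ with $h \ge 1$ (for $h = 0$ the strict inequality for ratio 5 would fail); the range of $h$ tested is an arbitrary illustrative choice.

```python
def geometric_sum(ratio, h):
    """Sum of ratio^i for i = 0, ..., h."""
    return sum(ratio ** i for i in range(h + 1))

for h in range(1, 21):
    n = 2 ** h
    assert geometric_sum(5, h) < n ** 3    # exact arithmetic, factor 5 per level
    assert geometric_sum(7, h) <= n ** 3   # with rounding, factor 7 per level
```

Both sums stay below $n^3 = 8^h$ because their ratios are smaller than 8; equality for ratio 7 occurs only at $h = 1$.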

It remains to estimate the impact of rounding to $b$-bit precision when we perform Newton's steps (6.5). Then again, we deal with two matrix multiplications (we ignore the errors caused by the simple addition step $2I + (WX_{i-1})$). By applying backward error analysis again, we estimate that

\[ \|\varepsilon(X_i)\| \le n^{O(1)}2^{-b}\,\|X_{i-1}\|\,(2 + \|W\|\,\|X_{i-1}\|), \tag{9.1} \]

where $\varepsilon(X_i)$ denotes the matrix of the errors of approximation of $X_i$ due to rounding in performing iteration (6.5).

Our next goal is to prove the bound

\[ \|X_{i-1}\| \le 1.1(1 + 1/m)/m < 1.21/m \quad \text{for } i \ge 1, \tag{9.2} \]

ignoring for simplicity the terms of order $2^{-2b}$ or less.

We have from Corollary 7.4 that $\|I + BX_0\| < 1/m$ for $m = 10pn \ge 20p$ and for $X_0 = -I/m$. Then we obtain from (5.6) that

\[ \|I + BX_{i-1}\| \le 1/m^{2^{i-1}}. \]

Therefore,

\[ \|X_{i-1} + B^{-1}\| \le \|B^{-1}\|/m^{2^{i-1}}. \]

Consequently,

\[ \|X_{i-1}\| \le \|B^{-1}\|\left(1 + \frac{1}{m^{2^{i-1}}}\right) \]

for $i = 1, 2, \ldots$. Substitute (8.3) and arrive at inequality (9.2).

Substitute bound (9.2) on $\|X_{i-1}\|$ and the bound $\|W\| \le w$ of (8.2) into (9.1), recall (7.2) and (7.3), and obtain that

\[ \|\varepsilon(X_i)\| \le (np)^{O(1)}2^{-b} \quad \text{for all } i. \]

By choosing a sufficiently large $b$, though of order $n \log p$, we easily ensure that

\[ \|\varepsilon(X_i)\| < 2^{-b}/\|B\|. \]

Therefore,

\[ \|I + B(X_i + \varepsilon(X_i))\| \le \|I + BX_i\| + \|B\|\,\|\varepsilon(X_i)\| \le \|I + BX_{i-1}\|^2 + 2^{-b}. \]

Since Newton's iteration (6.5) stops if $\|I + BX_{i-1}\| \le 2^{-b}$, we may assume that $\|I + BX_{i-1}\| > 2^{-b}$, so that the rounding may at worst change (6.7) into the bound

\[ \|I + B(X_i + \varepsilon(X_i))\| \le 2\|I + BX_i\| \le 2^{\sigma}\|I + BX_0\|^{2^i} < 2(2\|I + BX_0\|)^{2^i}, \]

where $\sigma = \sum_{s=0}^{i} 2^s < 2^{i+1}$. Since $\|I + BX_0\| \le 1/m$, the impact of the rounding on the residual norm of the output approximation computed by Newton's iteration is more than compensated by a single extra step (5.5), (6.5), and this completes the proof of Proposition 9.1.

Let us next summarize our current complexity estimates for the computation of the IRD before we improve them slightly in the next section. Choose b and b of order $n \log p$ and choose $g$ of order $\log(n \log p)$, which is consistent with (6.4) under (5.7) or (5.8). Now, by combining the results of Corollaries 4.6 and 7.3 and Propositions 8.1 and 9.1 with Remark 8.2 and the estimates for the arithmetic parallel complexity of performing Algorithms 6.1 and 6.2 and by using the B-principle, obtain the following corollary.

Corollary 9.2. Algorithm 6.2 computes the IRD of a matrix $V$ satisfying (7.1) and (7.2) at the cost $O_A((\log n)^2 \log(n \log p),\ n^{\omega}/\log n)$ for $\omega$ of (4.5). Furthermore, $b$-bit precision suffices in these computations for some $b$ of order $n \log p$.

10. Pipelined computation of the IRD. Our next goal is a modification of Algorithm 7.1, which, as we claimed in section 1.6, will enable us to improve by a factor of $\log n$ the asymptotic time-complexity bounds of Corollary 9.2, without increasing the processor bound by more than a constant factor. To achieve this goal, we will incorporate into our construction the techniques of pipelining along the lines of [PR91] (where such techniques were called stream contraction and applied to computing the RD of a matrix over semirings of a certain class). Here are our informal underlying observations.

Algorithm 6.1 is not fully efficient because it spends substantial time and work on refining the approximations to the inverses of the matrices $B_\beta = V_{\beta 0}$ (cf. (6.5)); this delays the subsequent use of such approximations in the inversion of the matrices $V_{\beta 10} = B_{\beta 1}$, which anyway starts with a much cruder approximation $-I/m$. Next, we will modify Algorithm 6.1. We will start the Newton process of the inversion of $B_{\beta 1}$ by relying on the available rough approximations to $B_\beta^{-1}$, and then we will recursively produce a stream of better approximations when the process progresses.

In other words, we are going to pipeline the recursion on $\alpha$ (decomposition) and the Newton one (inversion). To approximate the matrix $V_{\beta 1}$ of the RD and ERD, we will start using the intermediate approximations to the inverse $V_{\beta 0}^{-1}$, as soon as they are computed by Newton's iteration. We will update the resulting approximation to $V_{\beta 1}$ as soon as the approximation to $V_{\beta 0}^{-1}$ is refined; that is, in the process of the computation of the matrix $V_{\beta 1}$, we will keep refining every step of the computations as soon as we refine its input.

More precisely, we will initialize this process by fixing some natural $g$, to be specified later on. As before, $\alpha$ and $\beta$ will denote binary strings, $|\alpha| \le h$, $|\beta| < h$, and $\gamma$ will denote the unary strings consisting of zero bits. $u(\alpha)$ will denote the number of bits equal to one in a binary string $\alpha$. $t$ will denote integers in the range from $t_0$ to $g + h$. Here, $t_0 = u(\alpha)$ in (10.1) and (10.3) (where $\alpha$ is fixed), $t_0 = u(\beta) + 1$ in (10.4)–(10.7) (where $\beta$ is fixed), and $t_0 = 0$ elsewhere, that is, in (10.2).

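We will now define the following matrices whose subscripts $\alpha$, $\beta$, $\gamma$, and $t$ range as specified above:

\[ V_{\alpha,t} = \begin{pmatrix} B_{\alpha,t} & C_{\alpha,t} \\ E_{\alpha,t} & G_{\alpha,t} \end{pmatrix}, \tag{10.1} \]

\[ V_{\gamma,t} = V_\gamma \tag{10.2} \]

(cf. Definition 2.6),

\[ V_{\alpha\gamma,t} = (V_{\alpha,t})^{(q)} \quad \text{for } |\alpha\gamma| \le h,\ q = 2^{h-|\alpha\gamma|}, \tag{10.3} \]

\[ X_{\beta,t,0} = \begin{cases} -I/m & \text{for } t = u(\beta) + 1, \\ -X_{\beta,t-1} & \text{for } t > u(\beta) + 1, \end{cases} \tag{10.4} \]

\[ X_{\beta,t,i+1} = X_{\beta,t,i}(2I + V_{\beta 0,t}X_{\beta,t,i}), \quad i = 0, 1, 2, 3, 4, \tag{10.5} \]

\[ X_{\beta,t} = -X_{\beta,t,5}, \tag{10.6} \]

\[ V_{\beta 1,t+1} = G_{\beta,t} - E_{\beta,t}X_{\beta,t}C_{\beta,t}. \tag{10.7} \]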
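The Newton step (10.5) can be illustrated in isolation. The sketch below is not the pipelined algorithm; it only runs five steps $X_{i+1} = X_i(2I + BX_i)$ from $X_0 = -I/m$ on one small matrix and checks that the residual $I + BX_i$ is squared at every step, so that $-X_5$ approximates $B^{-1}$. It assumes a test matrix of the form $B = mI + F$ with small integer $F$ (the sign convention of (7.1) is not visible in this section), the maximum row sum norm, and hypothetical helper names.

```python
from fractions import Fraction as Fr

def eye(k):
    return [[Fr(int(i == j)) for j in range(k)] for i in range(k)]

def add(A, B):
    return [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

def scale(A, c):
    return [[c * a for a in row] for row in A]

def mul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def norm(A):
    """Maximum row sum norm (an assumed choice of norm)."""
    return max(sum(abs(x) for x in row) for row in A)

def newton_inverse(B, m, steps=5):
    """Run X <- X(2I + BX) from X_0 = -I/m; return (-X, residual norms)."""
    I = eye(len(B))
    X = scale(I, Fr(-1, m))
    residual_norms = [norm(add(I, mul(B, X)))]
    for _ in range(steps):
        X = mul(X, add(scale(I, Fr(2)), mul(B, X)))
        residual_norms.append(norm(add(I, mul(B, X))))
    return scale(X, Fr(-1)), residual_norms

# Hypothetical test data: B = m I + F with a small integer matrix F.
m = 40
F = [[1, 2], [0, 1]]
B = [[Fr(m * (i == j) + F[i][j]) for j in range(2)] for i in range(2)]
B_inv_approx, r = newton_inverse(B, m)

# I + B X_{i+1} = (I + B X_i)^2, hence the residual norm is at least squared.
assert all(r[i + 1] <= r[i] ** 2 for i in range(5))
```

Starting from the residual norm $\|F\|/m = 3/40$, five steps bring the residual below $(3/40)^{32}$, which shows why a fixed small number of steps per stage suffices once the initial residual is small.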

Now, we are ready to specify our pipelined algorithm.

Algorithm 10.1. Stream contraction for approximating the RD.

Input: natural $g$, $h$, and $n = 2^h$; an $n \times n$ matrix $V$.

Output: for all binary strings $\alpha$ of length at most $h$, matrices $V_{\alpha,g+u(\alpha)}$ satisfying the equations (10.1)–(10.7) and approximating the matrices $V_\alpha$, respectively.

Computations:

Stage 0. Apply (10.2) for $t = 0$ to define the matrices $V_{\gamma,0}$, $|\gamma| = 0, \ldots, h$.

Stage $t$, $t = 1, \ldots, g + h$. Concurrently in all binary strings $\beta$ of length less than $h$ with $u(\beta) < t$, compute successively:
(a) $X_{\beta,t,0}$, based on (10.4),
(b) $X_{\beta,t,i+1}$ for $i = 0, 1, 2, 3, 4$, based on (10.5),
(c) $X_{\beta,t}$, by (10.6),
(d) $V_{\beta 1,t+1} = G_{\beta,t} - E_{\beta,t}X_{\beta,t}C_{\beta,t}$, based on (10.1) and (10.7),
(e) $V_{\beta 1\gamma,t+1}$, by (10.3) where $\alpha = \beta 1$.

These rules are complemented by the following.

Stopping criterion: Output the matrices $V_{\alpha,g+u(\alpha)}$ for all binary strings $\alpha$ of length at most $h$ and cancel all the subsequent computations involving these matrices.

For the reader's convenience, we will next list the matrices computed at stages 1,

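2, and 3, letting $\gamma_0, \gamma_1, \gamma_2, \gamma_3$ denote unary strings filled with zeros.

Stage 1:
$X_{\gamma_0,1,0} = -I/m$,
$X_{\gamma_0,1,i+1} = X_{\gamma_0,1,i}(2I + V_{\gamma_0 0}X_{\gamma_0,1,i})$, $i = 0, 1, 2, 3, 4$,
$X_{\gamma_0,1} = -X_{\gamma_0,1,5}$,
$V_{\gamma_0 1,2} = G_{\gamma_0} - E_{\gamma_0}X_{\gamma_0,1}C_{\gamma_0}$ for $G_{\gamma_0}$, $E_{\gamma_0}$, $C_{\gamma_0}$ of (3.1), $|\gamma_0| < h$,
$V_{\gamma_0 1\gamma_1,2} = V^{(q)}_{\gamma_0 1,2}$ for $q = 2^{h-1-|\gamma_0\gamma_1|}$, $|\gamma_0\gamma_1| < h$.

Stage 2:
$X_{\gamma_0,2,0} = -X_{\gamma_0,1}$,
$X_{\gamma_0,2,i+1} = X_{\gamma_0,2,i}(2I + V_{\gamma_0 0,2}X_{\gamma_0,2,i})$, $i = 0, 1, 2, 3, 4$,
$X_{\gamma_0,2} = -X_{\gamma_0,2,5}$,
$V_{\gamma_0 1,3} = G_{\gamma_0} - E_{\gamma_0}X_{\gamma_0,2}C_{\gamma_0}$ for $|\gamma_0| < h$,
$V_{\gamma_0 1\gamma_1,3} = V^{(q)}_{\gamma_0 1,3}$ for $q = 2^{h-1-|\gamma_0\gamma_1|}$, $|\gamma_0\gamma_1| < h$,
$X_{\gamma_0 1\gamma_1,2,0} = -I/m$,
$X_{\gamma_0 1\gamma_1,2,i+1} = X_{\gamma_0 1\gamma_1,2,i}(2I + V_{\gamma_0 1\gamma_1 0,2}X_{\gamma_0 1\gamma_1,2,i})$, $i = 0, 1, 2, 3, 4$,
$X_{\gamma_0 1\gamma_1,2} = -X_{\gamma_0 1\gamma_1,2,5}$,
$V_{\gamma_0 1\gamma_1 1,3} = G_{\gamma_0 1\gamma_1,2} - E_{\gamma_0 1\gamma_1,2}X_{\gamma_0 1\gamma_1,2}C_{\gamma_0 1\gamma_1,2}$ for $G_{\gamma_0 1\gamma_1,2}$, $E_{\gamma_0 1\gamma_1,2}$, $C_{\gamma_0 1\gamma_1,2}$ of (10.1), with $\alpha = \gamma_0 1\gamma_1$, $t = 2$, $|\gamma_0\gamma_1| < h - 1$,
$V_{\gamma_0 1\gamma_1 1\gamma_2,3} = V^{(q)}_{\gamma_0 1\gamma_1 1,3}$ for $q = 2^{h-2-|\gamma_0\gamma_1\gamma_2|}$, $|\gamma_0\gamma_1\gamma_2| < h - 1$.

Stage 3:
$X_{\gamma_0,3,0} = -X_{\gamma_0,2}$,
$X_{\gamma_0,3,i+1} = X_{\gamma_0,3,i}(2I + V_{\gamma_0 0}X_{\gamma_0,3,i})$, $i = 0, 1, 2, 3, 4$,
$X_{\gamma_0,3} = -X_{\gamma_0,3,5}$,
$V_{\gamma_0 1,4} = G_{\gamma_0} - E_{\gamma_0}X_{\gamma_0,3}C_{\gamma_0}$ for $|\gamma_0| < h$,
$V_{\gamma_0 1\gamma_1,4} = V^{(q)}_{\gamma_0 1,4}$ for $q = 2^{h-1-|\gamma_0\gamma_1|}$, $|\gamma_0\gamma_1| < h$,
$X_{\gamma_0 1\gamma_1,3,0} = -X_{\gamma_0 1\gamma_1,2}$,
$X_{\gamma_0 1\gamma_1,3,i+1} = X_{\gamma_0 1\gamma_1,3,i}(2I + V_{\gamma_0 1\gamma_1 0,3}X_{\gamma_0 1\gamma_1,3,i})$, $i = 0, 1, 2, 3, 4$,
$X_{\gamma_0 1\gamma_1,3} = -X_{\gamma_0 1\gamma_1,3,5}$,
$V_{\gamma_0 1\gamma_1 1,4} = G_{\gamma_0 1\gamma_1,3} - E_{\gamma_0 1\gamma_1,3}X_{\gamma_0 1\gamma_1,3}C_{\gamma_0 1\gamma_1,3}$ for $G_{\gamma_0 1\gamma_1,3}$, $E_{\gamma_0 1\gamma_1,3}$, $C_{\gamma_0 1\gamma_1,3}$ of (10.1), with $\alpha = \gamma_0 1\gamma_1$, $t = 3$, $|\gamma_0\gamma_1| < h - 1$,
$V_{\gamma_0 1\gamma_1 1\gamma_2,4} = V^{(q)}_{\gamma_0 1\gamma_1 1,4}$ for $q = 2^{h-2-|\gamma_0\gamma_1\gamma_2|}$, $|\gamma_0\gamma_1\gamma_2| < h - 1$,
$X_{\gamma_0 1\gamma_1 1\gamma_2,3,0} = -I/m$,
$X_{\gamma_0 1\gamma_1 1\gamma_2,3,i+1} = X_{\gamma_0 1\gamma_1 1\gamma_2,3,i}(2I + V_{\gamma_0 1\gamma_1 1\gamma_2 0,3}X_{\gamma_0 1\gamma_1 1\gamma_2,3,i})$, $i = 0, 1, 2, 3, 4$,
$X_{\gamma_0 1\gamma_1 1\gamma_2,3} = -X_{\gamma_0 1\gamma_1 1\gamma_2,3,5}$,
$V_{\gamma_0 1\gamma_1 1\gamma_2 1,4} = G_{\gamma_0 1\gamma_1 1\gamma_2,3} - E_{\gamma_0 1\gamma_1 1\gamma_2,3}X_{\gamma_0 1\gamma_1 1\gamma_2,3}C_{\gamma_0 1\gamma_1 1\gamma_2,3}$ for $G_{\gamma_0 1\gamma_1 1\gamma_2,3}$, $E_{\gamma_0 1\gamma_1 1\gamma_2,3}$, $C_{\gamma_0 1\gamma_1 1\gamma_2,3}$ of (10.1), with $\alpha = \gamma_0 1\gamma_1 1\gamma_2$, $t = 3$, $|\gamma_0\gamma_1\gamma_2| < h - 2$,
$V_{\gamma_0 1\gamma_1 1\gamma_2 1\gamma_3,4} = V^{(q)}_{\gamma_0 1\gamma_1 1\gamma_2 1,4}$, $q = 2^{h-3-|\gamma_0\gamma_1\gamma_2\gamma_3|}$, $|\gamma_0\gamma_1\gamma_2\gamma_3| < h - 2$.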
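The stage listing follows one scheduling rule: a binary string $\beta$ with $|\beta| < h$ takes part in Stage $t$ exactly when $u(\beta) < t$, and its first Newton process starts at stage $u(\beta) + 1$. The following sketch only enumerates that schedule; it performs no matrix arithmetic, and the function names are hypothetical.

```python
from itertools import product

def binary_strings(max_len):
    """All binary strings of length 0, 1, ..., max_len, shortest first."""
    for length in range(max_len + 1):
        for bits in product("01", repeat=length):
            yield "".join(bits)

def active_at_stage(t, h):
    """Strings beta with |beta| < h and u(beta) < t, processed at Stage t."""
    return [b for b in binary_strings(h - 1) if b.count("1") < t]

def first_stage(beta):
    """Stage at which the inversion for beta starts from -I/m."""
    return beta.count("1") + 1

# For h = 3: Stage 1 handles only all-zero strings; later stages add
# strings with more ones, which reuse the earlier approximations.
for t in (1, 2, 3):
    print(t, active_at_stage(t, 3))
```

The schedule makes the pipelining visible: strings with more ones join later, while strings that are already active keep refining their approximations at every subsequent stage.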

Correctness of Algorithm 10.1 is immediately verified. It remains to specify the choice of natural $g$, which would satisfy the requirements of Proposition 7.5, and then to estimate the resulting computational cost.

As in section 8, we assume infinite precision computations in Algorithm 10.1, but the same techniques of backward error analysis as in section 9 enable a relatively simple transition to the case of computations with rounding to finite precision of order $n \log p$ bits.

The analysis of the approximation errors and of the computation precision given in section 8 is easily extended. In particular, the extension of Proposition 8.2 and its proof is immediate provided that in its statement $V_\beta$, $V_{\beta 0}$, and $V_{\beta 1}$ are replaced by $V_{\beta,t}$, $V_{\beta 0,t+1}$, and $V_{\beta 1,t+1}$, $t > u(\beta)$. Furthermore, the assumptions of this proposition are extended recursively with each increase of $t$ and the length $|\beta|$ by 1. Such an extension is analyzed as in section 9. (The transition from $V_{\beta,t}$ to $V_{\beta 1,t+1}$ involves five Newton steps (10.5), versus four steps used in section 8; an extra step compensates us for the impact of the rounding errors; this suffices according to the analysis in section 9.) The small factor 5 of the error propagation bound of part (b) of Proposition 8.2 (even when it increases to 7 due to the rounding errors) is immediately suppressed by Newton's steps (10.5). To accommodate the factors 5 or 7, we also should increase the upper bound on $\|V^{-1}_{\beta 0,t}X_{\beta,t,0} + I\|$ obtained in Corollary 7.4; the increase is from $1/(10n^2)$ to $1/(2n^2)$ or to $7/(10n^2)$, respectively. This, however, implies only a nominal increase of $g$, which we may set equal to

\[ g = 1 + \lceil \log(b/(2\log n)) \rceil, \tag{10.8} \]

say. ($2\log n$ in the denominator replaces $\log(10n^2)$, which more than compensates us for the error propagation factors 5 or 7.) The computation of every $V_{\alpha,g+u(\alpha)}$ involves at least $5g$ Newton steps (10.5), so that the output error norm bound $2^{-b}$ is guaranteed under (10.8). Therefore, to satisfy the requirement (7.4) of Proposition 7.5, it is sufficient to choose $b$ of order $n \log p$. Then, by (10.8), we have

\[ g = O(\log(n \log p)). \tag{10.9} \]

Now, let us estimate the computational cost of performing Algorithm 10.1.

For any $t$, Stage $t$ amounts essentially to ten steps of multiplication of at most $n/k$ pairs of $k \times k$ matrices for $k = 2^{\ell}$, $\ell = 1, 2, \ldots, h$. All these multiplications for $k = 2^{\ell}$ and for all $\ell$ are performed concurrently. Their overall cost is bounded by $O_A(\log n, n^{\omega})$, $2 \le \omega < 2.376$ (compare (4.5) and observe that $\sum_{\ell=1}^{h} 2^{\ell}(n/2^{\ell})^{\omega} = O(n^{\omega})$ for $\omega > 1$). Summarizing these bounds for all stages $t$, $t = 1, \ldots, g + h$, we obtain the following proposition.

Proposition 10.1. Algorithm 10.1 supports approximating the RD of a c.-d.d. $n \times n$ matrix $V$ of (7.1) within the error norm bound $2^{-b}$, at the overall cost $O_A((\log n)(\log n + g), n^{\omega})$ for $g$ of (10.8) and $\omega$ of (4.5).

By combining Algorithms 10.1 and 6.2, summarizing the estimates for the computational cost of their performance, given in particular in (10.9) and Proposition 10.1, and extending the rounding error analysis applied in the proof of Proposition 9.1, we obtain the following corollary.

Corollary 10.2. The IRD of a c.-d.d. $n \times n$ matrix $V$ of (7.1) can be exactly computed at the computational cost $O_A((\log n)\log(n \log p), n^{\omega})$ for $\omega$ of (4.5), $2 \le \omega < 2.376$; moreover, this computation can be performed by only involving operations with $b$-bit precision numbers for $b = O(n \log p)$.

11. Computing modulo a fixed prime of the ERD of an integer matrix. Our next goal is a probabilistic extension of Corollary 10.2 from the class of matrices $V$ of (7.1) to the class of all strongly nonsingular integer matrices $A$. In this section, we will compute the IRD and even the ERD of $A$ modulo a fixed prime $p$; in the next section we will shift to the IRD of $A$.

Let $A = (a_{i,j})$ be a strongly nonsingular $n \times n$ matrix filled with integers $a_{i,j}$. Then by virtue of Proposition 3.2, there exists the RD of $A$. Let $p$ be a fixed prime, let $0 \le f_{i,j} = a_{i,j} \bmod p < p$ for all $i, j$, and let

\[ F = (f_{i,j}) = A \bmod p. \tag{11.1} \]

(Here and hereafter, we assume that $0 \le a \bmod p < p$ for any integer $a$.)

We will compute modulo $p$ the ERD of the matrix $F$ as an auxiliary stage of computing the ERD of $A$. At first, we should examine whether the ERD modulo $p$ of $F$ exists.


Lemma 11.1 (see [IR82]). Let $f(n)$ be a function defined on the set of positive integers such that $f(n) > 0$ and $\lim_{n\to\infty} f(n) = \infty$. Then there exist two positive constants $C$ and $n_0$ such that, for any $n > n_0$, the interval

\[ J = \{p : f(n)/n < p < f(n)\} \tag{11.2} \]

contains at least $f(n)/(C \log f(n))$ distinct primes.

Lemma 11.2. Let $f(n)$, $h_q(n)$, and $k_q(n)$, $q = 1, \ldots, Q$, be some functions in $n$ such that $h_q(n)$ are integer valued, $h_q(n) \ne 0$,

\[ 0 < (h_q(n))^{1/k_q(n)} \le f(n)/n, \qquad k_q(n) > 0, \qquad \lim_{n\to\infty} f(n) = \infty \tag{11.3} \]

for $q = 1, \ldots, Q$. Let $p$ be a random prime in the interval $J$ of (11.2). Then for the positive constants $C$ and $n_0$ of Lemma 11.1 and for any fixed $n > n_0$, we have $h_q(n) \ne 0 \bmod p$ for $q = 1, \ldots, Q$ with a probability at least $1 - (CK(n)\log f(n))/f(n)$, where $K(n) = \sum_{q=1}^{Q} k_q(n)$.

Proof. Let $l_q(n)$ primes lying in the interval $J$ divide $h_q(n)$. Then their product also divides $h_q(n)$ and, therefore, cannot exceed $h_q(n)$. As these primes lie in the interval $J$, each of them exceeds $f(n)/n$, and their product exceeds $(f(n)/n)^{l_q(n)}$. Hence, $(f(n)/n)^{l_q(n)} < h_q(n)$. Compare this inequality with the assumed bound $h_q(n) \le (f(n)/n)^{k_q(n)}$ and obtain that $l_q(n) < k_q(n)$. This holds for all $q$. Therefore, the number of primes lying in $J$ and dividing at least one of the integers $h_q(n)$ (for any $q$) is at most $\sum_{q=1}^{Q} l_q(n) < \sum_{q=1}^{Q} k_q(n) = K(n)$. Compare this number with the overall number of primes in $J$ estimated in Lemma 11.1 and obtain the desired probability estimate.

Proposition 11.3. Let $\rho > 2$ be a fixed scalar, let $A$ be a strongly nonsingular $n \times n$ integer matrix, where $n > 1$, $\|A\| > 1$, and let $p$ be a prime chosen randomly (under the uniform probability distribution) in the interval $J = \{p : n^{\rho-1}\log\|A\| < p < n^{\rho}\log\|A\|\}$. Then $p \ge n$, and the matrix $F$ of (11.1) is strongly nonsingular modulo $p$ with a probability at least $1 - P_{\rho,n}$ for $P_{\rho,n} < (n+1)Cn^{1-\rho}$ and for a positive constant $C$ of Lemmas 11.1 and 11.2.

Proof. Apply Lemma 11.2 for $f(n) = n^{\rho}\log\|A\|$, $h_q(n) = |\det A^{(q)}|$, $Q = n$, and

\[ k_q(n) = (q\log\|A\|)/\log(n^{\rho-1}\log\|A\|), \]

$q = 1, \ldots, n$. Recall from Proposition 2.4 that $|\det A^{(q)}| \le \|A^{(q)}\|^q \le \|A\|^q$ for all $q$, $q \le n$, and deduce that (11.3) holds for all $q \le n$. We immediately deduce that $K(n) = \sum_{q=1}^{n} k_q(n) = ((n+1)n\log\|A\|)/(2\log(n^{\rho-1}\log\|A\|))$ and $(\log f(n))/f(n) = (\log(n^{\rho}\log\|A\|))/(n^{\rho}\log\|A\|)$. Substitute these expressions for $K(n)$ and $(\log f(n))/f(n)$ into Lemma 11.2 and obtain that $(\det A^{(q)}) \bmod p \ne 0$ for $q = 1, \ldots, n$ with a probability at least $1 - P_{\rho,n}$, where

\[ P_{\rho,n} < \frac{(n+1)nC\log(n^{\rho}\log\|A\|)}{2n^{\rho}\log(n^{\rho-1}\log\|A\|)} = \frac{(n+1)C}{2n^{\rho-1}}\left(1 + \frac{\log n}{\log(n^{\rho-1}\log\|A\|)}\right) \]

for all k. By assumption, we have $\|A\| \ge 2$, $\rho > 2$, $n \ge 2$, and it follows that $\log(n^{\rho-1}\log\|A\|) > \log n$. Combine this bound with the above bound on $P_{\rho,n}$ and obtain the claimed estimate of Proposition 11.3.
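The random choice of the prime and the test for strong nonsingularity modulo $p$ can be sketched as follows. This is an illustration under assumptions that are not fixed in this section: the norm is taken to be the maximum row sum, logarithms are to base 2, primes are found by trial division, and the test runs Gaussian elimination without pivoting modulo $p$ (all leading principal minors are nonzero modulo $p$ exactly when every pivot is nonzero). The matrix and function names are hypothetical.

```python
import math
import random

def primes_in_open_interval(lo, hi):
    """All primes p with lo < p < hi, found by trial division."""
    def is_prime(k):
        return k >= 2 and all(k % d for d in range(2, math.isqrt(k) + 1))
    return [k for k in range(max(2, math.floor(lo) + 1), math.ceil(hi))
            if is_prime(k)]

def is_strongly_nonsingular_mod(A, p):
    """True if every leading principal minor of A is nonzero modulo p."""
    n = len(A)
    M = [[x % p for x in row] for row in A]
    for q in range(n):
        pivot = M[q][q]
        if pivot == 0:
            return False
        pivot_inv = pow(pivot, -1, p)  # modular inverse, Python 3.8+
        for i in range(q + 1, n):
            factor = M[i][q] * pivot_inv % p
            for j in range(q, n):
                M[i][j] = (M[i][j] - factor * M[q][j]) % p
    return True

# Hypothetical input: leading principal minors are 2, 5, 18.
A = [[2, 1, 0], [1, 3, 1], [0, 1, 4]]
n, rho = len(A), 3
norm_A = max(sum(abs(x) for x in row) for row in A)  # assumed norm
lo = n ** (rho - 1) * math.log2(norm_A)
hi = n ** rho * math.log2(norm_A)
candidates = primes_in_open_interval(lo, hi)
p = random.choice(candidates)  # uniform choice in the interval J
assert p >= n and is_strongly_nonsingular_mod(A, p)
```

In this small example every prime in the interval works, since the only primes dividing a leading principal minor are 2, 3, and 5; in general a small fraction of the primes in $J$ may fail, which is what the probability bound quantifies.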

Now, we will assume that a prime $p$ has been chosen in the interval $J$ of Proposition 11.3 and the matrix $F$ of (11.1) is strongly nonsingular modulo $p$ and, therefore, possesses its ERD modulo $p$.


The next algorithm computes modulo $p$ such an ERD, representing each auxiliary or output rational value as a pair of its numerator and denominator given as two integers reduced modulo $p$. (This enables us to avoid the costly stage of computing integer reciprocals modulo $p$.) The RD modulo $p$ of $A$ is computed already at Stage 1 of the algorithm. Subsequent stages yield the extending set of the RD modulo $p$ via the computation of the dual RD modulo $p$ (see the definitions of the extending set and the dual RD in section 3).

Algorithm 11.1. Computing the ERD modulo a fixed prime.

Input: a prime $p$ and a pair of strongly nonsingular $n \times n$ matrices $A$ and $F = A \bmod p$ filled with integers.

Output: the (common) ERD modulo $p$ of $A$ and $F$.

Computations:

Stage 0. Compute $m = 10(np)^2$ and the c.-d.d. matrix $V = F - mI$ (cf. (7.1), (7.2)).

Stage 1. Compute modulo $p$ the IRD of $V$ by applying Algorithms 10.1 and 6.2. Then, compute modulo $p$ the RD of $V$, by dividing modulo $p$ all the computed matrices $m_\alpha V_\alpha$ of the IRD by the computed multipliers $m_\alpha$ for all $\alpha$; represent the result of each division by a pair of an entry of $m_\alpha V_\alpha$ reduced modulo $p$ and $m_\alpha \bmod p$. Output the computed RD modulo $p$ of $V$, which is also the RD modulo $p$ of $F = V \bmod p$.

Stage 2. Recall Proposition 4.5 and compute $\det V$.

Stage 3. Recall from Proposition 2.4 that $|\det V| \le \|V\|^n$ and apply Newton's iteration (5.5) for $B = V$ in order to compute an approximation $X$ to $V^{-1}$ satisfying (5.3) for $B = V$ and for $b$ satisfying

\[ 2^{-b}/m < \|V\|^{-n}/2.2. \tag{11.4} \]

Then, compute the entries of the matrix $X \det V$ and round them to the closest integers, which gives us $\operatorname{adj} V$.

Stage 4. Compute the matrix $W = (\operatorname{adj} V) \bmod p - mI$. Apply Algorithms 10.1 and 6.2 to compute modulo $p$ the IRD of $W$ (we will prove that this is the dual IRD modulo $p$ of $A$, $V$, and $F$). Then compute modulo $p$ the matrices $W_{\beta 0}$ of the RD of $W$ for all binary strings $\beta$ of length less than $h$. Output this set of matrices, to be denoted $\{(W_{\beta 0}/\det V) \bmod p\}$. Their entries are the pairs of integers, each reduced modulo $p$; one integer of each pair is an entry of $W_{\beta 0} \bmod p$ and another is $(\det V) \bmod p$. (This set of matrices defines the extending set $\{V^{-1}_{\beta 0} \bmod p\}$ of the RD modulo $p$ of the input matrix $F$.)

To verify correctness of Algorithm 11.1, first extend Corollary 7.3 to obtain that $\|V^{-1}\| \le 1.1/m$. Together with (11.4), this implies the bound

\[ \|X \det V - \operatorname{adj} V\| < 1/2 \]

for the matrix $X$ computed at Stage 3 of Algorithm 11.1. Therefore, the rounding at this stage correctly defines $\operatorname{adj} V$.
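The rounding step of Stage 3 can be illustrated on a $2 \times 2$ example. The sketch below is a simplified stand-in, not the algorithm itself: instead of running Newton's iteration it perturbs the exact inverse by a hypothetical error smaller than $1/(2|\det V|)$ per entry, then rounds the entries of $X \det V$ to the closest integers and compares the result with the adjugate. The matrix and helper names are assumptions.

```python
from fractions import Fraction as Fr

def det2(V):
    (a, b), (c, d) = V
    return a * d - b * c

def adj2(V):
    """Adjugate of a 2 x 2 matrix."""
    (a, b), (c, d) = V
    return [[d, -b], [-c, a]]

def recover_adjugate(X, det_v):
    """Round every entry of X * det V to the closest integer."""
    return [[round(x * det_v) for x in row] for row in X]

# Hypothetical integer matrix with a dominant diagonal.
V = [[41, 2], [3, 40]]
d = det2(V)
exact_inverse = [[Fr(e, d) for e in row] for row in adj2(V)]

# Any entrywise error below 1/(2 |det V|) is removed by the rounding.
eps = Fr(1, 4 * abs(d))
X = [[x + eps for x in row] for row in exact_inverse]
assert recover_adjugate(X, d) == adj2(V)
```

The example shows why the approximation error must be controlled relative to $|\det V|$: multiplying by the determinant magnifies the error, and (11.4) is what keeps the magnified error under $1/2$.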

Furthermore, the matrices $W_\alpha \bmod p$ (see Stage 4) represent the RD modulo $p$ of $\operatorname{adj} V$. Therefore, the set $\{(W_\alpha/\det V) \bmod p\}$ represents the RD modulo $p$ of $V^{-1}$. To complete the correctness proof, it remains to observe that the set of matrices $\{(W_{\beta 0}/\det V) \bmod p, |\beta| < h\}$ is nothing else but the extending set $\{B^{-1}_{\beta 0} \bmod p, |\beta| < h\}$ of the (common) RD modulo $p$ of the three matrices $A$, $V$, and $F = V \bmod p = A \bmod p$. This follows from the next simple result.


Proposition 11.4. Let $\{V_\alpha\}$ and $\{W_\alpha\}$ denote the RD and the dual RD of a pair of $n \times n$ matrices $V$ and $W = V^{-1}$, respectively. Then, $V_\alpha^{-1} = W_\alpha$ for all binary strings $\alpha$ of length at most $h$.

Proof. Compare (2.3) and (2.4) to obtain that $V_0^{-1} = W_0$, $V_1^{-1} = S^{-1} = W_1$. Recursively extend this observation to all binary strings $\alpha$, to complete the proofs of both Proposition 11.4 and, consequently, the correctness of Algorithm 11.1.

Similarly to deducing Corollary 10.2, we estimate the complexity of performing Algorithm 11.1. We arrive at the following proposition.

Proposition 11.5. The ERD modulo a fixed prime $p$ of an $n \times n$ matrix $A$ filled with integers and strongly nonsingular modulo $p$ (that is, such that $(\det A^{(q)}) \bmod p \ne 0$ for all $q$), as well as $\det A^{(q)} \bmod p$ for all $q$ can be computed at the cost $O_A((\log n)\log(n \log p), n^{\omega})$ for $\omega$ of (4.5), $2 \le \omega < 2.376$; moreover, this computation can be performed by computing with the $b$-bit precision operands for $b = O(n \log p)$.

Remark 11.1. One can be tempted to simplify Algorithm 11.1 and to compute modulo $p$ the extending set $\{V^{-1}_{\alpha 0}\}$ of the RD of the matrix $V$ via a more straightforward application of the techniques of sections 3–10. In particular, one may proceed by following the recipe of [R95]: first approximate the matrices $V^{-1}_{\alpha 0}$ closely enough, then multiply the approximations by appropriate integer multipliers $M_\alpha$ to arrive at approximations (within an error norm bounded by less than 1/2) to integer matrices $M_\alpha V^{-1}_{\alpha 0}$, and then recover the matrices $M_\alpha V^{-1}_{\alpha 0}$ via rounding and $V^{-1}_{\alpha 0}$ via divisions by $M_\alpha$. The problem with this approach is in bounding the size of the multipliers $M_\alpha$. We need to have $\log|M_\alpha| = \tilde O(n)$ in order to support the bit-precision bounds of Proposition 11.5, but if we follow the cited recipe, we would only reach the bounds of order $\tilde O(n^2)$ on $\log|M_\alpha|$, which would imply an extra factor of $n$ in the bit-precision and the bit-complexity bounds. Here, the notation $\tilde O(s)$ should be read as $O(s\log^c s)$ for a constant $c$ independent of $s$.

12. p-adic lifting of the ERDs and the recovery of the inverses, determinants, and ranks of integer matrices. In the previous section, we computed the ERD modulo $p$ of an integer matrix $A$, which is strongly nonsingular modulo $p$. We will now compute its $p$-adic (Newton–Hensel's) lifting, that is, the ERD modulo $p^{2^g}$ of $A$ for a fixed natural $g \ge h = \log n$. We will achieve this by incorporating the known techniques [MC79] for $p$-adic lifting of matrix inverses into our Algorithm 10.1. In this application we will slightly simplify the algorithm by replacing the four steps of Newton's iteration of (10.4)–(10.6) by a single step of the computation of the matrix

\[ X_{\beta,t} = X_{\beta,t,0}(2I - V_{\beta 0,t}X_{\beta,t,0}), \tag{12.1} \]

where

\[ X_{\beta,t,0} = \begin{cases} V^{-1}_{\beta 0,t} \bmod p & \text{for } t = u(\beta) + 1, \\ X_{\beta,t-1} & \text{for } t > u(\beta) + 1, \end{cases} \tag{12.2} \]

and all matrices $V^{-1}_{\beta 0,t} \bmod p$ are supplied as an input to the $p$-adic lifting algorithm. (The latter expression for $X_{\beta,t,0}$ replaces (10.4).) The only other change versus Algorithm 10.1 is that all the arithmetic operations in (10.7) and (12.1) are performed modulo $p^{2^s}$ for $s = t - 1 - u(\beta)$ and for $u(\beta)$ denoting (as in section 10) the number of bits equal to one in a binary string $\beta$. Hereafter we refer to the resulting algorithm as Algorithm 12.1.


Correctness of the resulting algorithm follows because (12.1) and the inductive assumption that $X_{\beta,t-1} = V_{\beta 0,t-1}^{-1} \bmod p^{2^s}$, $s = t-2-u(\beta)$, together imply that

$$\bigl(I - V_{\beta 0,t}X_{\beta,t} - (I - V_{\beta 0,t}X_{\beta,t,0})^2\bigr) \bmod p^{2^{s+1}} = 0,$$

and, therefore,

$$X_{\beta,t} = V_{\beta 0,t}^{-1} \bmod p^{2^{s+1}} \tag{12.3}$$

(compare [MC79] or [BP94, Fact 3.3.1, p. 244]).

The arithmetic complexity estimates $O_A((g + \log n)\log n,\ n^\omega)$ of Proposition 10.1 are extended to the case of Algorithm 12.1, where g denotes a fixed natural input value, $g \ge h = \log n$.
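The lifting step (12.1) can be illustrated on a single small matrix. The following is only a minimal sketch, assuming one dense integer matrix V in place of the whole family of matrices of the RD; the function names, the sample matrix, and the hand-computed starting inverse modulo 7 are hypothetical and are not part of Algorithm 12.1.

```python
def mat_mul_mod(A, B, m):
    """Multiply integer matrices A and B, reducing every entry modulo m."""
    return [[sum(a * b for a, b in zip(row, col)) % m for col in zip(*B)]
            for row in A]

def hensel_lift_inverse(V, X0, p, steps):
    """Lift X0 = V^{-1} mod p to V^{-1} mod p^(2^steps) via X <- X(2I - VX)."""
    n = len(V)
    X, m = X0, p
    for _ in range(steps):
        m = m * m  # the modulus is squared at every lifting step
        VX = mat_mul_mod(V, X, m)
        T = [[((2 if i == j else 0) - VX[i][j]) % m for j in range(n)]
             for i in range(n)]
        X = mat_mul_mod(X, T, m)
    return X, m

V = [[2, 1], [1, 3]]      # hypothetical sample matrix, det V = 5
X0 = [[2, 4], [4, 6]]     # V^{-1} mod 7, precomputed by hand (an assumption)
X, m = hensel_lift_inverse(V, X0, 7, 3)
identity_check = mat_mul_mod(V, X, m)  # V * X reduced modulo 7^8
```

Each pass squares the residual I - VX, which is why the exponent of p doubles per step, in line with (12.3).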

We will keep assuming that p is a prime fixed in the interval J of Proposition 11.3, $\|A\| > 1$, $n > 1$, and the matrix F of (11.1) is strongly nonsingular. Furthermore, hereafter we will assume that

$$g = 1 + \left\lfloor \log\frac{1 + n\log\|A\|}{\log p} \right\rfloor. \tag{12.4}$$

Then, we have

$$4\|A\|^{2n} \ge p^{2^g} > 2\|A\|^n. \tag{12.5}$$
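As a quick numerical check of how (12.4) leads to (12.5), here is a small sketch. It assumes that all logarithms are taken to base 2; the sample values of n, of the norm, and of p are hypothetical, and p is not claimed to lie in the interval J.

```python
import math

def lifting_depth(n, norm_a, p):
    """Return g of (12.4), with all logarithms taken to base 2 (an assumption)."""
    ratio = (1 + n * math.log2(norm_a)) / math.log2(p)
    return 1 + math.floor(math.log2(ratio))

n, norm_a, p = 4, 10, 101            # hypothetical sample values
g = lifting_depth(n, norm_a, p)
modulus = p ** (2 ** g)              # p^(2^g)
lower = 2 * norm_a ** n              # 2 ||A||^n
upper = 4 * norm_a ** (2 * n)        # 4 ||A||^(2n)
```

For these sample values the modulus falls between the two bounds of (12.5), as the test of `lower < modulus <= upper` confirms.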

Therefore, by virtue of Proposition 2.4, the value $0.5\,p^{2^g}$ exceeds $|\det A|$ as well as the maximum absolute value of any entry of adj A. We observe that

$$q = \begin{cases} q \bmod p & \text{if } q \bmod p < 0.5p, \\ (q \bmod p) - p & \text{otherwise}, \end{cases}$$

provided that q is an integer and $2|q| < p$. These observations, Corollary 4.6, and relations (12.5) together enable us to recover det A from $(\det A) \bmod p^{2^g}$ and adj A from $(\operatorname{adj} A) \bmod p^{2^g}$, as the p-adic lifting of the ERD is completed. Then, we may immediately compute $A^{-1} = (\operatorname{adj} A)/\det A$, since A is a nonsingular matrix.
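The recovery of a signed integer from its residue can be sketched in a few lines. This is an illustration only; the function name and the sample modulus and determinant value are hypothetical.

```python
def symmetric_residue(r, m):
    """Return the integer q with q = r (mod m) and -m/2 <= q < m/2."""
    r %= m
    return r if 2 * r < m else r - m

m = 7 ** 2                      # hypothetical modulus p^(2^g) with p = 7, g = 1
det_mod_m = (-5) % m            # residue of a determinant equal to -5
recovered = symmetric_residue(det_mod_m, m)
```

The same map would be applied entrywise to the residues of adj A, under the condition that the modulus exceeds twice the largest absolute value to be recovered.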

Remark 12.1. We may control the computational precision at the last lifting stage (where the precision is the largest) simply by performing this stage modulo $p^q$, where $q = \lfloor\log_p(2\|A\|^n)\rfloor + 1$, so that $2\|A\|^n \le p^q \le 2p\|A\|^n$.

Summarizing the algorithms and the complexity estimates of this and the previous sections, we arrive at the following proposition.

Proposition 12.1. Let A be a strongly nonsingular n × n matrix filled with integers. Let n > 1, let $\|A\| > 1$, and let p be a prime from the interval J of Proposition 11.3 for a fixed ρ > 2. Furthermore, let the matrix A be strongly nonsingular modulo p too. Then, one may compute $A^{-1}$ and $\det A^{(k)}$, k = 1, 2, ..., n, in two stages that amount essentially to application of Algorithms 11.1 and 12.1, respectively, and are performed at the arithmetic cost bounded by $O_A((\log n)\log(n\log p),\ n^\omega)$ at the first stage (compare Proposition 11.5) and $O_A((\log n)\log(n\log\|A\|),\ n^\omega)$ at the second stage, for ω of (4.5), 2 ≤ ω < 2.376.

Assuming p chosen from the interval J of Proposition 11.3, we obtain that $\log p = O(\log(n\log\|A\|))$, so that the overall arithmetic cost is dominated by the cost of the second stage.

Corollary 12.2. Under the assumptions of Proposition 12.1, one may compute $A^{-1}$ and $\det A^{(k)}$ for k = 1, 2, ..., n, at arithmetic cost $O_A((\log n)\log(n\log\|A\|),\ n^\omega)$ for ω of (4.5).


Let us extend Proposition 12.1 and Corollary 12.2 to estimate at first the bit-precision and then the Boolean complexity of the same computations.

We immediately recall the bound $O(n\log p)$ on the bit-precision required in Algorithm 11.1, that is, at the first stage of the computations of Proposition 12.1. At the second stage (that is, essentially for Algorithm 12.1), we revisit the derivation of Proposition 10.1, where we estimated the complexity of the stage of numerical approximation of the RD and ERD of V, and recall or estimate again that this stage is essentially reduced to at most g + h substages for g of (12.4) and for h = log n, such that the cost of performing each substage is dominated by the cost of ten steps of multiplication of at most n/k pairs of k × k matrices for $k = 2^\ell$ and $\ell = 0, 1, \ldots, h-1$. At the stage of the application of Algorithm 12.1, only four (instead of ten) steps are needed. At each of these four steps, all the matrix multiplications are performed concurrently, as in the case of the derivation of Proposition 10.1. Furthermore, at every step of Substage t of the second stage, t = 1, ..., g + h, at most $n/2^l$ pairs of matrices of the sizes $2^l \times 2^l$ are encountered for $l = h - |\beta| - 1$, $u(\beta) < t$. Such matrices are pairwise multiplied together modulo $p^{2^{t-u(\beta)}}$,

$$t - u(\beta) \le \lambda(t) = \min\{t, g\}. \tag{12.6}$$

The above bounds on the moduli imply some bit-precision bounds since computation modulo $\ell$ can be performed with $2\lceil\log\ell\rceil$-bit precision. Furthermore, we recall the known estimates $O_B((\log k)\log\log k,\ k)$ for the Boolean complexity of performing an arithmetic operation modulo $2^k - 1$ (see [AHU74], [BP94], [CK91], [RT90]), which can be extended to our computations whenever we perform them with k-bit precision.

By combining the latter estimates with estimates for the arithmetic cost and for the bit-precision of our computations, we bound the Boolean cost of performing Algorithm 11.1, that is, the first stage of the computations supporting Proposition 12.1 (cf. Corollary 10.1) by

$$O_B\bigl((\log n)(\log(n\log p))^2\log\log(n\log p),\ n^{\omega+1}\log p\bigr),$$

and we bound the Boolean cost of performing the tth stage of Algorithm 12.1 by

$$O_B\bigl((\log n)(\log(2^{\lambda(t)}\log p))\log\log(2^{\lambda(t)}\log p),\ n^{\omega}2^{\lambda(t)}\log p\bigr), \quad t = 1, \ldots, g + h.$$

(Compare (12.6) and recall that the tth stage of Algorithm 12.1 is the tth substage of the second stage of the computations of Proposition 12.1.)

By summarizing all these estimates, for p lying in the interval J of Proposition 11.3 and for g satisfying (12.4), (12.5), we estimate the Boolean complexity of our computations. To simplify the expressions for the resulting estimates, we write

$$A = (a_{i,j}), \quad a = \log\max_{i,j}|a_{i,j}| \tag{12.7}$$

and obtain that $g = O(\log(na))$, $g + h = O(\log(na))$, $2^g\log p = O(na)$ for g of (12.4), $\log p = O(\log(na))$, $\log(n\log p) = O(\log(n\log a))$. Then, we rewrite our Boolean cost bounds as follows:

$$O_B\bigl((\log n)(\log(n\log a))^2\log\log(n\log a),\ n^{\omega+1}\log(na)\bigr)$$

for performing Algorithm 11.1,

$$O_B\bigl((\log n)(\log(na))\log\log(na),\ n^{\omega+1}a\bigr)$$


for performing the tth stage of Algorithm 12.1 for t = g + 1, ..., g + h, where λ(t, g) = g + 1, and

$$O_B\bigl((\log n)(t + \log\log(na))\log(t + \log\log(na)),\ n^{\omega}2^t\log p\bigr)$$

for performing the tth stage of Algorithm 12.1 for t = 1, ..., g, where λ(t, g) = t. By applying the B-principle, we bound the overall cost of performing the first $g = O(\log(na))$ stages of Algorithm 12.1 by

$$O_B\bigl((\log n)(\log(na))^2\log\log(na),\ n^{\omega+1}a/\log(na)\bigr),$$

and we bound the overall cost of performing its last h = log n stages by

$$O_B\bigl((\log n)^2(\log(na))\log\log(na),\ n^{\omega+1}a\bigr).$$

Then again, we apply the B-principle to yield the same parallel Boolean time bound, $O((\log n)(\log(na))^2\log\log(na))$, in all the three estimates (for Algorithm 11.1, for the first g stages of Algorithm 12.1, and for its last h stages), which gives us the Boolean processor bounds

$$O\bigl((n^{\omega+1}(\log(n\log a))^2\log\log(n\log a))/(\log(na)\log\log(na))\bigr) = O\bigl(n^{\omega+1}(\log(n\log a))^2/\log(na)\bigr),$$

$$O\bigl(n^{\omega+1}a/\log(na)\bigr),$$

and

$$O\bigl((n^{\omega+1}a\log n)/\log(na)\bigr)$$

for these three groups of computations, respectively. We note that the sum of the three latter bounds gives us $O((\log n)(a + \log n)n^{\omega+1}/\log(na))$.

By using the Boolean cost bounds of Proposition 11.5 for computing $\det A^{(k)} \bmod p$ for all k, and by combining the cited Boolean time bound and the latter processor bound, we obtain the following proposition.

Proposition 12.3. Under the assumptions of Proposition 12.1, one may compute the inverse matrix $A^{-1} \bmod p$ and $\det A^{(k)} \bmod p$, k = 1, ..., n, at the Boolean cost

$$O_B\bigl((\log n)(\log(n\log a))^2\log\log(n\log a),\ n^{\omega+1}\log(na)\bigr),$$

and one may compute the matrix $A^{-1}$ and $\det A^{(k)}$, k = 1, ..., n, at the Boolean cost $O_B((\log n)(\log(na))^2\log\log(na),\ (\log n)(a + \log n)n^{\omega+1}/\log(na))$ for ω of (4.5) and a of (12.7).

Remark 12.2. Our choice of a prime p and our complexity estimates rely on the bounds of Proposition 2.4 on $|\det W|$. For a large class of matrices W, such bounds can be refined a little (e.g., by using Hadamard's upper bound on $|\det A|$) and so can our complexity estimates. Likewise, by expressing the estimates of Proposition 12.3 in terms of $\|A\|$ rather than a, one may obtain some slightly refined (though more complicated) estimates. Finally, our estimates for parallel Boolean cost can be slightly improved if, instead of the bounds $O_B((\log k)\log\log k,\ k)$ on the cost of an arithmetic operation, we rely on the bounds $O_B(\log k,\ k\log\log k)$, which hold for the cost of an addition, a subtraction, and a multiplication (see, e.g., [BP94, p. 297]). We may rely on the latter bound because the ops of the latter three classes are the most numerous among all the ops in our algorithms. Similar observations apply to the estimates of Theorems 1.1 and 1.2.

It remains to work out the strong nonsingularity issue in order to extend the complexity estimates of Corollary 12.2 and Proposition 12.3 to estimates of Theorem 1.1. (Note that, in terms of a, the bounds of Corollary 12.2 turn into $O_A((\log n)\log(na),\ n^\omega)$, as required in Theorem 1.1.)

We will first assume that A is a nonsingular matrix. In this case, $AA^T$ is an s.p.d. matrix and, consequently, a strongly nonsingular matrix, by Corollary 2.11. Consequently, $AA^T$ is strongly nonsingular modulo p, with a probability $1 - P_{\rho,n}$ for $P_{\rho,n}$ bounded according to Proposition 11.3. Therefore, we may apply the results of this section to compute at first $(AA^T)^{-1}$ and then $A^{-1} = A^T(AA^T)^{-1}$ and $x = A^{-1}f$ satisfying Ax = f. (Strong nonsingularity (modulo p) of $AA^T$ is tested as a by-product of computing $(AA^T)^{-1}$.) We may also immediately compute $\det(AA^T) = (\det A)^2$, though this does not give us the sign of det A. The matrix A is singular (that is, det A = 0) if and only if application of the same approach to a matrix A requires us to invert a singular matrix at some step.
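The symmetrization trick can be seen on a tiny example. The sketch below is only an illustration over the rationals: it replaces the ERD-based inversion of this paper with plain Gauss-Jordan elimination without pivoting, which succeeds exactly when all leading principal minors are nonzero. The function names and the sample matrix are hypothetical.

```python
from fractions import Fraction

def transpose(M):
    return [list(col) for col in zip(*M)]

def mat_mul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def inverse_no_pivoting(W):
    """Invert W by Gauss-Jordan elimination with no row exchanges.

    A zero pivot means that some leading principal minor of W vanishes,
    that is, W is not strongly nonsingular.
    """
    n = len(W)
    M = [[Fraction(x) for x in row] + [Fraction(int(i == j)) for j in range(n)]
         for i, row in enumerate(W)]
    for c in range(n):
        piv = M[c][c]
        if piv == 0:
            raise ZeroDivisionError("a leading principal minor vanishes")
        M[c] = [x / piv for x in M[c]]
        for r in range(n):
            if r != c:
                f = M[r][c]
                M[r] = [a - f * b for a, b in zip(M[r], M[c])]
    return [row[n:] for row in M]

A = [[0, 2], [1, 1]]                 # nonsingular, but A[0][0] = 0
S = mat_mul(A, transpose(A))         # A A^T, symmetric positive definite
A_inv = mat_mul(transpose(A), inverse_no_pivoting(S))   # A^T (A A^T)^{-1}
x = [sum(a * b for a, b in zip(row, [3, 5])) for row in A_inv]  # solves A x = (3, 5)
```

Calling `inverse_no_pivoting(A)` directly would raise an error here, because the first leading principal minor of A is zero, whereas the same routine applied to `S` goes through.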

Next, we will apply randomization to relax the assumptions about (strong) nonsingularity of A when we compute rank A and the sign of det A. Towards this goal, we fix ρ > 2, a sufficiently large finite set of integers, S, and two matrices U and L, as specified in Proposition 2.19; we compute the matrix $\tilde A = UAL$ (cf. Remark 12.3 at the end of this section), fix a random prime p in the interval J of Proposition 11.3, and extend Algorithm 11.1 to compute $(\det\tilde A^{(k)}) \bmod p$ for k = 1, ..., n, and $r(p) = \max\{k : (\det\tilde A^{(k)}) \bmod p \ne 0\}$. Let us write $r = \max\{k : \det\tilde A^{(k)} \ne 0\}$, so that rank A ≥ r ≥ r(p). Furthermore, r = rank A with a probability at least $P_r = 1 - (r+1)r/|S|$ (due to Proposition 2.19), and r = r(p), with a probability $1 - P_{\rho,n}$, estimated in Proposition 11.3. Thus, we output r(p) as rank A and arrive at the estimate of Theorem 1.1 for the randomized cost of computing rank A. (Note that in this case, the computations modulo p suffice; thus, in our computation of rank A, we omit the p-adic lifting stage and rely on the first Boolean cost estimate of Proposition 12.3.)

Let us extend this technique to the computation of the sign of det A. If r(p) < n, then $(\det\tilde A) \bmod p = 0$, and we output det A = 0, which is correct with a probability at least $1 - P_{\rho,n}$. Otherwise, that is, if r(p) = n, then we have n ≥ rank A ≥ r(p) = n; that is, A is nonsingular. Furthermore, by using the randomization based on Proposition 2.19, we may compute det A = det(UAL), because UAL is strongly nonsingular, with a probability at least $1 - (n+1)n/|S|$ if A is nonsingular. By letting $|S| = n^4$, say, and by applying Propositions 11.5 and 12.3 to the matrix UAL, we arrive at the desired algorithm for det A, supporting Theorem 1.1.

Now, assume that r(p) < n and that the r(p) × r(p) leading principal submatrix $B = \tilde A^{(r(p))}$ of $\tilde A$ is nonsingular. Let us write

$$\tilde A = \begin{pmatrix} B & C \\ D & E \end{pmatrix}, \quad G = \begin{pmatrix} I & -B^{-1}C \\ 0 & I \end{pmatrix},$$

and observe that

$$\tilde A G = \begin{pmatrix} B & 0 \\ D & Q \end{pmatrix},$$

where Q = 0 if and only if r(p) = rank A. (Compare [KP91] and [BP94, pp. 110 and 333].) This gives us an algorithm for verifying whether r(p) = rank A (at the cost within the asymptotic cost bounds of Theorem 1.1). If so, then the n − r columns of the matrix $LG\binom{0}{I}$, where I denotes the (n − r) × (n − r) identity matrix, give us a basis for the null-space, N(A), of A (compare Definition 2.17). We recall from Fact 2.1 that if there exists a solution x to a linear system Ax = f, then it can be represented as $x = x_0 + z$, $x_0$ being a fixed specific solution and z being a vector from N(A).

Let g be the r-dimensional prefix-subvector of f, made by the first r components of f. Let $y = B^{-1}g$ be the solution to the nonsingular system By = g. Then, a specific solution $x_0$ to the system UAx = Uf is given by $x_0 = LG\binom{y}{0}$ if the latter linear system is consistent, and we have $UAx_0 \ne Uf$ otherwise. This completes our proof of Theorem 1.1.
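The rank verification and the null-space basis described above can be sketched as follows. This is a simplified illustration over the rationals that takes U = L = I (an assumption, so that the randomized preconditioning is skipped) and uses dense elimination instead of the algorithms of this paper; the function names and the sample matrix are hypothetical.

```python
from fractions import Fraction

def solve_no_pivoting(B, C):
    """Return B^{-1} C by Gauss-Jordan elimination without row exchanges."""
    r = len(B)
    M = [[Fraction(x) for x in B[i]] + [Fraction(x) for x in C[i]] for i in range(r)]
    for c in range(r):
        piv = M[c][c]
        if piv == 0:
            raise ZeroDivisionError("a leading principal minor of B vanishes")
        M[c] = [x / piv for x in M[c]]
        for i in range(r):
            if i != c:
                f = M[i][c]
                M[i] = [a - f * b for a, b in zip(M[i], M[c])]
    return [row[r:] for row in M]

def null_space_basis(A, r):
    """Return the columns of G (0; I) as an n x (n - r) matrix, or None if Q != 0."""
    n = len(A)
    B = [row[:r] for row in A[:r]]
    C = [row[r:] for row in A[:r]]
    D = [row[:r] for row in A[r:]]
    E = [row[r:] for row in A[r:]]
    BinvC = solve_no_pivoting(B, C)                      # r x (n - r)
    Q = [[E[i][j] - sum(D[i][t] * BinvC[t][j] for t in range(r))
          for j in range(n - r)] for i in range(n - r)]  # Q = E - D B^{-1} C
    if any(x != 0 for row in Q for x in row):
        return None                                      # r is smaller than rank A
    top = [[-x for x in row] for row in BinvC]
    bottom = [[Fraction(int(i == j)) for j in range(n - r)] for i in range(n - r)]
    return top + bottom

A = [[1, 2, 3], [2, 5, 7], [3, 7, 10]]   # third row = first row + second row
basis = null_space_basis(A, 2)
```

Here the single basis vector is (-1, -1, 1), and a returned value of `None` signals that the tested value of r is not the rank.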

Remark 12.3. Our computations supporting Theorem 1.1 include some n × n matrix multiplications (of A by $A^T$, L, and U). Their cost bound is dominated by the complexity bounds of Theorem 1.1, and a similar argument applies to yield the extension of this theorem to Theorem 1.2, to be shown in section 14 (cf. Proposition 14.1). The increase of the matrix norm in the transition from A to $AA^T$ and $\tilde A = UAL$ may cause the increase only by a constant factor in the estimate for the precision of the computations and their Boolean complexity (if we choose, say, S = {1, 2, ..., |S|} and $|S| = n^{O(1)}$).

13. Some definitions and auxiliary results on computations with structured matrices. Our next goal is to show that the computational cost of our algorithms supporting Theorem 1.1 decreases dramatically, to the level of the estimates of Theorem 1.2, provided that the input matrix has Toeplitz-like structure. In this section we will recall some definitions and some simple and/or well-known facts on Toeplitz-like matrices, which we will use in the next section towards the stated goal (cf. (1.1) and (1.2) of section 1.1, Definition 2.18, and [BP94], [CKL-A87], [KKM79], [P92]).

Proposition 13.1. The product of a k × k Toeplitz matrix (cf. Definition 2.18) and a vector of dimension k can be computed at the cost $O_A(\log k,\ k)$ (via reduction to three FFTs, each on O(k) points, or to convolution of two vectors of dimension O(k)).
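The reduction of a Toeplitz matrix-by-vector product to a single convolution can be sketched as follows. The indexing convention (the list `t` stores the diagonals from the top-right corner to the bottom-left corner) and the function names are assumptions made for this illustration, and the convolution is the naive quadratic one; an FFT-based convolution would be substituted to approach the cost bound of the proposition.

```python
def convolve(a, b):
    """Full (acyclic) convolution of two integer sequences, computed naively."""
    c = [0] * (len(a) + len(b) - 1)
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            c[i + j] += x * y
    return c

def toeplitz_dense(t, k):
    """Dense k x k Toeplitz matrix with entry (i, j) equal to t[i - j + k - 1]."""
    return [[t[i - j + k - 1] for j in range(k)] for i in range(k)]

def toeplitz_matvec(t, v):
    """Compute T v as the middle k entries of the convolution of t and v."""
    k = len(v)
    assert len(t) == 2 * k - 1
    return convolve(t, v)[k - 1:2 * k - 1]

t = [5, 4, 1, 2, 3]          # hypothetical diagonals t_{-2}, ..., t_{2} for k = 3
v = [1, 0, -2]
fast = toeplitz_matvec(t, v)
slow = [sum(a * b for a, b in zip(row, v)) for row in toeplitz_dense(t, 3)]
```

Only 2k - 1 numbers describe the matrix, so the dense form is built here solely to check the result.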

Definition 13.2. For a k × k matrix A and for the matrix Z of Definition 2.18, write $F_+(A) = A - ZAZ^T$, $F_-(A) = A - Z^TAZ$. If $F(A) = GH^T$ for a pair of k × ℓ matrices G and H and for $F = F_+$ or $F = F_-$, then the pair G, H is called an F-generator of A of length ℓ. (Note that, in this case, the pair H, G is an F-generator of $A^T$ of the same length.) The minimum length ℓ of an F-generator of A, for fixed A and F, is called the F-rank of A, is denoted by $r_F(A)$, and is equal to rank F(A). A k × k matrix A is called a Toeplitz-like matrix if it is given with its F-generator (for $F = F_+$ or $F = F_-$) having a length bounded by a constant independent of k. F-generators and F-ranks, for both $F = F_+$ and $F = F_-$, are also called displacement generators and displacement ranks (following the original definitions of [KKM79]).

Proposition 13.3. $r_F(T) \le 2$ if T is a Toeplitz matrix, and $r_F(T) \le 1$ if T is a triangular Toeplitz matrix for $F = F_+$ and $F = F_-$. In particular, $r_F(I) = 1$.
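A small experiment makes the displacement ranks of Proposition 13.3 visible. The sketch assumes that Z is the down-shift matrix (ones on the first subdiagonal, zeros elsewhere), which is how Definition 2.18 is usually stated but is not reproduced in this section; the function names and sample matrices are hypothetical.

```python
from fractions import Fraction

def displacement_plus(A):
    """Return F_+(A) = A - Z A Z^T for the down-shift matrix Z (assumed form of Z)."""
    k = len(A)
    return [[A[i][j] - (A[i - 1][j - 1] if i > 0 and j > 0 else 0)
             for j in range(k)] for i in range(k)]

def rank(M):
    """Exact rank by Gaussian elimination over the rationals."""
    M = [[Fraction(x) for x in row] for row in M]
    rk = 0
    for c in range(len(M[0])):
        piv = next((i for i in range(rk, len(M)) if M[i][c] != 0), None)
        if piv is None:
            continue
        M[rk], M[piv] = M[piv], M[rk]
        for i in range(rk + 1, len(M)):
            f = M[i][c] / M[rk][c]
            M[i] = [a - f * b for a, b in zip(M[i], M[rk])]
        rk += 1
    return rk

T = [[1, 4, 5], [2, 1, 4], [3, 2, 1]]      # Toeplitz
L3 = [[1, 0, 0], [2, 1, 0], [3, 2, 1]]     # lower triangular Toeplitz
I3 = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]     # identity
ranks = [rank(displacement_plus(M)) for M in (T, L3, I3)]
```

For a Toeplitz matrix the displacement keeps only the first row and the first column, which is why its rank cannot exceed 2.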

The correlation to (1.2) is given by the following result.

Proposition 13.4. G, H is an $F_+$-generator (respectively, $F_-$-generator) of A having a length ℓ, $G = (\mathbf{g}_1, \ldots, \mathbf{g}_\ell)$, $H = (\mathbf{h}_1, \ldots, \mathbf{h}_\ell)$, if and only if $A = \sum_{s=1}^{\ell} L(\mathbf{g}_s)L^T(\mathbf{h}_s)$ (respectively, if and only if $A = \sum_{s=1}^{\ell} L^T(\mathbf{g}_s)L(\mathbf{h}_s)$).
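The representation of Proposition 13.4 can be checked on a 3 × 3 Toeplitz matrix. The sketch assumes that L(v) denotes the lower triangular Toeplitz matrix with the first column v (the meaning suggested by (1.2), which lies outside this section) and that Z is the down-shift matrix; the generator below was read off by hand from the first row and column of the displacement and is a hypothetical example.

```python
def lower_toeplitz(v):
    """L(v): lower triangular Toeplitz matrix with first column v (assumed meaning)."""
    k = len(v)
    return [[v[i - j] if i >= j else 0 for j in range(k)] for i in range(k)]

def transpose(M):
    return [list(col) for col in zip(*M)]

def mat_mul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def from_plus_generator(gs, hs):
    """Rebuild A = sum_s L(g_s) L^T(h_s) from the columns g_s, h_s of G and H."""
    k = len(gs[0])
    A = [[0] * k for _ in range(k)]
    for g, h in zip(gs, hs):
        term = mat_mul(lower_toeplitz(g), transpose(lower_toeplitz(h)))
        A = [[x + y for x, y in zip(ra, rt)] for ra, rt in zip(A, term)]
    return A

# F_+-generator of length 2: G H^T = g_1 h_1^T + g_2 h_2^T has (1, 2, 3) as its
# first column, (1, 4, 5) as its first row, and zeros elsewhere.
gs = [[1, 2, 3], [1, 0, 0]]
hs = [[1, 0, 0], [0, 4, 5]]
A = from_plus_generator(gs, hs)
```

Storing the two pairs of vectors takes 4k numbers instead of k^2, which is the memory saving that the generator representation is meant to provide.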

Based on the latter results, we will operate with the F-generators of Toeplitz-like matrices, rather than with the matrices themselves. Such a representation is memory space efficient and also enables us to use less sequential time and fewer processors in Toeplitz-like computations, due to the following corollary (cf. Propositions 13.1 and 13.4).

Corollary 13.5. The product of a k × k Toeplitz-like matrix by a vector of dimension k can be computed at the cost $O_A(\log k,\ k)$.


The next result gives us more specific estimates: the cost bound of Toeplitz-like matrix multiplication is proportional to the square of the sum of the lengths of the F-generators of the input matrices, and such a length is roughly doubled in a matrix addition or multiplication.

Proposition 13.6. Given F-generators, $G_A$, $H_A$ of length $\ell_A$ and $G_B$, $H_B$ of length $\ell_B$, of k × k matrices A and B, respectively (for $F = F_+$ or $F = F_-$), one may compute an F-generator $G_{AB}$, $H_{AB}$ of AB of length at most $\ell_A + \ell_B + 1$ at the cost $O_A(\log k,\ (\ell_A + \ell_B)^2k)$, whereas an F-generator of A + B of length at most $\ell_A + \ell_B$ is immediately available cost-free.

In view of the latter results, we will study various bounds on the F-ranks and the length of F-generators, in particular regarding the matrices involved in the RD and Newton's iteration with Toeplitz-like input.

Proposition 13.7.

(a) $r_{F_+}(A) \le r_{F_-}(A) + 2$, $r_{F_-}(A) \le r_{F_+}(A) + 2$ for any matrix A. Furthermore, an $F_+$-generator (respectively, $F_-$-generator) of a length ℓ for any matrix A can be immediately transformed (at the cost $O_A(\log n,\ n)$ of performing O(1) convolutions or FFTs) into an $F_-$-generator (respectively, $F_+$-generator) of length at most ℓ + 2 for A.

(b) If A is nonsingular, then $r_{F_+}(A^{-1}) = r_{F_-}(A)$.

The next result is immediately verified (compare Definition 2.6).

Proposition 13.8. Let $GH^T = F_+(W)$ for a k × k matrix W. Then $(GH^T)^{(i)} = F_+(W^{(i)})$ for i = 1, 2, ..., k; furthermore, $r_{F_+}(C) \le r_{F_+}(W) + 1$, $r_{F_+}(E) \le r_{F_+}(W) + 1$, under (2.1), and $r_{F_+}(T) \le r_{F_+}(W) + 2$ for any submatrix T of W formed by contiguous sets of rows and columns of W.

It follows that $r_{F_+}(B) \le r_{F_+}(W)$, under (2.1).

We observe similar relations for trailing principal submatrices and the operator $F_-$. By Proposition 2.7, $S^{-1}$ is a trailing principal submatrix of $W^{-1}$. Therefore, $r_{F_-}(S^{-1}) \le r_{F_-}(W^{-1})$. By applying Proposition 13.7 (b) for A = S and A = W, we obtain that $r_{F_+}(S) \le r_{F_+}(W)$.

Proposition 13.9. Let (2.1) and (2.2) hold, where B, S, and W are nonsingular matrices. Then $\max\{r_{F_+}(B),\ r_{F_+}(S)\} \le r_{F_+}(W)$.

By applying the latter proposition recursively, we bound the $F_+$-rank throughout the RD.

Corollary 13.10. Let $V_\alpha$ be a matrix of the RD of a matrix A. Then, $r_{F_+}(V_\alpha) \le r_{F_+}(A)$.

So far, we have no tools yet to counter the growth of the length of the F-generators in the process of Newton's iteration. Developing such tools (which we call the techniques for the truncation of a generator (TG)) is our next task. Namely, we will next (in Proposition 13.11) show how to compute a shorter F-generator of a matrix having small F-rank but given with its longer F-generator. This is our first technique of TG. It will be used to refine p-adic (Newton–Hensel) lifting to bound the length of the F-generators of the matrices involved there. We will prove easily, based on Propositions 13.7 and 13.9, that such matrices have small F-rank if so has the input matrix. For Newton's iteration of Algorithm 5.1, such a property does not hold, and the F-rank of the computed approximations to the Toeplitz-like inverses may grow quite rapidly. These approximations, however, always have matrices with small F-rank nearby, and we will periodically shift to the latter matrices and then restart Newton's process. Our tool for such a shift will be Algorithm 13.1 (see [PBRZ99] on some alternative tools).


Proposition 13.11. Let an F-generator of a k × k matrix A of length ℓ (for $F = F_+$ or $F = F_-$) and an upper bound $r^* < \ell$ on the F-rank $r_F(A)$ be given. Then an F-generator of A of length at most $r^*$ can be computed at the cost $O_A(\ell,\ k\ell)$.

Proof. Apply the proof of Proposition A.6 of [P92] or the solution of Problem 2.11 of [BP94, pp. 111–112]. Verify that all the computations (including the computation of the LSP factorization or, alternatively, the PLU factorization) can be performed at the claimed overall cost.

Let us next show the promised alternative algorithm for controlling the length of F-generators of matrices involved in Newton's process. The algorithm relies on the SVD truncation of an F-generator, which is our second TG technique.

Algorithm 13.1 ([P92b], [P93], [P93a]).

Input: $F = F_+$ or $F = F_-$, an F-generator G, H of a k × k matrix A of length l, and a natural r′ < l.

Output: an F-generator G′, H′ of a k × k matrix A′ of length at most r′ such that

$$\|A' - A\|_2 \le 2\bigl(1 + 2(r_F(A) - r')k\bigr)\min_Y\|Y - A\|_2, \tag{13.1}$$

where the minimum is over all k × k matrices Y of F-rank at most r′.

Computations:

Stage 1. Compute the singular value decomposition (SVD) of the matrix $GH^T = F(A)$; that is, compute a pair U and V of unitary k × l matrices and an l × l diagonal matrix $\Sigma = \operatorname{diag}(\sigma_1, \ldots, \sigma_l)$ for positive $\sigma_1, \ldots, \sigma_l$ satisfying

$$GH^T = F(A) = U\Sigma V^T.$$

Stage 2. Compute and output an F-generator G′, H′ of A′ of length at most r′ as follows:

$$G' = U\Sigma_{r'}, \quad H' = VI_{r',l},$$

where $\Sigma_{r'} = \operatorname{diag}(\sigma_1, \ldots, \sigma_{r'}, 0, \ldots, 0)$ and $I_{r',l} = \operatorname{diag}(1, \ldots, 1, 0, \ldots, 0)$ are l × l matrices of rank r′.

On the correctness proof of this algorithm, on the bound $O_A(\log k,\ k/\log k)$ for l = O(1), and on the computational cost of its performance, see [P92b], [P93], [P93a].

Remark 13.1. Bound (13.1) is proved in [P92b], [P93], [P93a], based on approximate computation of the SVD at Stage 1 of the algorithm. Any improvement of the approximation of the SVD would decrease the factor 2 of (13.1), which turns into 1 if the SVD is computed exactly.

Remark 13.2. If $r' \ge r_F(A)$, then (13.1) implies that $\|A' - A\|_2 = \min_Y\|Y - A\|_2 = 0$, and then Algorithm 13.1 is an alternative to the algorithm supporting Proposition 13.11, except that the latter algorithm is rational (it can be performed with no errors over the rationals), whereas Algorithm 13.1 has a nonrational, though numerically stable, stage of computing the SVD. This suggests that the algorithm supporting Proposition 13.11 should be applied in Algorithm 12.1, at the p-adic lifting stage, whereas Algorithm 13.1 is a better candidate to use in numerical applications of Algorithm 11.1, performed with rounding.

14. Improvement of the algorithms for the ERD, IRD, inverse, determinant, and rank in the Toeplitz and Toeplitz-like cases. Let us apply the techniques and the results of the previous section to reexamine the computation of the ERD and IRD of a strongly nonsingular n × n matrix A filled with integers in the case where A is a Toeplitz or Toeplitz-like matrix given with its $F_+$-generator G, H of length $r = r_{F_+}(A) = O(1)$.

We recall that $r_{F_+}(V_\alpha) \le r$ for all matrices $V_\alpha$ of the RD of A (compare Corollary 13.10), and we will apply either the algorithm supporting Proposition 13.11 or Algorithm 13.1 in order to decrease (to a level at most r) the length of the computed F-generators of these matrices, in all cases where this length exceeds r. Likewise, we will obtain from Propositions 13.6–13.8 that the computation of $V_{\beta 1,t+1}$, according to (10.1)–(10.7), only involves matrices whose F-ranks are bounded from above by $3r + r_{F_+}(X_{\beta,t}) + 6$.

According to our analysis, the matrix $X_{\beta,t}$ approximates $B_{\beta 0}^{-1}$ for all binary strings β of length at most h − 1, and since $r_{F_+}(B_{\beta 0}) \le r$, we have $r_{F_-}(B_{\beta 0}^{-1}) \le r$, $r_{F_+}(B_{\beta 0}^{-1}) \le r + 2$ (compare Proposition 13.7). We will apply Algorithm 13.1 in order to compute an $F_+$-generator of length at most r + 2 for a matrix $X'_{\beta,t}$ approximating $X_{\beta,t}$ and, therefore, also $V_{\beta 0}^{-1}$. (The approximation of $V_{\beta 0}^{-1}$ by $X'_{\beta,t}$ deteriorates slightly, versus the approximation by $X_{\beta,t}$, but since $X'_{\beta,t}$ still closely approximates the matrix $V_{\beta 0}^{-1}$, we more than compensate for such a deterioration by performing an extra Newton step in (10.5).) Then, all matrices involved in the computation of the ERD and the IRD of A will be represented by their $F_+$-generators of length O(r).

A similar argument is applied to the computation of the p-adic lifting of the ERD of A, except that this argument is simplified since (12.3) and Proposition 13.7 together imply that

$$r_{F_+}\bigl(X_{\beta,t+1} \bmod p^{2^{t-u(\beta)}}\bigr) \le r_{F_-}\bigl(X_{\beta,t+1} \bmod p^{2^{t-u(\beta)}}\bigr) + 2 = r_{F_+}\bigl(B_{\beta 0}^{-1} \bmod p^{2^{t-u(\beta)}}\bigr) + 2 \le r + 2.$$

Thus, to keep the length of the associated $F_+$-generators bounded, we just apply the rational algorithm that supports Proposition 13.11, instead of applying Algorithm 13.1. In fact, we may also apply other alternative techniques for bounding the length of an F-generator of $X_{\alpha,i+1}$; such techniques may rely on using distinct operators F, such as $F_+(A) = AZ - ZA$ (see [BP94, p. 189]) or operators using some f-circulant matrices instead of Z (see [PBRZ99], [P00]).

Finally, it is easily verified (cf. [P96b]) that the computation (of section 12) of a basis for the null-space of A also involves only matrices represented by their $F_+$-generators of length O(r) for a matrix A given with its $F_+$-generator of length r.

Let us now turn to estimating the computational cost, in the case of Toeplitz or Toeplitz-like input. There are two new features versus the case of a general integer input matrix A.

(1) Performing every matrix multiplication, we operate with $F_+$-generators of Toeplitz-like matrices involved in these multiplications and apply Propositions 13.4, 13.6, and Corollary 13.5.

(2) Some of these matrix multiplications are followed by the application of the algorithms supporting Proposition 13.11 or Algorithm 13.1.

The manipulation with the $F_+$-generators enables us to decrease the arithmetic processor bound of Corollary 12.2 from $n^\omega$ to n log n, because concurrent multiplications of $O(2^t)$ pairs of $(n/2^t) \times (n/2^t)$ Toeplitz-like matrices for t = 1, ..., h, h = log n are performed at the overall cost bounded by $O_A(\log n,\ n\log n)$ (versus $O_A(\log n,\ n^\omega)$ in the case of general integer input matrices). The estimated overall cost of the required computations (of $A^{-1}$, det A, and so on) is dominated by the estimated cost of all Toeplitz-like matrix multiplications involved, because, according to section 13, the estimated cost of such a multiplication dominates the estimated cost of the application of both Algorithm 13.1 and the algorithm supporting Proposition 13.11.

Summarizing, we obtain the following result.

Proposition 14.1. If the n × n input Toeplitz-like matrix A is strongly nonsingular and is filled with integers, then one may modify the randomized computation of its ERD and IRD according to the algorithms of sections 6–12 in order to perform all these computations at the overall cost $O_A((\log n)\log(n\log\|A\|),\ n\log n)$.

The cost bounds of Proposition 14.1 are immediately extended to the solution of all the computational problems listed in Theorem 1.1, where now we assume a Toeplitz-like input matrix A and represent its inverse or the basis matrix for its null-space by their short F-generators. (Verifying the correctness of the computation of the rank and the inverse, we should also deal with short F-generators and use Proposition 13.11 to avoid processing $n^2$ entries of n × n matrices, which would have required order of $n^2$ ops.)

To obtain a similar extension of the Boolean complexity bounds of Proposition 12.3 and Theorem 1.1, let us examine the precision of the computations by our algorithms simplified in the Toeplitz-like case. We recall that our Toeplitz-like computations can be ultimately reduced to vector convolutions (Propositions 13.1, 13.4, and 13.6). Thus, we will bound the cost of our computations at the p-adic lifting stage based on the following estimate.

Proposition 14.2. Given two vectors of dimension n filled with integers lying in the range from 0 to $2^k - 1$, the convolution of these vectors can be computed at the Boolean cost $O_B(\log(kn),\ kn\log\log(kn))$.

Proof. The well-known binary segmentation technique (see, e.g., [BP94, section 3.9]) reduces our convolution problem to the multiplication of two integers lying in the range from 0 to $2^{kn} - 1$, and the known algorithms solve this task at the required cost.
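Binary segmentation can be sketched with Python's built-in long integers. The slot width chosen below, 2k plus the bit length of n, is an assumption of this sketch that guarantees that no coefficient of the product overflows its slot; the function name is hypothetical, and Python's own integer multiplication stands in for the fast integer multiplication algorithms that the proof refers to.

```python
def convolve_by_segmentation(u, v, k):
    """Convolve two length-n vectors of integers in [0, 2^k) via one integer product."""
    n = len(u)
    assert len(v) == n
    assert all(0 <= x < (1 << k) for x in u + v)
    # Every output coefficient is below n * 2^(2k), so w bits per slot suffice.
    w = 2 * k + n.bit_length()
    U = sum(x << (w * i) for i, x in enumerate(u))   # pack u into one integer
    V = sum(y << (w * j) for j, y in enumerate(v))   # pack v into one integer
    P = U * V                                        # a single long multiplication
    mask = (1 << w) - 1
    return [(P >> (w * m)) & mask for m in range(2 * n - 1)]

result = convolve_by_segmentation([1, 2, 3], [4, 5, 6], 3)
```

The coefficients of the convolution are read off as the consecutive w-bit slots of the product, so the whole task costs one multiplication of integers of about nw bits each.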

The resulting Boolean cost bounds for performing the p-adic lifting stage will repeat the bounds of section 12, except that the Boolean (like arithmetic) processor bounds will decrease by factor $n^{\omega-1}/\log n$.

Let us show that this holds also for the Boolean cost of the rest of our computation.

When we approximate the ERD of an input Toeplitz-like matrix, we will effectively reduce the computations to performing FFTs (see Propositions 13.1 and 13.4) and will recall Corollary 3.4.1 on pp. 255–256 of [BP94], which shows a numerically stable implementation of FFT. We also recall that the known algorithms for the computation of the SVD of a matrix are numerically stable (see [GL89/96], [P93]). From these observations, we deduce that we may perform the computations with the same bit-precision (up to a constant factor independent of n), no matter whether we apply our original Algorithm 11.1 for an arbitrary n × n input matrix or its Toeplitz-like modification. Since in the latter case we use by factor $n^{\omega-1}/\log n$ fewer arithmetic processors, we will also use by factor $n^{\omega-1}/\log n$ fewer Boolean processors, thus replacing $n^\omega$ for ω of (4.5) by n log n in the Boolean cost estimates of section 12.

This enables us to extend Theorem 1.1 to arrive at Theorem 1.2.

Remark 14.1. Inspection of our algorithms shows immediately that Proposition 14.1 and Theorem 1.2 can be extended to the case where the input matrix A is given with its F-generator of length r, provided that both time and processor bounds increase by factor r. It is possible to confine the cost increase to the processor bound (increasing it by factor $r^2$). The only nontrivial stage is the decrease of the length of F-generators (cf. Proposition 13.11 and Algorithm 13.1). The algorithm supporting Proposition 13.11, however, can be modified by extending the probabilistic techniques of the proof of Theorem 1.1 (this would include, in particular, application of Proposition 2.19 using n + r extra random parameters), whereas Algorithm 13.1 should be replaced by an alternative approach of [PBRZ99].

Remark 14.2. It may seem that Theorems 1.1 and 1.2 can be supported by a substantially simpler construction, and a simplified construction has indeed been proposed in [R95]. Unfortunately, however, the construction of [R95] has no power for supporting the claimed results. In particular, the construction relies on the two "simplifying" recipes cited in our Remarks 6.1 and 11.1, and each of the recipes invalidates the resulting algorithm. (See [P96c] for more details on these and some other of the many mishaps of [R95], and note also that the main result of the paper [R93], cited in [R95], is a rediscovery of some results of [BT90] and [BP91].) It is instructive, for getting better insight, to discuss two other major gaps of the construction of [R95] and of its analysis presented in [R95]. Both gaps are in the area of Toeplitz-like computations, where [R95] becomes particularly prone to serious errors. In [R95], an algorithm of [BA80] is used in order to decrease the length of an F-generator of a matrix A to the level r = rank F(A). Unlike our Algorithm 13.1 for the SVD truncation and our algorithm supporting Proposition 13.11, the algorithm of [BA80] only works if $(F(A))^{(r)}$, the r × r leading principal submatrix of F(A), is nonsingular. Furthermore, to support the algorithm of [R95], one must have the matrix $(F(A))^{(r)}$ well-conditioned. Actually, to salvage the algorithm of [R95] at this point, one would have had to use some techniques that are absent from [R95] and are substantially more advanced than the ones used in [R95]. Likewise, some techniques are required to prevent the F-ranks of the computed approximations to $A^{-1}$ from their disturbing growth (from the desired constant level to the level n) in fewer than log n Newton steps, and then again, such techniques are absent from [R95] and are substantially more advanced than the ones used in [R95]. The growth of the F-ranks immediately implies the growth by the extra factor $n^{\omega-1}\log n$ (for ω of (4.5)) of both arithmetic and Boolean processor complexity bounds, versus the ones claimed in [R95].

15. Discussion. Our paper leaves as a major open question of theoretical importance whether the level of our parallel complexity estimates of Theorem 1.2 for Toeplitz and Toeplitz-like computations can be reached by means of a purely algebraic approach, using no rounding to the closest integers. This question is also of practical interest because the algorithms of this paper involve the exact computation of det A and, therefore, at some stage require us to use the precision of computation of order $\log|\det A|$, which generally means the order of $n\log\|A\|$, even if we only need the output with a much lower precision. Historically, a similar open problem had arisen for computations with general integer matrices, after the appearance of [P85], [P87]. In that case (for general integer matrices), the subsequent works of [KP91], [KP92], [P91], and [P92] gave us an alternative randomized algebraic solution that involved no rounding. Will this be eventually done also in the Toeplitz-like case or at least in the Toeplitz case?

Acknowledgments. Detailed and thoughtful comments by a referee and by a reviewer helped me a great deal to improve my original draft and to make it more accessible for the reader. The request by the area editor Joachim von zur Gathen to incorporate the appendix into the body of the paper also served the same goal.


REFERENCES

[AGr88] G. S. Ammar and W. B. Gragg, Superfast solution of real positive definite Toeplitz systems, SIAM J. Matrix Anal. Appl., 9 (1988), pp. 61–76.

[AHU74] A. V. Aho, J. E. Hopcroft, and J. D. Ullman, The Design and Analysis of Computer Algorithms, Addison-Wesley, Reading, MA, 1974.

[B68] E. H. Bareiss, Sylvester’s identity and multistep integer-preserving Gaussian elimination, Math. Comp., 22 (1968), pp. 565–578.

[BA80] R. R. Bitmead and B. D. O. Anderson, Asymptotically fast solution of Toeplitz and related systems of linear equations, Linear Algebra Appl., 34 (1980), pp. 103–116.

[Be68] E. R. Berlekamp, Algebraic Coding Theory, McGraw-Hill, New York, 1968.

[Be84] S. Berkowitz, On computing the determinant in small parallel time using a small number of processors, Inform. Process. Lett., 18 (1984), pp. 147–150.

[BGH82] A. Borodin, J. von zur Gathen, and J. Hopcroft, Fast parallel matrix and GCD computation, Inform. and Control, 52 (1982), pp. 241–256.

[BGY80] R. P. Brent, F. G. Gustavson, and D. Y. Y. Yun, Fast solution of Toeplitz systems of equations and computation of Padé approximations, J. Algorithms, 1 (1980), pp. 259–295.

[BK87] A. Bruckstein and T. Kailath, An inverse scattering framework for several problems in signal processing, IEEE Acoustics, Speech and Signal Processing (ASSP) Magazine, January 1987, pp. 6–20.

[BL80] D. Bini and G. Lotti, Stability of fast algorithms for matrix multiplication, Numer. Math., 36 (1980), pp. 63–72.

[BMP98] D. Bondyfalat, B. Mourrain, and V. Y. Pan, Controlled iterative methods for solving polynomial systems, in Proceedings of the Annual ACM International Symposium on Symbolic and Algebraic Computation, ACM, New York, 1998, pp. 252–259.

[BP91] D. Bini and V. Y. Pan, Parallel complexity of tridiagonal symmetric eigenvalue problem, in Proceedings of the 2nd Annual ACM-SIAM Symposium on Discrete Algorithms, ACM, New York, SIAM, Philadelphia, 1991, pp. 384–393.

[BP93] D. Bini and V. Y. Pan, Improved parallel computation with Toeplitz-like and Hankel-like matrices, Linear Algebra Appl., 188/189 (1993), pp. 3–29.

[BP94] D. Bini and V. Y. Pan, Polynomial and Matrix Computations, Fundamental Algorithms 1, Birkhäuser, Boston, 1994.

[BT71] W. S. Brown and J. F. Traub, On Euclid’s algorithm and the theory of subresultants, J. ACM, 18 (1971), pp. 505–514.

[BT90] M. Ben-Or and P. Tiwari, Simple algorithm for approximating all roots of a polynomial with real roots, J. Complexity, 6 (1990), pp. 417–442.

[Bun85] J. R. Bunch, Stability of methods for solving Toeplitz systems of equations, SIAM J. Sci. Statist. Comput., 6 (1985), pp. 349–364.

[C47/48] S. Chandrasekhar, On the radiative equilibrium of a stellar atmosphere, Astrophys. J., 106 (1947), pp. 152–216, 107 (1948), pp. 48–72.

[C74] R. W. Cottle, Manifestation of the Schur complement, Linear Algebra Appl., 8 (1974), pp. 189–211.

[Ch85] A. L. Chistov, Fast parallel calculation of the rank of matrices over a field of arbitrary characteristics, in Fundamentals of Computation Theory (Cottbus, 1985), Lecture Notes in Comput. Sci. 199, Springer, Berlin, 1985, pp. 63–69.

[CK91] D. G. Cantor and E. Kaltofen, On fast multiplication of polynomials over arbitrary rings, Acta Inform., 28 (1991), pp. 697–701.

[CKL-A87] J. Chun, T. Kailath, and H. Lev-Ari, Fast parallel algorithm for QR-factorization of structured matrices, SIAM J. Sci. Statist. Comput., 8 (1987), pp. 899–913.

[Cs76] L. Csanky, Fast parallel matrix inversion algorithms, SIAM J. Comput., 5 (1976), pp. 618–623.

[dH87] F. R. de Hoog, On the solution of Toeplitz systems, Linear Algebra Appl., 88/89 (1987), pp. 123–138.

[E67] J. Edmonds, Systems of distinct representatives and linear algebra, J. Res. Nat. Bur. Standards, 71B (1967), pp. 241–245.

[EG88] D. Eppstein and Z. Galil, Parallel algorithmic techniques for combinatorial computation, Annual Rev. Comput. Sci., 3 (1988), pp. 233–283.

[EP97] I. Z. Emiris and V. Y. Pan, The structure of sparse resultant matrices, in Proceedings of the Annual ACM International Symposium on Symbolic and Algebraic Computation, ACM, New York, 1997, pp. 189–196.

[F64] L. Fox, An Introduction to Numerical Linear Algebra, Oxford University Press, Oxford, UK, 1964.

[G84] J. von zur Gathen, Parallel algorithms for algebraic problems, SIAM J. Comput., 13 (1984), pp. 802–824.

[G86] J. von zur Gathen, Parallel arithmetic computations: A survey, in Mathematical Foundations of Computer Science, Lecture Notes in Comput. Sci. 233, Springer, Berlin, 1986, pp. 93–112.

[GKO95] I. Gohberg, T. Kailath, and V. Olshevsky, Fast Gaussian elimination with partial pivoting for matrices with displacement structure, Math. Comp., 64 (1995), pp. 1557–1576.

[GL89/96] G. H. Golub and C. F. Van Loan, Matrix Computations, Johns Hopkins University Press, Baltimore, MD, 1989 (2nd ed.), 1996 (3rd ed.).

[H91] S. Haykin, Adaptive Filter Theory, 2nd ed., Prentice-Hall, Englewood Cliffs, NJ, 1991.

[H95] G. Heinig, Inversion of generalized Cauchy matrices and other classes of structured matrices, in Linear Algebra for Signal Processing, IMA Vol. Math. Appl. 69, Springer, New York, 1995, pp. 95–114.

[IR82] K. Ireland and M. Rosen, A Classical Introduction to Modern Number Theory, Springer, Berlin, 1982.

[J92] J. JáJá, An Introduction to Parallel Algorithms, Addison-Wesley, Reading, MA, 1992.

[K74] T. Kailath, A view of three decades of linear filtering theory, IEEE Trans. Inform. Theory, 20 (1974), pp. 146–181.

[K87] T. Kailath, Signal processing applications of some moment problems, in Moments in Mathematics, Proc. Sympos. App. Math. 37, AMS, Providence, RI, 1987, pp. 71–100.

[K95] E. Kaltofen, Analysis of Coppersmith’s block Wiedemann algorithm for the parallel solution of sparse linear systems, Math. Comp., 64 (1995), pp. 777–806.

[KAGKA89] R. King, M. Ahmadi, R. Gorgui-Naguib, A. Kwabwe, and M. Azimi-Sadjadi, Digital Filtering in One and Two Dimensions: Design and Applications, Plenum Press, New York, 1989.

[KKM79] T. Kailath, S.-Y. Kung, and M. Morf, Displacement ranks of matrices and linear equations, J. Math. Anal. Appl., 68 (1979), pp. 395–407.

[KLM78] T. Kailath, L. Ljung, and M. Morf, A new approach to the determination of Fredholm resolvents of nondisplacement kernels, in Topics in Functional Analysis, I. Gohberg and M. Kac, eds., Academic Press, New York, 1978, pp. 169–184.

[KP91] E. Kaltofen and V. Y. Pan, Processor efficient parallel solution of linear systems over an abstract field, in Proceedings of the 3rd Annual ACM Symposium on Parallel Algorithms and Architectures, ACM, New York, 1991, pp. 180–191.

[KP92] E. Kaltofen and V. Y. Pan, Processor-efficient parallel solution of linear systems II. The positive characteristic and singular cases, in Proceedings of 33rd Annual IEEE Symposium on Foundations of Computer Science, IEEE Computer Society, Los Alamitos, CA, 1992, pp. 714–723.

[KP94] E. Kaltofen and V. Y. Pan, Parallel solution of Toeplitz and Toeplitz-like linear systems over fields of small positive characteristic, in Proceedings of the First International Symposium on Parallel Symbolic Computation, Lecture Notes Ser. Comput. 5, World Scientific, Singapore, 1994, pp. 225–233.

[KR90] R. Karp and V. Ramachandran, A survey of parallel algorithms for shared memory machines, in Handbook for Theoretical Computer Science, J. van Leeuwen, ed., North-Holland, Amsterdam, 1990, pp. 869–941.

[KS91] E. Kaltofen and B. D. Saunders, On Wiedemann’s method for solving sparse linear systems, Proc. AAECC-9, Lecture Notes in Comput. Sci. 539, Springer, Berlin, 1991, pp. 29–38.

[KVM78] T. Kailath, A. Vieira, and M. Morf, Inverses of Toeplitz operators, innovations, and orthogonal polynomials, SIAM Rev., 20 (1978), pp. 106–119.

[L-AK84] H. Lev-Ari and T. Kailath, Lattice filter parametrization and modelling of nonstationary processes, IEEE Trans. Inform. Theory, IT-30 (1984), pp. 2–16.

[L-AKC84] H. Lev-Ari, T. Kailath, and J. Cioffi, Least squares adaptive lattice and transversal filters; a unified geometrical theory, IEEE Trans. Inform. Theory, IT-30 (1984), pp. 222–236.

[Le92] F. T. Leighton, Introduction to Parallel Algorithms and Architectures: Arrays, Trees and Hypercubes, Morgan Kaufmann, San Mateo, CA, 1992.

[LRT79] R. J. Lipton, D. Rose, and R. E. Tarjan, Generalized nested dissection, SIAM J. Numer. Anal., 16 (1979), pp. 346–358.

[Ma75] J. Makhoul, Linear prediction: A tutorial review, Proc. IEEE, 63 (1975), pp. 561–580.

[MC79] R. T. Moenck and J. H. Carter, Approximate algorithms to derive exact solutions to systems of linear equations, in Symbolic and Algebraic Computation, Lecture Notes in Comput. Sci. 72, Springer, Berlin, 1979, pp. 63–73.

[Morf74] M. Morf, Fast Algorithms for Multivariable Systems, Ph.D. Thesis, Stanford University, Stanford, CA, 1974.

[Morf80] M. Morf, Doubling algorithms for Toeplitz and related equations, in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, IEEE Computer Society, Los Alamitos, CA, 1980, pp. 954–959.

[MP98] B. Mourrain and V. Y. Pan, Asymptotic acceleration of solving multivariate polynomial systems of equations, in Proceedings of the ACM Symposium on Theory of Computing, ACM, New York, 1998, pp. 488–496.

[MRK88] G. L. Miller, V. Ramachandran, and E. Kaltofen, Efficient parallel evaluation of straight-line code and arithmetic circuits, SIAM J. Comput., 17 (1988), pp. 687–695.

[Mu81] B. R. Musicus, Levinson and Fast Choleski Algorithms for Toeplitz and Almost Toeplitz Matrices, Internal Report, Lab. of Electronics, M.I.T., Cambridge, MA, 1981.

[OP98] V. Olshevsky and V. Y. Pan, A unified superfast algorithm for boundary rational tangential interpolation problem and for inversion and factorization of dense structured matrices, in Proceedings of the 39th Annual IEEE Symposium on Foundations of Computer Science, IEEE Computer Society, Los Alamitos, CA, 1998, pp. 192–201.

[OP99] V. Olshevsky and V. Y. Pan, Polynomial and rational interpolation and multipoint evaluation (with structured matrices), in Proceedings of the 26th International Colloquium on Automata, Languages and Programming (ICALP 99), Lecture Notes in Comput. Sci. 1644, J. Wiedermann, P. van Emde Boas, and M. Nielsen, eds., Springer, Berlin, 1999, pp. 585–594.

[P85] V. Y. Pan, Fast and efficient parallel algorithms for the exact inversion of integer matrices, in Proceedings of the 5th Annual Conference on Foundations of Software Technology and Theoretical Computer Science, Lecture Notes in Comput. Sci. 206, Springer-Verlag, New York, 1985, pp. 504–521.

[P87] V. Y. Pan, Complexity of parallel matrix computations, Theoret. Comput. Sci., 54 (1987), pp. 65–85.

[P90] V. Y. Pan, Computations with dense structured matrices, Math. Comp., 55 (1990), pp. 179–190.

[P91] V. Y. Pan, Complexity of algorithms for linear systems of equations, in Computer Algorithms for Solving Linear Algebraic Equations (The State of the Art), E. Spedicato, ed., NATO Adv. Sci. Inst. Ser. F Comput. and Systems Sci. 77, Springer, Berlin, 1991, pp. 27–56.

[P92] V. Y. Pan, Parametrization of Newton’s iteration for computations with structured matrices and applications, Comput. Math. Appl., 24 (1992), pp. 61–75.

[P92a] V. Y. Pan, Complexity of computations with matrices and polynomials, SIAM Rev., 34 (1992), pp. 225–262.

[P92b] V. Y. Pan, Parallel solution of Toeplitz-like linear systems, J. Complexity, 8 (1992), pp. 1–21.

[P93] V. Y. Pan, Decreasing the displacement rank of a matrix, SIAM J. Matrix Anal. Appl., 14 (1993), pp. 118–121.

[P93a] V. Y. Pan, Concurrent iterative algorithm for Toeplitz-like linear systems, IEEE Trans. Parallel and Distributed Systems, 4 (1993), pp. 592–600.

[P93b] V. Y. Pan, Parallel solution of sparse linear and path systems, in Synthesis of Parallel Algorithms, J. H. Reif, ed., Morgan Kaufmann, San Mateo, CA, 1993, pp. 621–678.

[P95] V. Y. Pan, Optimal (up to polylog factors) sequential and parallel algorithms for approximating complex polynomial zeros, in Proceedings of the 27th Annual ACM Symposium on Theory of Computing, ACM, New York, 1995, pp. 741–750.

[P96] V. Y. Pan, A new approach to parallel computation of polynomial GCD and to related parallel computations over abstract fields, in Proceedings of the Seventh Annual ACM–SIAM Symposium on Discrete Algorithms, ACM, New York, SIAM, Philadelphia, PA, 1996, pp. 518–527.

[P96a] V. Y. Pan, Optimal and nearly optimal algorithms for approximating polynomial zeros, Comput. Math. Appl., 31 (1996), pp. 97–138.

[P96b] V. Y. Pan, Parallel computation of polynomial GCD and some related parallel computations over abstract fields, Theoret. Comput. Sci., 162 (1996), pp. 173–223.

[P96c] V. Y. Pan, Effective parallel computations with Toeplitz and Toeplitz-like matrices filled with integers, in The Mathematics of Numerical Analysis (Park City, Utah, 1995), Lectures in Appl. Math. 32, J. Renegar, M. Shub, and S. Smale, eds., Amer. Math. Soc., Providence, RI, 1996, pp. 591–641.

[P97] V. Y. Pan, Solving a polynomial equation: Some history and recent progress, SIAM Rev., 39 (1997), pp. 187–220.

[P00] V. Y. Pan, Nearly optimal computations with structured matrices, in Proceedings of the 11th Annual ACM–SIAM Symposium on Discrete Algorithms, ACM, New York, SIAM, Philadelphia, 2000, pp. 953–962.

[P00a] V. Y. Pan, Matrix structure, polynomial arithmetic, and erasure-resilient encoding/decoding, to appear in Proceedings of the ACM International Symposium on Symbolic and Algebraic Computation, ACM, New York, 2000.

[PBRZ99] V. Y. Pan, S. Branham, R. Rosholt, and A. Zheng, Newton’s iteration for structured matrices, in Fast Reliable Algorithms for Matrices with Structure, SIAM, Philadelphia, PA, 1999, pp. 189–210.

[PP95] V. Y. Pan and F. P. Preparata, Work-preserving speed-up of parallel matrix computations, SIAM J. Comput., 24 (1995), pp. 811–821.

[PR89] V. Y. Pan and J. Reif, Fast and efficient parallel solution of dense linear systems, Comput. Math. Appl., 17 (1989), pp. 1481–1491.

[PR91] V. Y. Pan and J. Reif, The parallel computation of the minimum cost paths in graphs by stream contraction, Inform. Process. Lett., 40 (1991), pp. 79–83.

[PR93] V. Y. Pan and J. Reif, Fast and efficient parallel solution of sparse linear systems, SIAM J. Comput., 22 (1993), pp. 1227–1250.

[PSLT93] V. Y. Pan, A. Sadikou, E. Landowne, and O. Tiga, A new approach to fast polynomial interpolation and multipoint evaluation, Comput. Math. Appl., 25 (1993), pp. 25–30.

[PZHY97] V. Y. Pan, A. Zheng, X. Huang, and Y. Yu, Fast multipoints polynomial evaluation and interpolation via computations with structured matrices, Ann. Numer. Math., 4 (1997), pp. 483–510.

[Q94] M. J. Quinn, Parallel Computing: Theory and Practice, McGraw-Hill, New York, 1994.

[R93] J. Reif, An O(n log^3 n) algorithm for the real root problem, in Proceedings of the 34th Annual IEEE Symposium on Foundations of Computer Science, IEEE Computer Society, Los Alamitos, CA, 1993, pp. 626–635.

[R95] J. Reif, Work efficient parallel solution of Toeplitz systems and polynomial GCD, in Proceedings of the 27th Annual ACM Symposium on Theory of Computing, ACM, New York, 1995, pp. 751–761.

[RT90] J. Reif and S. R. Tate, Optimal size integer division circuits, SIAM J. Comput., 19 (1990), pp. 912–924.

[St69] V. Strassen, Gaussian elimination is not optimal, Numer. Math., 13 (1969), pp. 354–356.

[Ste94] W. F. Stewart, Introduction to the Numerical Solution of Markov Chains, Princeton University Press, Princeton, NJ, 1994.

[VSBR83] L. Valiant, S. Skyum, S. Berkowitz, and C. Rackoff, Fast parallel computation of polynomials using few processors, SIAM J. Comput., 12 (1983), pp. 641–644.

[W65] J. H. Wilkinson, The Algebraic Eigenvalue Problem, Clarendon Press, Oxford, UK, 1965.
