Iterative Methods and Combinatorial Preconditioners

Post on 02-Jan-2016


2003-09-10

Maverick Woo

Iterative Methods and Combinatorial Preconditioners

2

This talk is not about…

3

4

Credits

Solving Symmetric Diagonally-Dominant Systems By Preconditioning
Bruce Maggs, Gary Miller, Ojas Parekh, R. Ravi, mw

Combinatorial Preconditioners for Large, Sparse, Symmetric, Diagonally-Dominant Linear Systems
Keith Gremban (CMU PhD 1996)

5

Linear Systems

Ax = b, where A is a known n by n matrix, x is an unknown n by 1 vector, and b is a known n by 1 vector.

A useful way to do matrix algebra in your head: matrix-vector multiplication = linear combination of matrix columns.

6

Matrix-Vector Multiplication

Using $B^T A^T = (AB)^T$, $x^T A$ should also be interpreted as a linear combination of the rows of A.

$$\begin{pmatrix} A_{11} & A_{12} & A_{13} \\ A_{21} & A_{22} & A_{23} \\ A_{31} & A_{32} & A_{33} \end{pmatrix} \begin{pmatrix} a \\ b \\ c \end{pmatrix} = a \begin{pmatrix} A_{11} \\ A_{21} \\ A_{31} \end{pmatrix} + b \begin{pmatrix} A_{12} \\ A_{22} \\ A_{32} \end{pmatrix} + c \begin{pmatrix} A_{13} \\ A_{23} \\ A_{33} \end{pmatrix} = \begin{pmatrix} A_{11}a + A_{12}b + A_{13}c \\ A_{21}a + A_{22}b + A_{23}c \\ A_{31}a + A_{32}b + A_{33}c \end{pmatrix}$$
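A quick numpy sketch (mine, not from the slides) of this column view of matrix-vector multiplication:

```python
import numpy as np

# Matrix-vector multiplication really is a linear combination of the
# columns of A, with the entries of x as the coefficients.
A = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0],
              [7.0, 8.0, 9.0]])
x = np.array([2.0, -1.0, 0.5])

# The usual product...
direct = A @ x

# ...and the same thing as a combination of the columns of A.
by_columns = x[0] * A[:, 0] + x[1] * A[:, 1] + x[2] * A[:, 2]

assert np.allclose(direct, by_columns)
```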

7

How to Solve?

Ax = b

Find A⁻¹

Guess x repeatedly until we guess a solution

Gaussian Elimination

(Strassen had a faster method to find A⁻¹.)

8

Large Linear Systems in The Real World

9

Circuit Voltage Problem

Given a resistive network and the net current flow at each terminal, find the voltage at each node.

[Figure: a resistive network on four nodes, with terminal currents 2A, 2A, and 4A, and edge conductances 3S, 2S, 1S, and 1S.]

Conductance is the reciprocal of resistance. Its unit is the siemens (S).

A node with an external connection is a terminal. Otherwise it’s a junction.

10

Kirchhoff’s Law of Current

At each node, the net current flow = 0.

Consider v1 (recall I = VC: current = voltage difference × conductance). We have

$-2 + 3(v_2 - v_1) + (v_3 - v_1) = 0,$

which after regrouping yields

$3(v_1 - v_2) + (v_1 - v_3) = 2.$

11

Summing Up


12

Solving

[Figure: the same network with the solved node voltages 2V, 2V, 3V, and 0V.]
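A minimal sketch of the recipe on these slides, using a 3-node circuit of my own (the slide’s 4-node network appears only as a figure): edge conductances fill in the Laplacian, each row encodes Kirchhoff’s law, and grounding one node makes the singular system solvable.

```python
import numpy as np

edges = [(0, 1, 3.0), (1, 2, 1.0)]   # (u, v, conductance in siemens)
b = np.array([1.0, 0.0, -1.0])       # net current injected at each node (amps)

n = 3
L = np.zeros((n, n))
for u, v, w in edges:
    L[u, u] += w; L[v, v] += w
    L[u, v] -= w; L[v, u] -= w

# L is singular (every row sums to 0), so fix node 2 at 0V and solve the rest.
volts = np.zeros(n)
volts[:2] = np.linalg.solve(L[:2, :2], b[:2])

assert np.allclose(L @ volts, b)     # Kirchhoff's law holds at every node
```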

13

Did I say “LARGE”?

Imagine this being the power grid of America.


14

Laplacians

Given a weighted, undirected graph G = (V, E), we can represent it as a Laplacian matrix.

[Figure: the example graph on vertices v1–v4 with edge weights 3, 2, 1, 1, and its Laplacian matrix.]

15

Laplacians

Laplacians have many interesting properties, such as

Diagonal entries ≥ 0, each denoting the total incident weight at a node

Off-diagonal entries ≤ 0, each denoting (the negation of) an individual edge weight

Every row sums to 0

Symmetric


16

Net Current Flow

Lemma: Suppose an n by n matrix A is the Laplacian of a resistive network G with n nodes. If y is the n-vector specifying the voltage at each node of G, then Ay is the n-vector representing the net current flow at each node.

17

Power Dissipation

Lemma: Suppose an n by n matrix A is the Laplacian of a resistive network G with n nodes. If y is the n-vector specifying the voltage at each node of G, then yᵀAy is the total power dissipated by G.
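A small numerical check of this lemma, on a tiny example of my own: for a Laplacian L and voltage vector y, the quadratic form yᵀLy equals the sum of w·(y_u − y_v)² over the edges, which is exactly the dissipated power (P = V²·C for a conductance C).

```python
import numpy as np

edges = [(0, 1, 3.0), (0, 2, 1.0), (1, 2, 2.0)]   # (u, v, conductance)
n = 3
L = np.zeros((n, n))
for u, v, w in edges:
    L[u, u] += w; L[v, v] += w
    L[u, v] -= w; L[v, u] -= w

y = np.array([2.0, 1.0, 0.0])   # a voltage at each node

# Power dissipated edge by edge: conductance times squared voltage drop.
power = sum(w * (y[u] - y[v]) ** 2 for u, v, w in edges)

assert np.isclose(y @ L @ y, power)
```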

18

Sparsity

Laplacians arising in practice are usually sparse. The i-th row has (d+1) nonzeros if vi has d neighbors.


19

Sparse Matrix

An n by n matrix is sparse when there are O(n) nonzeros.

A reasonably-sized power grid has far more junctions than this example, and each junction has only a couple of neighbors.

20

Had Gauss owned a supercomputer…

(Would he really work on Gaussian Elimination?)

21

A Model Problem

Let G(x, y) and g(x, y) be continuous functions defined in R and S respectively, where R and S are respectively the region and the boundary of the unit square (as in the figure).

[Figure: the unit square; R is its interior region and S its boundary.]

22

A Model Problem

We seek a function u(x, y) that satisfies Poisson’s equation

$\frac{\partial^2 u}{\partial x^2} + \frac{\partial^2 u}{\partial y^2} = G(x, y)$

in R, and the boundary condition

$u(x, y) = g(x, y)$

in S.

If g(x, y) = 0, this is called a Dirichlet boundary condition.

23

Discretization

Imagine a uniform grid with a small spacing h.


24

Five-Point Difference

Replace the partial derivatives by difference quotients:

$\partial^2 u/\partial x^2 \approx [u(x+h, y) + u(x-h, y) - 2u(x, y)]/h^2$

$\partial^2 u/\partial y^2 \approx [u(x, y+h) + u(x, y-h) - 2u(x, y)]/h^2$

Poisson’s equation now becomes

$4u(x, y) - u(x+h, y) - u(x-h, y) - u(x, y+h) - u(x, y-h) = -h^2 G(x, y)$

Exercise: derive the five-point difference equation from first principles (limits).

25

For each point in R

The total number of equations is $(\frac{1}{h} - 1)^2$.

Now write them in matrix form, and we’ve got one BIG linear system to solve! Each equation has the form

$4u(x, y) - u(x+h, y) - u(x-h, y) - u(x, y+h) - u(x, y-h) = -h^2 G(x, y)$
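A sketch of this assembly step (my own helper, not code from the talk): build the five-point stencil matrix for an m × m interior grid with zero Dirichlet boundary values, following the sign convention of the equation above. Boundary neighbors simply drop out of the matrix (they move to the right-hand side).

```python
import numpy as np

def five_point_matrix(m):
    """Five-point stencil matrix for an m x m interior grid, row-major order."""
    n = m * m                        # one equation per interior grid point
    A = np.zeros((n, n))
    idx = lambda i, j: i * m + j
    for i in range(m):
        for j in range(m):
            A[idx(i, j), idx(i, j)] = 4.0
            for di, dj in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                ni, nj = i + di, j + dj
                if 0 <= ni < m and 0 <= nj < m:   # boundary terms go to b
                    A[idx(i, j), idx(ni, nj)] = -1.0
    return A

A = five_point_matrix(3)             # the 3x3 interior grid of the example
assert A.shape == (9, 9)
assert (A == A.T).all()              # symmetric
assert (A.sum(axis=1) >= 0).all()    # diagonally dominant
```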

26

An Example

Consider u_{3,1}, whose grid neighbors are u_{2,1}, u_{4,1}, u_{3,2}, and u_{3,0}. We have

$4u(3,1) - u(4,1) - u(2,1) - u(3,2) - u(3,0) = -G(3,1)$

which, since u(4,1) and u(3,0) lie on the boundary and are known, can be rearranged to

$4u(3,1) - u(2,1) - u(3,2) = -G(3,1) + u(4,1) + u(3,0)$

27

An Example

Each row and column can have a maximum of 5 nonzeros.


28

Sparse Matrix Again

Really, it’s rare to see large dense matrices arising from applications.

29

Laplacian???

I showed you a system that is not quite a Laplacian. We’ve got way too many boundary points in a 3×3 example.

30

Making It Laplacian

We add a dummy variable and force it to zero. (How to force? Well, look at the rank of this matrix first…)

31

Sparse Matrix Representation

A simple scheme: an array of columns, where each column Aj is a linked list of tuples (i, x).
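A toy sketch of that scheme, with Python lists standing in for the linked lists; the column view also makes matrix-vector multiplication a walk over nonzeros only:

```python
def to_columns(dense):
    """Array of columns; each column is a list of (row, value) tuples."""
    n_rows, n_cols = len(dense), len(dense[0])
    cols = [[] for _ in range(n_cols)]
    for i in range(n_rows):
        for j in range(n_cols):
            if dense[i][j] != 0:
                cols[j].append((i, dense[i][j]))
    return cols

def matvec(cols, x, n_rows):
    """Ax as a linear combination of columns, touching only the nonzeros."""
    y = [0.0] * n_rows
    for j, col in enumerate(cols):
        for i, val in col:
            y[i] += val * x[j]
    return y

A = [[4, -3, 0],
     [-3, 4, -1],
     [0, -1, 1]]
cols = to_columns(A)
assert matvec(cols, [1.0, 1.0, 1.0], 3) == [1.0, 0.0, 0.0]
```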

32

Solving Sparse Systems

Gaussian Elimination again? Let’s look at one elimination step.

33

Solving Sparse Systems

Gaussian Elimination introduces fill.

34

Solving Sparse Systems

Gaussian Elimination introduces fill.

35

Fill

Of course it depends on the elimination order.

Finding an elimination order with minimal fill is hopeless: NP-hard [Garey and Johnson, GT46; Yannakakis, SIAM JADM 1981].

O(log n) approximation: Sudipto Guha, FOCS 2000, Nested Graph Dissection and Approximation Algorithms.

Ω(n log n) lower bound on fill. (Maverick still has not dug up the paper…)

36

When Fill Matters…

Ax = b

Find A⁻¹

Guess x repeatedly until we guess a solution

Gaussian Elimination

37

Inverse Of Sparse Matrices

…are not necessarily sparse either!

[Figure: a sparse matrix B and its dense inverse B⁻¹.]

38

And the winner is…

Ax = b

Find A⁻¹

Guess x repeatedly until we guess a solution

Gaussian Elimination

Can we be so lucky?

39

Iterative Methods

Checkpoint
• How large linear systems actually arise in practice
• Why Gaussian Elimination may not be the way to go

40

The Basic Idea

Start off with a guess x(0). Use x(i) to compute x(i+1) until convergence.

We hope that the process converges in a small number of iterations, and that each iteration is efficient.

41


The RF Method [Richardson, 1910]

$x^{(i+1)} = x^{(i)} - (Ax^{(i)} - b)$

[Figure: A maps the domain to the range; each iteration tests Ax(i) against b in the range, then corrects x(i) in the domain. The gap Ax(i) − b is the residual.]
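As a concrete sketch (my own toy system, not one from the slides), the RF update in numpy; it converges here because this A happens to have all eigenvalues inside (0, 2), so I − A has spectral radius below 1:

```python
import numpy as np

A = np.array([[1.0, 0.3],
              [0.3, 1.0]])
b = np.array([1.0, 2.0])

x = np.zeros(2)                  # the initial guess x(0)
for _ in range(200):
    x = x - (A @ x - b)          # test Ax against b, then correct x

assert np.allclose(A @ x, b)
```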

42

Why should it converge at all?

$x^{(i+1)} = x^{(i)} - (Ax^{(i)} - b)$

[Figure: the same test-and-correct picture as before.]

43

It only converges when…

Theorem: A first-order stationary iterative method

$x^{(i+1)} = Gx^{(i)} + k$

converges iff $\rho(G) < 1$. (For RF, G = I − A and k = b.)

ρ(A) is the maximum absolute eigenvalue of A.
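A numerical check of the theorem, on two matrices of my own choosing, one on each side of the ρ(G) = 1 line:

```python
import numpy as np

# For RF, the iteration matrix is G = I - A; convergence iff rho(G) < 1.
def spectral_radius(G):
    return max(abs(np.linalg.eigvals(G)))

I = np.eye(2)
A_good = np.array([[1.0, 0.3], [0.3, 1.0]])   # eigenvalues 0.7 and 1.3
A_bad = np.array([[3.0, 0.0], [0.0, 1.0]])    # eigenvalue 3, so G has -2

assert spectral_radius(I - A_good) < 1        # RF converges on A_good
assert spectral_radius(I - A_bad) >= 1        # RF diverges on A_bad
```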

44

Fate?

Once we are given the system, we do not have any control on A and b.

How do we even guarantee convergence?

Ax = b

45

Preconditioning

Instead of dealing with A and b, we now deal with B⁻¹A and B⁻¹b:

$B^{-1}Ax = B^{-1}b$

The word “preconditioning” originated with Turing in 1948, but his idea was slightly different.

46

Preconditioned RF

Since we may precompute B⁻¹b by solving By = b, each iteration is dominated by computing B⁻¹Ax(i), which is a multiplication step Ax(i) and a direct-solve step Bz = Ax(i).

Hence a preconditioned iterative method is in fact a hybrid.

$x^{(i+1)} = x^{(i)} - (B^{-1}Ax^{(i)} - B^{-1}b)$

47

The Art of Preconditioning

We have a lot of flexibility in choosing B:

Solving Bz = Ax(i) must be fast.

B should approximate A well for a low iteration count.

[Figure: choices of B on a spectrum from B = I (trivial to solve) to B = A (what’s the point?).]

48

Classics

Jacobi: let D be the diagonal sub-matrix of A; pick B = D.

Gauss-Seidel: let L be the lower triangular part of A with zero diagonal; pick B = L + D.

$x^{(i+1)} = x^{(i)} - (B^{-1}Ax^{(i)} - B^{-1}b)$
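A sketch of preconditioned RF with the Jacobi choice B = D, on a small system of my own. Plain RF would diverge on this A (it has an eigenvalue above 2), but dividing through by the diagonal pulls the spectrum of B⁻¹A close to 1:

```python
import numpy as np

A = np.array([[10.0, 1.0],
              [1.0, 8.0]])
b = np.array([11.0, 9.0])            # exact solution is (1, 1)

D_inv = np.diag(1.0 / np.diag(A))    # B = D, so applying B^-1 is cheap
x = np.zeros(2)
for _ in range(100):
    # x(i+1) = x(i) - (B^-1 A x(i) - B^-1 b)
    x = x - D_inv @ (A @ x - b)

assert np.allclose(x, [1.0, 1.0])
```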

49

“Combinatorial”

We choose to measure how well B approximates A by comparing combinatorial properties of (the graphs represented by) A and B.

Hence the term “Combinatorial Preconditioner”.

50

Questions?

51

Graphs as Matrices

52

Edge Vertex Incidence Matrix

Given an undirected graph G = (V, E), let Γ be a |E| × |V| matrix over {−1, 0, 1}.

For each edge e = (u, v), set Γe,u to −1 and Γe,v to 1. All other entries are zero.

53

Edge Vertex Incidence Matrix

[Figure: an example graph on vertices a–j and its edge-vertex incidence matrix, with one row per edge and one column per vertex a, b, c, d, e, f, g, h, i, j.]

54

Weighted Graphs

Let W be an |E| × |E| diagonal matrix where We,e is the weight of the edge e.

[Figure: the example graph with edge weights 3, 6, 2, 1, 8, 9, 3, 2, 4, 5, 1.]

55

Laplacian

The Laplacian of G is defined to be ΓᵀWΓ.

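A sketch of these definitions on a small weighted graph of my own: build the incidence matrix Γ and diagonal weight matrix W, form L = ΓᵀWΓ, and check the Laplacian properties listed earlier:

```python
import numpy as np

edges = [(0, 1), (1, 2), (0, 2)]
weights = [3.0, 2.0, 1.0]
n = 3

# Gamma: one row per edge, -1 at the tail and +1 at the head.
Gamma = np.zeros((len(edges), n))
for e, (u, v) in enumerate(edges):
    Gamma[e, u] = -1.0
    Gamma[e, v] = 1.0
W = np.diag(weights)

L = Gamma.T @ W @ Gamma

assert np.allclose(L, L.T)                       # symmetric
assert np.allclose(L.sum(axis=1), 0.0)           # every row sums to 0
assert np.allclose(np.diag(L), [4.0, 5.0, 3.0])  # total incident weights
```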

56

Properties of Laplacians

Let L = ΓᵀWΓ. “Prove by example.”



58

Properties of Laplacians

A matrix A is Positive SemiDefinite if

$\forall x,\; x^T A x \ge 0.$

Since L = ΓᵀWΓ, it’s easy to see that for all x,

$x^T(\Gamma^T W \Gamma)x = (W^{\frac{1}{2}}\Gamma x)^T(W^{\frac{1}{2}}\Gamma x) \ge 0.$


60

Graph Embedding

A Primer

61

Graph Embedding

Vertex in G ↦ vertex in H. Edge in G ↦ path in H.

[Figure: the guest graph G on vertices a–j and its embedding into a host tree H.]

62

Dilation

For each edge e in G, define dil(e) to be the number of edges in its corresponding path in H.


63

Congestion

For each edge e in H, define cong(e) to be the number of embedding paths that use e.

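A toy sketch of these two definitions. The embedding maps each guest edge to a path (a list of host edges); the three-vertex graph and star host here are my own, not the ones in the figures:

```python
# Guest edges of a triangle {a, b, c}, each embedded as a path through
# the center r of a star-shaped host.
embedding = {
    ("a", "b"): [("a", "r"), ("r", "b")],
    ("b", "c"): [("b", "r"), ("r", "c")],
    ("a", "c"): [("a", "r"), ("r", "c")],
}

def dil(guest_edge):
    """Number of host edges on the path for this guest edge."""
    return len(embedding[guest_edge])

def cong(host_edge):
    """Number of embedding paths using this host edge (either direction)."""
    key = frozenset(host_edge)
    return sum(any(frozenset(e) == key for e in path)
               for path in embedding.values())

assert dil(("a", "b")) == 2
assert cong(("r", "c")) == 2   # used by the paths for (b,c) and (a,c)
```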

64

Support Theory

65

Disclaimer

The presentation to follow is only “essentially correct”.

66

Support

Definition: The support required by a matrix B for a matrix A, both n by n in size, is defined as

$\sigma(A/B) := \min\{\tau \in \mathbb{R} \mid \forall x,\; x^T(\tau B - A)x \ge 0\}$

Think of B supporting A at the bottom, as in the fraction A/B.
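A numerical sketch of this definition (on positive definite matrices of my own, so the minimum exists and no null-space care is needed): the smallest τ with xᵀ(τB − A)x ≥ 0 for all x is the largest eigenvalue of the symmetrized pencil B^{-1/2} A B^{-1/2}.

```python
import numpy as np

def support(A, B):
    """sigma(A/B) for symmetric A and positive definite B."""
    w, V = np.linalg.eigh(B)
    B_inv_half = V @ np.diag(w ** -0.5) @ V.T
    return max(np.linalg.eigvalsh(B_inv_half @ A @ B_inv_half))

A = np.array([[2.0, 0.0], [0.0, 6.0]])
B = np.array([[1.0, 0.0], [0.0, 2.0]])

tau = support(A, B)              # here: max(2/1, 6/2) = 3

# Sanity check: tau*B - A is PSD, i.e. tau copies of B "support" A.
assert min(np.linalg.eigvalsh(tau * B - A)) >= -1e-9
```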

67

Support With Laplacians

Life is good when the matrices are Laplacians. Remember the resistive circuit analogy?

68

Power Dissipation

Lemma: Suppose an n by n matrix A is the Laplacian of a resistive network G with n nodes. If y is the n-vector specifying the voltage at each node of G, then yᵀAy is the total power dissipated by G.

69

Circuit-Speak

Read this aloud in circuit-speak: “The support for A by B is the minimum number τ so that, for all possible voltage settings, τ copies of B burn at least as much energy as one copy of A.”

$\sigma(A/B) := \min\{\tau \in \mathbb{R} \mid \forall x,\; x^T(\tau B - A)x \ge 0\}$

70

Congestion-Dilation Lemma

Given an embedding from G to H, the support σ(G/H) is at most (maximum congestion) × (maximum dilation).

71

Transitivity

$\sigma(A/C) \le \sigma(A/B) \cdot \sigma(B/C)$

Pop Quiz: for Laplacians, prove this in circuit-speak.

72

Generalized Condition Number

Definition: The generalized condition number of a pair of PSD matrices A and B is κ(A, B) = σ(A/B) · σ(B/A).

A is Positive Semidefinite iff ∀x, xᵀAx ≥ 0.

73

Preconditioned Conjugate Gradient

Solving the system Ax = b using PCG with preconditioner B requires at most O(√κ(A, B) · log(1/ε)) iterations to find an ε-approximate solution.

The convergence rate is dependent on the actual iterative method used.
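A compact sketch of the standard textbook PCG algorithm (not code from the paper), with the preconditioner applied via an explicit inverse purely for clarity; in practice one would direct-solve Bz = r instead:

```python
import numpy as np

def pcg(A, b, B_inv, tol=1e-10, max_iter=100):
    """Preconditioned conjugate gradient for SPD A; B_inv applies B^-1."""
    x = np.zeros_like(b)
    r = b - A @ x                    # residual
    z = B_inv @ r                    # preconditioned residual
    p = z.copy()                     # search direction
    iters = 0
    while np.linalg.norm(r) > tol and iters < max_iter:
        alpha = (r @ z) / (p @ A @ p)
        x = x + alpha * p
        r_new = r - alpha * (A @ p)
        z_new = B_inv @ r_new
        beta = (r_new @ z_new) / (r @ z)
        p = z_new + beta * p
        r, z = r_new, z_new
        iters += 1
    return x, iters

A = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])
B_inv = np.diag(1.0 / np.diag(A))    # Jacobi preconditioner B = D

x, iters = pcg(A, b, B_inv)
assert np.allclose(A @ x, b)
assert iters <= 2                    # CG is exact in n steps for n = 2
```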

74

Support Trees

75

Information Flow

In many iterative methods, the only operation using A directly is to compute Ax(i) in each iteration.

Imagine each node is an agent maintaining its value. The update formula specifies how each agent should update its value for round (i+1) given all the values in round i.

$x^{(i+1)} = x^{(i)} - (Ax^{(i)} - b)$

76

The Problem With Multiplication

Only neighbors can “communicate” in a multiplication, which happens once per iteration.


77

Diameter As A Natural Lower Bound

In general, for a node to settle on its final value, it needs to “know” at least the initial values of the other nodes.

Diameter is the maximum shortest-path distance between any pair of nodes.

78

Preconditioning As Shortcutting

By picking B carefully, we can introduce shortcuts for faster communication.

$x^{(i+1)} = x^{(i)} - (B^{-1}Ax^{(i)} - B^{-1}b)$

But is it easy to find shortcuts in a sparse graph to reduce its diameter?

79

Square Mesh

Let’s pick the complete graph induced on all the mesh points.

Mesh: O(n) edges. Complete graph: O(n²) edges.

80

Bang!

So exactly how do we propose to solve a dense n by n system faster than a sparse one?

B can have at most O(n) edges, i.e., sparse…

81

Support Tree

Build a Steiner tree to introduce shortcuts!

If we pick a balanced tree, no node will be farther than O(log n) hops away.

We need to specify weights on the tree edges.


82

Mixing Speed

The speed of communication is proportional to the corresponding coefficients on the paths between nodes.


83

Setting Weights

The nodes should be able to talk at least as fast as they could without shortcuts.

How about setting all the weights to 1?


84

Recall PCG’s Convergence Rate

Solving the system Ax = b using PCG with preconditioner B requires at most O(√κ(A, B) · log(1/ε)) iterations to find an ε-approximate solution.

85

Size Matters

How big is the preconditioner matrix B? (2n−1) by (2n−1).

The major contribution (among others) of our paper is to deal with this disparity in size.


86

The Search Is Hard

Finding the “right” preconditioner is really a tradeoff:

Solving Bz = Ax(i) must be fast.

B should approximate A well for a low iteration count.

It could very well be harder than we think.

87

How to Deal With Steiner Nodes?

88

The Trouble of Steiner Nodes

Computation:

$x^{(i+1)} = x^{(i)} - (B^{-1}Ax^{(i)} - B^{-1}b)$

Definitions, e.g.,

$\sigma(B/A) := \min\{\tau \in \mathbb{R} \mid \forall x,\; x^T(\tau A - B)x \ge 0\}$

Neither makes sense as-is when B is (2n−1) by (2n−1) but A is only n by n.

89

Generalized Support

Let

$B = \begin{pmatrix} T & U \\ U^T & W \end{pmatrix}$

where W is n by n. Then σ(B/A) is defined to be

$\min\left\{\tau \in \mathbb{R} \;\middle|\; \forall x,\; \tau x^T A x \ge \begin{pmatrix} y \\ x \end{pmatrix}^T B \begin{pmatrix} y \\ x \end{pmatrix}\right\}$

where $y = -T^{-1}Ux$.

90

Circuit-Speak

Read this aloud in circuit-speak: “The support for B by A is the minimum number τ so that, for all possible voltage settings at the terminals, τ copies of A burn at least as much energy as one copy of B.”

$\sigma(B/A) := \min\left\{\tau \in \mathbb{R} \;\middle|\; \forall x,\; \tau x^T A x \ge \begin{pmatrix} y \\ x \end{pmatrix}^T B \begin{pmatrix} y \\ x \end{pmatrix}\right\}$

91

Thomson’s Principle

Fix the voltages at theterminals. The voltages at the junctions will be set such that the total power dissipation in the circuit is minimized.

[Figure: a circuit with terminal voltages fixed at 2V, 1V, 3V, and 5V.]

$\sigma(B/A) := \min\left\{\tau \in \mathbb{R} \;\middle|\; \forall x,\; \tau x^T A x \ge \begin{pmatrix} y \\ x \end{pmatrix}^T B \begin{pmatrix} y \\ x \end{pmatrix}\right\}$

92

Racke’s Decomposition Tree

93

Laminar Decomposition

A laminar decomposition naturally defines a tree.

[Figure: the graph on vertices a–j and the tree defined by its laminar decomposition.]

94

Racke, FOCS 2002

Given a graph G of n nodes, there exists a laminar decomposition tree T with all the “right” properties as a preconditioner for G.

Except his advisor didn’t tell him about this…

95

For More Details

Our paper is available online: http://www.cs.cmu.edu/~maverick/

Our contributions:

An analysis of Racke’s decomposition tree as a preconditioner

Tools for reasoning about support between Laplacians of different dimensions

96

The Search Is NOT Over

Future Directions:

Racke’s tree can take exponential time to find. (There are many recent improvements, but they are not quite there yet.)

Insisting on a balanced tree can hurt in many easy cases.