Random walk: notes

    László Lovász

    December 5, 2017

Contents

1 Basics
1.1 Random walks and finite Markov chains
1.2 Matrices of graphs
1.3 Stationary distribution
1.4 Harmonic functions
2 Times
2.1 Return time
2.2 Hitting time
2.3 Commute time
2.4 Cover time
2.5 Universal traverse sequences
3 Mixing
3.1 Eigenvalues
3.2 Coupling
3.2.1 Random coloring of a graph
3.3 Conductance
4 Stopping rules
4.1 Exit frequencies
4.2 Mixing and ε-mixing
5 Applications
5.1 Volume computation
5.1.1 What is a convex body?
5.2 Lower bounds on the complexity
5.2.1 Monte-Carlo algorithms
5.2.2 Measurable Markov chains
5.2.3 Isoperimetry
5.2.4 The ball walk
5.3 Random spanning tree

    1 Basics

    1.1 Random walks and finite Markov chains

Let G = (V,E) be a connected finite graph with n ≥ 2 vertices and m edges. We usually assume that V = {1, . . . , n}. Let a(i, j) denote the number of edges connecting i and j.

A random walk on G is an (infinite) sequence of random vertices v0, v1, v2, . . ., where v0 is chosen from some given initial probability distribution σ^0 on V (often concentrated on a single point) and, for each t ≥ 0, vt+1 is obtained by choosing an edge from the uniform distribution on the set of edges incident with vt, and moving to its other endpoint. We denote by σ^t the distribution of vt: σ^t_i = P(vt = i), and we write σ^t for the vector (σ^t_i : i ∈ V).

If we are at node i, then the probability of moving to node j is pij = a(i, j)/d(i), where d(i) is the degree of i. In the case of a simple graph,

pij = 1/d(i) if ij ∈ E(G), and 0 otherwise. (1)

Note that

∑_{j∈V} pij = 1 for every i ∈ V. (2)

    A nonnegative matrix P = (pij)ni,j=1 satisfying (2) defines a finite Markov chain. A walk of the

    chain is the random sequence (v0, v1, v2, . . .) where v0 is chosen from some initial probability

    distribution σ0 and vt+1 is chosen from the probability distribution (pvt,j : j = 1, . . . , n).

    All these choices are made independently.
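These definitions translate directly into code. The following sketch (not part of the notes; the two-state example chain is an arbitrary choice) simulates a walk of a finite Markov chain by drawing v_{t+1} from the row of P indexed by v_t:

```python
import random

# Simulate a walk (v_0, v_1, ..., v_T) of a finite Markov chain with
# transition matrix P, given as a list of rows; row i is (p_i1, ..., p_in).
def walk(P, v0, steps, seed=0):
    rng = random.Random(seed)
    path = [v0]
    for _ in range(steps):
        i = path[-1]
        # v_{t+1} is drawn from the distribution (p_i1, ..., p_in)
        path.append(rng.choices(range(len(P)), weights=P[i])[0])
    return path

# an arbitrary two-state illustration (a Wait/Success chain with p = 0.3)
P = [[0.7, 0.3],
     [0.0, 1.0]]
path = walk(P, 0, 50)
# every transition taken must have positive probability
assert all(P[path[t]][path[t + 1]] > 0 for t in range(len(path) - 1))
```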

We call the Markov chain irreducible if for every S ⊆ V with S ≠ ∅ and S ≠ V, there are i ∈ S and j ∈ V \ S with pij > 0. For the random walk on a graph, this just means that the graph is connected.

    Proposition 1.1 In every irreducible Markov chain, every node is visited infinitely often

    with probability 1.

    We conclude with some examples.

Example 1.2 (Waiting Problem) Let A1, A2, . . . be a sequence of independent events, each with probability p. Let A_N be the first one that occurs. Then E(N) = 1/p. Proof: E(N) = p · 1 + (1 − p)(1 + E(N)), and solving this equation gives E(N) = 1/p.


This can be modeled as a Markov chain with two states W (for Wait) and S (for Success),

    where

    pW,W = 1− p, pW,S = p,

    pS,W = 0, pS,S = 1.

    (The last two values are given only to have a full Markov chain, since we don’t care what

    happens after reaching S.)

    An alternative proof would use the formula

    P(N = t) = (1− p)t−1p.

From this one could easily compute the expectation of N² and the variance of N:

E(N²) = (2 − p)/p², Var(N) = (1 − p)/p².

Instead of the independence of the events Ai, it would suffice to assume that P(Ai | Ā1 ∧ · · · ∧ Āi−1) = p, where Āj denotes the complement of Aj. If we only assume that P(Ai | Ā1 ∧ · · · ∧ Āi−1) ≥ p, then the inequalities E(N) ≤ 1/p and E(N²) ≤ (2 − p)/p² still follow.
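A quick Monte Carlo sanity check of these formulas (a sketch, not from the notes; p = 0.25 and the sample size are arbitrary choices):

```python
import random

# Empirical check of E(N) = 1/p and E(N^2) = (2 - p)/p^2 for the Waiting
# Problem, where N is the index of the first event that occurs.
rng = random.Random(1)
p = 0.25

def waiting_time():
    n = 1
    while rng.random() >= p:   # event A_n fails with probability 1 - p
        n += 1
    return n

samples = [waiting_time() for _ in range(200_000)]
mean = sum(samples) / len(samples)
mean_sq = sum(x * x for x in samples) / len(samples)
assert abs(mean - 1 / p) < 0.05                # E(N) = 1/p = 4
assert abs(mean_sq - (2 - p) / p ** 2) < 1.0   # E(N^2) = (2-p)/p^2 = 28
```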

Example 1.3 (Gambler's Ruin) A gambler is betting on coin flips; at every turn, he loses one dollar or gains one dollar. He starts with k dollars, and sets a target wealth of n > k dollars. He quits if he either reaches n dollars (a win) or loses all his money (a loss). What is his probability of winning?

This can be phrased as a random walk on a path with nodes {0, 1, . . . , n}, starting at node k. Let f(k) be the probability of hitting n before 0. Then f(n) = 1, f(0) = 0, and

f(k) = (1/2) f(k − 1) + (1/2) f(k + 1) (1 ≤ k ≤ n − 1).

So the numbers f(k) form an arithmetic progression, and hence f(k) = k/n.
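The boundary-value problem above can be checked by solving the linear equations directly (a sketch, not from the notes; n = 10 is an arbitrary choice):

```python
import numpy as np

# Solve f(k) = (f(k-1) + f(k+1))/2 for 1 <= k <= n-1 with f(0) = 0 and
# f(n) = 1, then compare with the closed form f(k) = k/n.
n = 10
A = np.zeros((n + 1, n + 1))
b = np.zeros(n + 1)
A[0, 0] = 1.0               # boundary condition f(0) = 0
A[n, n] = 1.0
b[n] = 1.0                  # boundary condition f(n) = 1
for k in range(1, n):
    A[k, k] = 1.0
    A[k, k - 1] = A[k, k + 1] = -0.5   # f(k) - (f(k-1)+f(k+1))/2 = 0
f = np.linalg.solve(A, b)
assert np.allclose(f, np.arange(n + 1) / n)   # f(k) = k/n
```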

Example 1.4 (Coupon Collector Problem) A type of cereal box contains one of n different coupons, each with the same probability. How many boxes do you have to open, in expectation, in order to collect all of the different coupons?

Let Ti denote the first time when i coupons have been collected. So T1 = 0 < T2 = 1 < . . . < Tn, and we want to determine E(Tn). The difference Tk+1 − Tk is the number of boxes you open before finding a new coupon, when you have k coupons already. Each box contains a new coupon with probability (n − k)/n, independently of the previous steps. Hence by the Waiting Problem (Example 1.2),

E(Tk+1 − Tk) = n/(n − k),

and so the total expected time is

E(Tn) = ∑_{k=0}^{n−1} E(Tk+1 − Tk) = ∑_{k=0}^{n−1} n/(n − k) = n·har(n) ∼ n ln n,

where

har(n) = 1 + 1/2 + 1/3 + · · · + 1/n.

    This can be modeled as a random walk on a complete graph with a loop at each node.
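A simulation comparing the empirical mean of Tn with n·har(n) (a sketch, not from the notes; n = 20 and the number of trials are arbitrary choices):

```python
import random

rng = random.Random(2)
n = 20

def boxes_to_collect_all():
    # open boxes until all n coupon types have been seen
    seen, steps = set(), 0
    while len(seen) < n:
        seen.add(rng.randrange(n))
        steps += 1
    return steps

har = sum(1 / k for k in range(1, n + 1))
trials = 20_000
mean = sum(boxes_to_collect_all() for _ in range(trials)) / trials
assert abs(mean - n * har) / (n * har) < 0.05   # within 5% of n*har(n)
```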

    Example 1.5 (Card shuffling) The usual “riffle shuffle” can be modeled as a Markov chain

    on the set of all permutations of 52 cards. A step is generated by taking a random sequence

    ε1ε2 . . . ε52 of 0’s and 1’s of length 52. If there are a 1’s in the sequence, we take off a cards

    from the top of the deck, place them to the right of the rest, and then merge the two piles so

    that the i-th card comes from the pile on the right if and only if εi = 1. The question is, how

    many steps are needed to “shuffle well”, i.e., to get a permutation approximately uniformly

    distributed over all permutations? The surprising answer is: seven shuffle moves suffice.

1.2 Matrices of graphs

We need the following theorem. We call an n × n matrix irreducible if it does not contain a k × (n − k) block of 0's disjoint from the diagonal.

Theorem 1.6 (Perron–Frobenius) If an n × n matrix has nonnegative entries, then it has a nonnegative real eigenvalue λ which has maximum absolute value among all eigenvalues. This eigenvalue λ has a nonnegative real eigenvector. If, in addition, the matrix is irreducible, then λ has multiplicity 1 and the corresponding eigenvector is positive (up to sign change).

The adjacency matrix of a simple graph G is defined as the n × n matrix A = AG = (Aij) in which

Aij = 1 if i and j are adjacent, and 0 otherwise.

If G has multiple edges, then we let Aij = a(i, j). We could also allow loops and include this information in the diagonal. In this course, a loop at node i adds one to the degree of the node.

The Laplacian of the graph is defined as the n × n matrix L = LG = (Lij) in which

Lij = d(i) if i = j, and −a(i, j) if i ≠ j.

So L = D − A, where D = DG is the diagonal matrix of the degrees of G. Clearly L1 = 0.
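For concreteness, here are these matrices for a small example graph (the 4-cycle, an arbitrary choice; the code is an illustration, not part of the notes):

```python
import numpy as np

# Adjacency matrix A, degree matrix D, and Laplacian L = D - A of the 4-cycle.
A = np.array([[0, 1, 0, 1],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [1, 0, 1, 0]], dtype=float)
D = np.diag(A.sum(axis=1))   # degrees d(i) on the diagonal
L = D - A
assert np.allclose(L @ np.ones(4), 0.0)   # L1 = 0
assert np.allclose(L, L.T)                # L is symmetric
```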


Let P = D^{-1}A be the transition matrix of the random walk. Explicitly,

(P)ij = pij = a(i, j)/d(i).

If G is d-regular, then P = (1/d)A.

The matrix P is not symmetric in general, but the equation

D^{1/2} P D^{-1/2} = D^{-1/2} A D^{-1/2}

shows that P is similar to the symmetric matrix P̂ = D^{-1/2} A D^{-1/2}. In particular, the eigenvalues λ1 > λ2 ≥ · · · ≥ λn of P are also eigenvalues of P̂, and hence they are real numbers. If w1, . . . , wn are the corresponding orthonormal eigenvectors, then the spectral decomposition

P̂ = ∑_{k=1}^n λk wk wk^T

gives the decomposition

P = ∑_{k=1}^n λk uk vk^T, (3)

where the uk = D^{-1/2} wk are right eigenvectors and the vk = D^{1/2} wk are left eigenvectors of P. We have uk^T vl = 1(k = l) (i.e., 1 if k = l and 0 otherwise), and u1 = 1 and v1 = π.

We can express the distribution after t steps by a simple formula. We have

σ^{t+1}_j = ∑_i σ^t_i a(i, j)/d(i).

This can be written as σ^{t+1} = P^T σ^t, and hence

σ^t = (P^T)^t σ^0 = ∑_{k=1}^n λk^t (uk^T σ^0) vk. (4)

The matrix entry (P^t)ij is the probability that starting at i we reach j in t steps.
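Formula (4) can be verified numerically; the sketch below (not part of the notes; the path with 3 nodes and t = 10 are arbitrary choices) compares repeated multiplication by P^T with the spectral expression built from P̂:

```python
import numpy as np

# Check sigma^t = (P^T)^t sigma^0 against sum_k lambda_k^t (u_k^T sigma^0) v_k.
A = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]], dtype=float)
D = np.diag(A.sum(axis=1))
P = np.linalg.inv(D) @ A
sigma0 = np.array([1.0, 0.0, 0.0])        # walk started at node 0

# ten steps by repeated multiplication with P^T
sigma = sigma0.copy()
for _ in range(10):
    sigma = P.T @ sigma

# the same ten steps via the symmetric matrix Phat = D^{-1/2} A D^{-1/2}
Dh = np.diag(np.sqrt(A.sum(axis=1)))
Dhi = np.linalg.inv(Dh)
Phat = Dhi @ A @ Dhi
lam, W = np.linalg.eigh(Phat)             # orthonormal eigenvectors of Phat
U = Dhi @ W                               # columns: right eigenvectors u_k of P
V = Dh @ W                                # columns: left eigenvectors v_k of P
spectral = sum(lam[k] ** 10 * (U[:, k] @ sigma0) * V[:, k] for k in range(3))
assert np.allclose(sigma, spectral)
```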

Theorem 1.7 −1 is an eigenvalue of P if and only if G is bipartite.

    1.3 Stationary distribution

Considering a random walk on a graph G, define

πi = d(i)/(2m) (i ∈ V). (5)

This is a probability distribution on V, called the stationary distribution of the chain. Note that

πi pij = a(i, j)/(2m), (6)

which is symmetric in i and j.


If we start the chain from initial distribution σ^0 = π, then

σ^1_j = ∑_i πi pij = ∑_i a(i, j)/(2m) = d(j)/(2m) = πj,

and repeating this, we see that the distribution σ^t after any number of steps remains π. (This explains the name.)

In the more general case of Markov chains, we cannot give such a simple definition, but the Perron–Frobenius Theorem implies that there is a distribution (πi : i = 1, . . . , n) preserved by the chain, i.e.,

∑_{i=1}^n πi pij = πj. (7)

So π is a left eigenvector of the transition matrix P, belonging to the eigenvalue 1. (The corresponding right eigenvector is 1.)

    It also follows from the Perron–Frobenius Theorem that the stationary distribution of an

    irreducible Markov chain is unique.

    If G is regular, then the Markov chain is symmetric: puv = pvu. We say that the

    Markov chain is time-reversible, if πupuv = πvpvu for all u and v. The random walk on an

    (undirected) graph is time-reversible by (6). (Informally, this means that a random walk

    considered backwards is also a random walk.)
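Definition (5), equation (7) and time-reversibility are easy to confirm numerically (a sketch, not from the notes; the "paw" graph, a triangle with a pendant edge, is an arbitrary example):

```python
import numpy as np

# Check pi_i = d(i)/(2m) is preserved by P, and pi_u p_uv = pi_v p_vu.
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
d = A.sum(axis=1)
m = A.sum() / 2                       # number of edges
P = A / d[:, None]                    # p_ij = a(i,j)/d(i)
pi = d / (2 * m)                      # stationary distribution (5)
assert np.allclose(pi @ P, pi)        # pi is preserved by the chain
Q = pi[:, None] * P                   # Q_ij = pi_i p_ij = a(i,j)/(2m)
assert np.allclose(Q, Q.T)            # time-reversibility
```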

Theorem 1.8 For the random walk on a non-bipartite graph, the distribution of vt tends to the stationary distribution as t → ∞.

This is not true for bipartite graphs if n > 1, since vt is concentrated on one or the other color class, depending on the parity of t.

Proof. By the Perron–Frobenius Theorem, every eigenvalue of P is in the interval [−1, 1]; if G is non-bipartite, then it follows by Theorem 1.7 that −1 is not an eigenvalue, so |λk| < 1 for k ≥ 2. Using (4), we see that

σ^t = ∑_{k=1}^n λk^t (uk^T σ^0) vk → (u1^T σ^0) v1 (t → ∞).

We know that u1 = 1 and v1 = π, and u1^T σ^0 = ∑_i σ^0_i = 1, so σ^t → π as claimed. □

    1.4 Harmonic functions

Let G be a connected simple graph, S ⊆ V and f : V → R. The function f is called harmonic at a node v ∈ V if

(1/d(v)) ∑_{u∈N(v)} f(u) = f(v), (8)

asserting that the value of f at v is the average of its values at the neighbors of v. A node where a function is not harmonic is called a pole of the function. Another way of writing the condition that f is harmonic at every node outside S is

∑_{u∈N(v)} (f(v) − f(u)) = 0 for all v ∈ V \ S. (9)

If we allow multiple edges, then the definition is

(1/d(v)) ∑_{u∈V} a(u, v) f(u) = f(v), (10)

where a(u, v) is the multiplicity of the edge uv.

    Every constant function is harmonic at each node. On the other hand,

    Proposition 1.9 Every nonconstant function on the nodes of a connected graph has at least

    two poles.

    Proof. Let S be the set where the function assumes its maximum, and let S′ be the set of

    those nodes in S that are connected to any node outside S. Then every node in S′ must be

a pole, since in (8), every value f(u) on the left-hand side is at most f(v), and at least one

    is less, so the average is less than f(v). Since the function is nonconstant, S is a nonempty

    proper subset of V , and since the graph is connected, S′ is nonempty. So there is a pole

    where the function attains its maximum. Similarly, there is another pole where it attains its

minimum. □

For any two nodes, there is a nonconstant function that is harmonic everywhere else (i.e., at every other node). More generally, we have the following theorem.

Theorem 1.10 For a connected simple graph G, a nonempty set S ⊆ V and a function f0 : S → R, there is a unique function f : V → R extending f0 that is harmonic at each node of V \ S.

We call this function f the harmonic extension of f0. Note that if |S| = 1, then the harmonic extension is a constant function (so it is also harmonic at S, and this does not contradict Proposition 1.9).

The uniqueness of the harmonic extension is easy by the argument in the proof of Proposition 1.9. Suppose that f and f′ are two harmonic extensions of f0. Then g = f − f′ is harmonic on V \ S, and satisfies g(v) = 0 at each v ∈ S. If g is the identically 0 function, then f = f′ as claimed. Else, either its minimum or its maximum is different from 0. But we have seen that both the minimizers and the maximizers contain at least one pole, which is a contradiction.

To prove the existence, we describe three constructions, all of which will be useful.


(a) Let u be the (random) node where a random walk starting at vertex v hits S (we know this happens almost surely), and let f(v) = E(f0(u)). Then f is a harmonic extension of f0.

(b) Consider the graph G as an electrical network, where each edge represents a unit resistance. Keep each node u ∈ S at electric potential f0(u), and define f(v) as the potential of node v. Then f is a harmonic extension of f0. (Use Kirchhoff's Laws.)

(c) Consider the edges of the graph G as ideal rubber bands with unit Hooke constant (i.e., it takes h units of force to stretch them to length h). Let us nail down each node u ∈ S to point f0(u) on the real line, and let the graph find its equilibrium. The energy is a positive definite quadratic form of the positions of the nodes, and so there is a unique minimizing position, which is in equilibrium. The positions of the nodes define a harmonic extension of f0.

Example 1.11 Let S = {a, b} and f0(a) = 0, f0(b) = 1. Let f be the harmonic extension. Then f(v) is the probability that a random walk starting at v hits b before a.
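Construction (a) together with Example 1.11 can be checked by solving the harmonic equations directly (a sketch, not from the notes; the 4-cycle with a and b opposite is an arbitrary example):

```python
import numpy as np

# Harmonic extension of f0(a) = 0, f0(b) = 1 on the 4-cycle: f(v) is the
# probability that a random walk from v hits b before a (Example 1.11).
A = np.array([[0, 1, 0, 1],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [1, 0, 1, 0]], dtype=float)
n, a, b = 4, 0, 2
d = A.sum(axis=1)
M = np.eye(n)
rhs = np.zeros(n)
rhs[b] = 1.0                      # f(b) = 1 (and f(a) = 0)
for v in range(n):
    if v not in (a, b):
        M[v] -= A[v] / d[v]       # f(v) = average of f over N(v)
f = np.linalg.solve(M, rhs)
# by symmetry, the two midpoints hit b before a with probability 1/2
assert np.allclose(f, [0.0, 0.5, 1.0, 0.5])
```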

We can extend the notion of harmonic functions to Markov chains. We say that a function f : V → R is harmonic at node i if

∑_j pij f(j) = f(i).

    Essentially the same proof as for random walks implies that if a function on the nodes of an

    irreducible Markov chain is harmonic at every node, then the function is constant.

Lemma 1.12 Let us stretch two nodes a, b of a graph G in the rubber band model to distance 1. Let F(a, b) be the force needed for this, and let f(u) be the position of node u if a is at point 0 and b is at point 1.

(a) The effective resistance in the electrical network between a and b is 1/F(a, b).

(b) F(a, b) = ∑_{ij∈E} (f(i) − f(j))².

Proof. (a) We have F(a, b) = ∑_{i∈N(a)} f(i), since a neighbor i of a pulls a with force f(i) − f(a) = f(i). On the other hand, if we fix the potentials f(a) = 0 and f(b) = 1, then by Ohm's Law, the current through an edge ai is f(i). Hence the current through the network is ∑_{i∈N(a)} f(i) = F(a, b), and by Ohm's Law, the effective resistance is 1/F(a, b).

(b) For the rubber band model, imagine that we slowly stretch the graph until nodes a and b are at distance 1. When they are at distance t, the force pulling our hands is tF(a, b), and hence the energy we have to spend is

∫_0^1 tF(a, b) dt = F(a, b)/2.

This energy accumulates in the rubber bands. By a similar argument, the energy stored in the rubber band ij is (f(i) − f(j))²/2. By conservation of energy, we get the identity

∑_{ij∈E} (f(i) − f(j))² = F(a, b). (11) □
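Identity (11) can be confirmed numerically by computing the equilibrium positions (the harmonic extension) and comparing force with energy (a sketch, not from the notes; the 4-cycle with one diagonal is an arbitrary example):

```python
import numpy as np

# With a at 0, b at 1 and the other nodes in rubber-band equilibrium, the
# force F(a,b) = sum_{i in N(a)} f(i) should equal the total stretching
# energy sum_{ij in E} (f(i) - f(j))^2.
A = np.array([[0, 1, 0, 1],
              [1, 0, 1, 1],
              [0, 1, 0, 1],
              [1, 1, 1, 0]], dtype=float)
n, a, b = 4, 0, 2
d = A.sum(axis=1)
M, rhs = np.eye(n), np.zeros(n)
rhs[b] = 1.0                          # f(b) = 1 (and f(a) = 0)
for v in range(n):
    if v not in (a, b):
        M[v] -= A[v] / d[v]           # harmonic (equilibrium) at v
f = np.linalg.solve(M, rhs)
F = sum(A[a, i] * f[i] for i in range(n))          # force pulling node a
energy = sum(A[i, j] * (f[i] - f[j]) ** 2
             for i in range(n) for j in range(i + 1, n))
assert np.isclose(F, energy)          # identity (11)
```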

Exercise 1.13 For a finite Markov chain, its underlying graph is obtained by connecting two states i and j by an edge if either pij > 0 or pji > 0. Prove that if the underlying graph of a Markov chain is a tree, then the chain is time-reversible.

Exercise 1.14 Let G be a bipartite graph with bipartition {U, W}. Let us start a random walk from a node u ∈ U. Prove that for i ∈ U,

σ^{2t}_i → d(i)/m (t → ∞).

Exercise 1.15 Let G be the standard grid graph in the plane (defined on lattice points, where two of them are connected by an edge if their distance is 1; so G is countably infinite), and let f be a non-negative valued harmonic function on G. Prove that f is constant.

    2 Times

The return time Ru to node u is the expected number of steps of the random walk starting at u before it returns to u.

The hitting time H(u, v) (also called access time) from vertex u to vertex v is the expected number of steps of a random walk starting at u before visiting v. We set Hmax = max_{u,v} H(u, v).

The commute time comm(u, v) between vertices u and v is the expected number of steps of a random walk starting at u before visiting v and returning to u. Clearly comm(u, v) = comm(v, u) = H(u, v) + H(v, u).

The cover time C(u) from vertex u is the expected number of steps of a random walk starting at u before every vertex is visited. We set Cmax = max_u C(u).

Warning: often the hitting time is defined as a random variable, the number of steps of the random walk starting at u before visiting v (which depends on the random walk), and the number H(u, v) is called the expected hitting time or mean hitting time; similarly for the commute time etc.

    2.1 Return time

    We start with a technical lemma.


Lemma 2.1 Let v0, v1, . . . be a walk in a finite irreducible Markov chain started at a node u. Then the random variable T = min{t : vt = w} has finite expectation and variance for any node w.

Proof. Between any two nodes u and w there is a path u0 = u, u1, . . . , uk = w of length k < n such that p_{u_i u_{i+1}} > 0 for every 0 ≤ i < k. Therefore there is an ε > 0 such that, from any node, the probability that we visit w within n steps is at least ε. This holds for the next stretch of n steps etc. By the Waiting Problem (Example 1.2), the expected number of stretches of n steps to wait is at most 1/ε, so E(T) ≤ n/ε is finite. Similarly, E(T²) is finite, and hence so is Var(T) = E(T²) − E(T)². □

    This lemma implies that Ru is finite and well-defined.

    Theorem 2.2 For every finite Markov chain and every node u, Ru = 1/πu.

Proof. Before giving an exact proof, let us describe a simple heuristic why this is true. Consider a very long random walk v0, v1, . . . , vT started from the stationary distribution. Then P(vt = u) = πu, and so the expected number of visits to u is Tπu. The expected time between two consecutive visits is Ru, so the total time T is about (Tπu)Ru. So T = TπuRu, and πuRu = 1.

To make this precise, let N be a large positive integer, let ε > 0, and set t = (1 − ε)RuN. Start a random walk from the stationary distribution. Let Tk denote the time of the k-th visit to u, and let X denote the number of visits before time t.

We have E(Tk+1 − Tk) = Ru, and the differences Tk+1 − Tk are independent identically distributed random variables with finite expectation and variance by Lemma 2.1, hence

(1/N) TN = (1/N) T1 + (1/N) ∑_{k=1}^{N−1} (Tk+1 − Tk)

will be arbitrarily close to Ru if N is large enough, with probability arbitrarily close to 1. This means that with probability at least 1 − ε, TN ≥ t, and in such cases X ≤ N. Thus

E(X) ≤ P(X ≤ N)·N + P(X > N)·t ≤ N + εt.

On the other hand, we have E(X) = tπu, and so (πu − ε)t = (πu − ε)(1 − ε)RuN ≤ N. Dividing by N and letting ε → 0, we get πuRu ≤ 1. The inequality πuRu ≥ 1 follows similarly. □

    Corollary 2.3 For the random walk on a graph starting from node v, the expected number

    of steps before an edge uv is passed (in this direction) is 2m.
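Theorem 2.2 is easy to test empirically (a sketch, not from the notes; the path with 3 nodes is an arbitrary example, with π = (1/4, 1/2, 1/4)):

```python
import random

# Monte Carlo check of R_u = 1/pi_u on the path 0-1-2: the expected return
# time should be 4 at an endpoint and 2 at the middle node.
rng = random.Random(3)
nbrs = {0: [1], 1: [0, 2], 2: [1]}

def return_time(u):
    v, steps = u, 0
    while True:
        v = rng.choice(nbrs[v])   # uniform step to a neighbor
        steps += 1
        if v == u:
            return steps

trials = 100_000
mean0 = sum(return_time(0) for _ in range(trials)) / trials
mean1 = sum(return_time(1) for _ in range(trials)) / trials
assert abs(mean0 - 4.0) < 0.2   # 1/pi_0 = 4
assert abs(mean1 - 2.0) < 0.01  # 1/pi_1 = 2 (in fact every return takes 2 steps)
```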


2.2 Hitting time

There is a basic equation for hitting times:

H(i, j) = 1 + (1/d(i)) ∑_{k∈N(i)} H(k, j) if i ≠ j, and H(i, i) = 0. (12)

Note that

1 + (1/d(i)) ∑_{k∈N(i)} H(k, i) = Ri = 1/πi

is the return time to i, so in terms of the matrix H = (H(i, j))_{i,j=1}^n, we can write (12) as

(I − P)H = J − R, (13)

where J is the all-1 matrix and R is the diagonal matrix with the return times in the diagonal.

We can give a nice geometric interpretation. Consider the graph as a rubber band structure, and attach a weight of d(v) to each node v. Nail the node b to the wall and let the graph find its equilibrium. Each node v will be at a distance of H(v, b) below b.

The following two examples are easy to verify using this geometric interpretation.

Example 2.4 The hitting time on a path of length n from one endpoint to the other is n²; more generally, the hitting time from a node at distance k from an endpoint v to v is n² − (n − k)² = k(2n − k).

    Example 2.5 The hitting time between two nodes at distance k of a circuit of length n is

    k(n− k).

    Example 2.6 The hitting time on the 3-dimensional cube from one vertex to the opposite

    one is 10.
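The system (12) determines all hitting times to a fixed target, and Example 2.6 can be checked by solving it on the cube (a sketch, not part of the notes):

```python
import numpy as np
from itertools import product

# Solve the hitting-time equations (12) on the 3-dimensional cube with a
# fixed target j: H(j,j) = 0 and H(i,j) = 1 + (1/d(i)) sum_{k ~ i} H(k,j).
verts = list(product([0, 1], repeat=3))
n = len(verts)
A = np.array([[1.0 if sum(a != b for a, b in zip(u, v)) == 1 else 0.0
               for v in verts] for u in verts])   # cube adjacency matrix
d = A.sum(axis=1)
j = verts.index((1, 1, 1))
M, rhs = np.eye(n), np.ones(n)
rhs[j] = 0.0                      # H(j,j) = 0
for i in range(n):
    if i != j:
        M[i] -= A[i] / d[i]       # H(i,j) - average over neighbors = 1
H = np.linalg.solve(M, rhs)
assert np.isclose(H[verts.index((0, 0, 0))], 10.0)   # Example 2.6
assert np.isclose(H[verts.index((0, 1, 1))], 7.0)    # from an adjacent vertex
```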

Example 2.7 Take a clique of size n/2 and attach to it an endpoint of a path of length n/2. Let i be the attachment point, and j the "free" endpoint of the path. Then

H(i, j) ∼ n³/8.

    This last example is the worst for trying to hit a node as soon as possible, at least up to

    a constant.

Theorem 2.8 For the random walk on a connected graph, for any two nodes at distance r, we have H(i, j) < 2mr < n³.

Proof. For two adjacent nodes i and j, starting from i, the edge ij is passed from j to i within 2m expected steps (Corollary 2.3), and to do so the walk must visit j; so H(i, j) < 2m. More generally, let (i = v0, v1, . . . , vr = j) be a path, and let Tk be the first time when v0, . . . , vk have all been visited. By the above, E(Tk+1 − Tk) < 2m, and hence E(Tr) < 2mr < n³. □

Example 2.9 H(u, v) is not a symmetric function, even for time-reversible chains (even for random walks on undirected graphs): for the first two nodes u and v of a path on n nodes, we have H(u, v) = 1 but H(v, u) = 2n − 3. H(u, v) may differ from H(v, u) even on a regular graph (Exercise 2.24).

However, if the graph has a node-transitive automorphism group, then H(u, v) = H(v, u) for any two nodes (Corollary 2.12).

    Lemma 2.10 (Cycle Reversal Lemma) For the random walk on a connected graph, for

    any three nodes u, v and w, H(u, v) +H(v, w) +H(w, u) = H(u,w) +H(w, v) +H(v, u).

    Proof. Starting a random walk at u, walk until v is visited; then walk until w is visited;

    then walk until u is reached. Call this random sequence a uvwu-tour. The expected number

of steps in a uvwu-tour is H(u, v) + H(v, w) + H(w, u). On the other hand, we can express this number as follows. Let W = (u0, u1, . . . , uN = u0) be a closed walk. The probability that we have walked exactly this way is

P(W) = ∏_{i=0}^{N−1} 1/d(ui),

which is independent of the starting point and remains the same if we reverse the order.

    Let a(W ) denote the number of ways this closed walk arises as a uvwu-tour, i.e., the

    number of occurrences of u in W where we can start W to get a uvwu-tour (note that the

    same value would be obtained by considering v or w instead of u). We shall show that the

    number of ways the reverse closed walk W ′ = (uN , uN−1, . . . , u0 = uN ) arises as a uwvu-tour

is also a(W). Since the expected length of a uvwu-tour is ∑_W P(W) a(W) |W|, this will prove the identity in the problem. (It will also follow that a(W) is 1 or 2.)

    Call an occurrence of u in the closed walk W “forward good” if starting from u and

    following the walk until v occurs, then following it until w occurs, then following it until u

    occurs, we traverse the whole walk exactly once. Call this occurrence “backward good” if this

    holds with the orientation of W as well as the role of v and w reversed. Clearly a(W ) is the

    number of “forward good” occurrences of u, so it suffices to verify that for every closed walk

    W , the number of “forward good” occurrences of u is the same as the number of “backward

    good” occurrences. (Note that a “forward good” occurrence need not be “backward good”.)

Assume that W arises as a uvwu-tour at least once; say u0 = u, ui = v, and uj = w (0 < i < j < N), where W1 = {u1, . . . , ui−1} does not contain v, W2 = {ui+1, . . . , uj−1} does not contain w, and W3 = {uj+1, . . . , uN−1} does not contain u. Assume first that W2 does not contain u either. Then u0 is the only "forward good" occurrence of u, and the last occurrence of u in W1 is the only "backward good" occurrence.

    Second, assume that W2 contains u. Similarly, we may assume that W3 contains v and W1

    contains w. Let ut be the last occurrence of u in W2. It is easy to check that ut is “backward

    good”. So we see that if W arises as a uvwu-tour then it also arises as a uwvu-tour.

    Assume now that a(W ) > 1. Then there must be a second “forward good” element, and

    it is easy to check that this can only be the first occurrence us of u on W2; it also follows

    that all occurrences of v on W2 must come before us, and similarly, all occurrences of w on

    W3 must come before the first occurrence of v on W3, and all occurrences of u on W1 must

    come before the first occurrence of w on W1. But in this case there are exactly two “forward

good" and exactly two "backward good" occurrences of u. So a(W) = a(W′) = 2. □

    Corollary 2.11 The vertices of any graph can be ordered so that if u precedes v then

    H(u, v) ≤ H(v, u).

Proof. Fix any node u, and define an ordering (v1, . . . , vn) of the nodes so that

H(u, v1) − H(v1, u) ≥ H(u, v2) − H(v2, u) ≥ · · · ≥ H(u, vn) − H(vn, u).

    Then by Lemma 2.10,

    H(vi, vj) +H(vj , u) +H(u, vi) = H(vj , vi) +H(vi, u) +H(u, vj),

    and so for i < j,

    H(vj , vi)−H(vi, vj) = (H(u, vi)−H(vi, u))− (H(u, vj)−H(vj , u)) ≥ 0.

    Corollary 2.12 If a graph has a node-transitive automorphism group, then H(u, v) =H(v, u) for any two nodes.

We express hitting times in terms of the transition matrix. By (3), we have

I − P = ∑_{k=2}^n (1 − λk) uk vk^T.

Consider the matrix

(I − P)′ = ∑_{k=2}^n (1/(1 − λk)) uk vk^T.


This is a pseudoinverse of I − P: the matrix I − P is singular, so it does not have a proper inverse, but we have (I − P)(I − P)′(I − P) = I − P and (I − P)′(I − P)(I − P)′ = (I − P)′, and it is easy to check that (I − P)′(I − P) and (I − P)(I − P)′ are symmetric matrices. Furthermore, (I − P)′1 = (I − P)′u1 = 0 from the orthogonality relations, and so (I − P)′J = 0. We can compute (I − P)′ by making I − P nonsingular, and then inverting it:

(I − P)′ = (I − P + x·1π^T)^{−1} − (1/x)·1π^T

with an arbitrary x ≠ 0.

Let G = −(I − P)′R, where R is the diagonal matrix with the return times in the diagonal. Using that R = 2mD^{−1} and (I − P)′ = D^{−1/2}(I − P̂)′D^{1/2}, we can write this as

G = −2m D^{−1/2}(I − P̂)′D^{−1/2},

showing that G is a symmetric matrix.

    Lemma 2.13 H(i, j) = Gij −Gjj.

Proof. Recall the equation

(I − P)H = J − R. (14)

Unfortunately, the matrix I − P is singular, and so (14) does not uniquely determine H. But using the generalized inverse,

(I − P)H = (I − P)(I − P)′(I − P)H = (I − P)(I − P)′(J − R) = −(I − P)(I − P)′R,

and hence

(I − P)(H + (I − P)′R) = 0.

The nullspace of I − P consists of the multiples of 1, hence (I − P)X = 0 implies that every column of X is constant. So

H = −(I − P)′R + 1g^T = G + 1g^T (15)

with some vector g. Using that Hjj = 0, we get that gj = −Gjj, and hence H(i, j) = Gij + gj = Gij − Gjj. □

Corollary 2.14 Let λ1 = 1 > λ2 ≥ . . . ≥ λn be the eigenvalues of the transition matrix P of the random walk on graph G, and let u1, . . . , un, v1, . . . , vn be the corresponding right and left eigenvectors. Then

H(i, j) = 2m ∑_{k=2}^n (ukj vkj − uki vkj) / (d(j)(1 − λk)).

Proof. We have

Gij = −((I − P)′R)ij = −∑_{k=2}^n uki vkj Rj / (1 − λk) = −2m ∑_{k=2}^n uki vkj / (d(j)(1 − λk)).

Substituting into the formula of Lemma 2.13, the corollary follows. □

Lemma 2.15 (Random Target Lemma) For every Markov chain, the sum

∑_j πj H(i, j) = N (16)

is independent of the starting node i.

Proof. Let f(i) = ∑_j πj H(i, j); we show that f is harmonic, and hence constant, since the chain is irreducible. Indeed, for every node i,

∑_j pij f(j) = ∑_j pij ∑_k πk H(j, k) = ∑_k πk ∑_j pij H(j, k) = ∑_{k≠i} πk (H(i, k) − 1) + πi (Ri − 1) = ∑_k πk H(i, k) − 1 + πi Ri = f(i),

since ∑_k πk = 1, H(i, i) = 0 and πi Ri = 1. □

In the case of random walks on a graph, we can express N by the spectrum. Using (15) and that Rπ = 1 (since Ri πi = 1),

Hπ = −(I − P)′Rπ + 1g^Tπ = −(I − P)′1 + (g^Tπ)1 = (g^Tπ)1.

Hence

N = g^Tπ = ∑_{k=2}^n 1/(1 − λk).
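Both the Random Target Lemma and the eigenvalue formula for N can be verified numerically (a sketch, not from the notes; the triangle-with-pendant graph is an arbitrary example, and hitting times are obtained by solving (12)):

```python
import numpy as np

A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
n = 4
d = A.sum(axis=1)
m = A.sum() / 2
pi = d / (2 * m)

def hitting_to(j):
    # solve (12): H(j,j) = 0, H(i,j) = 1 + average of H(.,j) over N(i)
    M, rhs = np.eye(n), np.ones(n)
    rhs[j] = 0.0
    for i in range(n):
        if i != j:
            M[i] -= A[i] / d[i]
    return np.linalg.solve(M, rhs)

H = np.column_stack([hitting_to(j) for j in range(n)])   # H[i,j] = H(i,j)
targets = H @ pi                       # row i: sum_j pi_j H(i,j)
assert np.allclose(targets, targets[0])        # independent of i

# eigenvalues of P via the similar symmetric matrix Phat = D^{-1/2} A D^{-1/2}
Dhi = np.diag(1 / np.sqrt(d))
lam = np.linalg.eigvalsh(Dhi @ A @ Dhi)
N = sum(1 / (1 - l) for l in lam if not np.isclose(l, 1.0))
assert np.isclose(targets[0], N)       # N = sum_{k>=2} 1/(1 - lambda_k)
```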

    2.3 Commute time

Lemma 2.16 The probability that a random walk on a graph starting at u visits v before returning to u is Ru/comm(u, v).

Proof. Let T be the first time the random walk starting at u returns to u, and let S be the first time it returns to u after visiting v. Then E(T) = Ru and E(S) = comm(u, v). Clearly S ≥ T, and equality holds if and only if the random walk visits v before returning to u; let p denote the probability of this event. Since, on the complementary event, the walk is back at u without having seen v, we have

E(S − T) = p · 0 + (1 − p) · comm(u, v).

Thus

comm(u, v) = E(S) = E(T) + E(S − T) = Ru + (1 − p) comm(u, v),

and the lemma follows. □


    Lemma 2.17 If a connected graph G is considered as an electrical network (with unit resistances on the edges), then the effective resistance R(u, v) between nodes u and v is comm(u, v)/(2m).

    Proof. Let φ : V → R be the (unique) function that satisfies φ(u) = 0, φ(v) = 1, and is harmonic at the other nodes.

    Let us keep node u at potential 0 and node v at potential 1. By Ohm's Law, the current on an edge ij, in the direction from i to j, is φ(j) − φ(i). Hence the total current from u to v is Σ_{j∈N(u)} φ(j), and the effective resistance of the network is

        R(u, v) = 1 / Σ_{j∈N(u)} φ(j).                    (17)

    On the other hand, for every j, φ(j) is the probability that the walk starting at j hits v before u. Hence (1/d(u)) Σ_{j∈N(u)} φ(j) is the probability that starting from u, we visit v before returning to u. Thus by Lemma 2.16,

        (1/d(u)) Σ_{j∈N(u)} φ(j) = Ru / comm(u, v).

    Since Ru = 2m/d(u), this gives Σ_{j∈N(u)} φ(j) = 2m/comm(u, v). Together with (17), this proves the lemma. □
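    The identity comm(u, v) = 2m · R(u, v) is easy to test numerically. The code below is our own sketch (graph and helper names are arbitrary): hitting times come from linear systems, and the effective resistance from the pseudoinverse of the graph Laplacian.

```python
import numpy as np

# small connected graph: edges 01, 02, 12, 13, 23
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 1],
              [1, 1, 0, 1],
              [0, 1, 1, 0]], dtype=float)
d = A.sum(axis=1)
n = len(d)
m = A.sum() / 2
P = A / d[:, None]

def hitting(j):
    # solve (I - P_{-j}) h = 1 on V \ {j}; returns out[i] = H(i, j)
    idx = [i for i in range(n) if i != j]
    M = np.eye(n - 1) - P[np.ix_(idx, idx)]
    h = np.linalg.solve(M, np.ones(n - 1))
    out = np.zeros(n)
    out[idx] = h
    return out

def resistance(u, v):
    # effective resistance via the Moore-Penrose pseudoinverse of L = D - A
    L = np.diag(d) - A
    Li = np.linalg.pinv(L)
    return Li[u, u] + Li[v, v] - 2 * Li[u, v]

u, v = 0, 3
comm = hitting(v)[u] + hitting(u)[v]     # commute time H(u,v) + H(v,u)
assert np.isclose(comm, 2 * m * resistance(u, v))
```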

    Theorem 2.18 For any two nodes at distance r, comm(i, j) < 4mr < 2n³.

    This follows similarly to Theorem 2.8.

    2.4 Cover time

    Example 2.19 Assuming that we start from 0, the cover time of the path on n nodes will

    also be (n − 1)², since it suffices to reach the other endnode.

    Example 2.20 To determine the cover time Cn of a cycle of length n, note that it is the same as the time needed on a very long path, starting from the midpoint, to visit n nodes. We first have to reach n − 1 nodes, which takes Cn−1 steps on average. At this point, we have covered a subpath with n − 1 nodes, and we are sitting at one of its endpoints. To reach a new node means to reach one of the endnodes of a path with n + 1 nodes from a neighbor of an endnode. Clearly, this is the same as the hitting time between two consecutive nodes of a circuit of length n, which is one less than the return time, i.e., n − 1. This leads to the recurrence

        Cn = Cn−1 + (n − 1).

    Hence Cn = n(n − 1)/2.
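    The recurrence rests on the fact that the hitting time between two adjacent nodes of an n-cycle is n − 1. Here is a small exact check of that fact (our own code, solving the hitting-time equations directly):

```python
import numpy as np

def adjacent_hitting_time(n):
    # H(1, 0) on the n-cycle, from (I - P_{-0}) h = 1 on nodes 1..n-1
    P = np.zeros((n, n))
    for i in range(n):
        P[i, (i - 1) % n] = P[i, (i + 1) % n] = 0.5
    idx = list(range(1, n))                 # remove target node 0
    M = np.eye(n - 1) - P[np.ix_(idx, idx)]
    h = np.linalg.solve(M, np.ones(n - 1))
    return h[0]                             # hitting time from node 1 to 0

for n in range(3, 12):
    assert np.isclose(adjacent_hitting_time(n), n - 1)

# summing the increments reproduces the cover time n(n-1)/2, e.g. n = 10:
C = sum(j - 1 for j in range(2, 10 + 1))
assert C == 10 * 9 // 2
```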


    Theorem 2.21 Cmax < 2nm.

    Let Hmax = max_{u,v} H(u, v) and Hmin = min_{u,v} H(u, v).

    Theorem 2.22 har(n)Hmin ≤ Cmax ≤ har(n)Hmax.

    Proof. Let (σ1, . . . , σn) be a random permutation of the vertices, and let Ak be the event that σk is the last visited node of {σ1, . . . , σk}. Then

        P(Ak) = 1/k.

    Note that this is independent of the walk. Let Tk be the first time when σ1, . . . , σk have all been visited. Then

        E(Tk − Tk−1 | Ak) ≤ Hmax,

    while on the complementary event Tk = Tk−1, so

        E(Tk − Tk−1 | Āk) = 0.

    Hence

        E(Tk − Tk−1) ≤ (1/k) Hmax + ((k − 1)/k) · 0 = Hmax/k.

    Summing over all k we get the upper bound in the theorem. The lower bound follows similarly. □

    2.5 Universal traverse sequences

    Let G be a connected d-regular graph, u ∈ V(G), and assume that at each node, the ends of the edges incident with the node are labeled 1, 2, . . . , d. A traverse sequence (for this graph, starting point, and labeling) is a sequence (h1, h2, . . . , ht) ∈ {1, . . . , d}^t such that if we start a walk at v0 = u and at the ith step leave the current vertex through the edge labeled hi, then we visit every vertex. A universal traverse sequence is a sequence which is a traverse sequence for every connected d-regular graph on n vertices, every labeling of it, and every starting point.

    Theorem 2.23 For every d ≥ 2 and n ≥ 2, there exists a universal traverse sequence of length O(d²n³ log n).

    Proof. The “construction” is easy: we consider a random sequence. More exactly, let t = ⌈4d²n³ log₂ n⌉, and let H = (h1, . . . , ht) be chosen uniformly at random from {1, . . . , d}^t. For a fixed G, starting point, and labeling, the walk defined by H is just a random walk; so the probability p that H is not a traverse sequence is the same as the probability that a random walk of length t does not visit all nodes.


    By Theorem 2.21, the expected time needed to visit all nodes is at most 2nm = dn². Hence (by Markov's Inequality) the probability that after 2dn² steps we have not seen all nodes is less than 1/2. Since we may consider the next 2dn² steps as another random walk etc., the probability that we have not seen all nodes after t steps is less than 2^{−t/(2dn²)} = n^{−2nd}.

    Now the total number of d-regular graphs G on n nodes, with the ends of the edges labeled, is less than n^{dn} (less than n^d choices at each node), and so the probability that H is not a traverse sequence for one of these graphs, with some starting point, is less than n · n^{dn} · n^{−2nd} < 1. So at least one sequence of length t is a universal traverse sequence. □

    Exercise 2.24 Show by an example that H(u, v) can be different from H(v, u) even for two nodes of a regular graph.

    Exercise 2.25 Consider a random walk on a connected regular graph G.
    (a) The average of H(s, t) over all s ∈ N(t) is exactly n − 1.
    (b) The average of H(s, t) over all s ∈ V(G) \ {t} is at least n − 1.
    (c) The average of H(t, s) over all s ∈ V(G), with weights πs, is at least n − 1.

    Exercise 2.26 Prove that the mean hitting time between two antipodal vertices of the k-cube Qk is asymptotically 2^k.

    Exercise 2.27 What is the cover time of the path when starting from an internal node?

    Exercise 2.28 The mean commute time between any pair of vertices of a d-regular graph is at least n and at most 2nd/(d− λ2).

    Exercise 2.29 Let G′ denote the graph obtained from G by identifying s and t, and let T(G) denote the number of spanning trees of G. Prove that Rst = T(G′)/T(G).

    Exercise 2.30 (Rayleigh's Principle) Adding any edge to a graph G does not increase the resistance Rst.

    Exercise 2.31 Let G′ be obtained from the graph G by adding a new edge (a, b), and let s, t ∈ V(G).
    (a) Prove that the mean commute time between s and t in G′ is not larger than the mean commute time in G.
    (b) Show by an example that the similar assertion is not valid for the mean hitting time.
    (c) If a = t, then the mean hitting time from s to t is not larger in G′ than in G.

    Exercise 2.32 Find a formula for the commute time in terms of the spectrum.

    Exercise 2.33 Prove that the commute time between two nodes of a regulargraph is at least n.

    Exercise 2.34 Let ℓ(u, v) denote the probability that a random walk starting at u visits every vertex before hitting v.
    (a) Prove that if G is a circuit of length n then ℓ(u, v) = 1/(n − 1) for all u ≠ v.
    (b) If u and v are two non-adjacent vertices of a connected graph G such that {u, v} is not a cutset, then there is a neighbor w of u such that ℓ(w, v) < ℓ(u, v).
    (c) Assume that for every pair u ≠ v ∈ V(G), ℓ(u, v) = 1/(n − 1). Show that G is either a circuit or a complete graph.


    Exercise 2.35 Let N(u, v) denote the expected number of nodes a random walk from u visits before v (including u; not the number of steps!). Prove that for every u ∈ V(G) there exists a v ∈ V(G) \ {u} such that N(u, v) ≥ n/2.

    Exercise 2.36 Let b be the expected number of steps before our random walk visits more than half of the vertices. Prove that b ≤ 2Hmax.

    Exercise 2.37 For A ⊆ V, define H^A_min = min_{u,v∈A} H(u, v). Prove that Cmax ≥ har(|A|) H^A_min.

    3 Mixing

    Perhaps the most important use of random walks in practice is sampling: choosing a random

    element from a prescribed distribution over a large set.

    A general method for sampling from a probability distribution π over a large and complicated set V is the following. We define a graph G with node set V whose stationary distribution is π. Let us assume the graph is not bipartite (for example, consider the lazy walk on it). By Theorem 1.8, σt → π as t → ∞ for every starting distribution σ. This means that by simulating a random walk on G for sufficiently many steps, we get a point whose distribution is close to π. The question is, what does “close” mean, and how long do we have to walk?

    The total variation distance between two probability distributions on the same finite set V is defined by

        dvar(σ, τ) = max_{A⊆V} (σ(A) − τ(A)).

    Since σ(A) − τ(A) = τ(V \ A) − σ(V \ A), we could define the same value as the maximum of τ(A) − σ(A), or as the maximum of |σ(A) − τ(A)|. It can also be expressed as

        dvar(σ, τ) = (1/2) Σ_{i∈V} |σi − τi|.
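    The agreement of the two formulas is easy to verify by brute force on a small state space. The following check is our own illustration, not part of the notes:

```python
import numpy as np
from itertools import chain, combinations

rng = np.random.default_rng(0)
V = range(5)
sigma = rng.random(5); sigma /= sigma.sum()   # two random distributions
tau = rng.random(5); tau /= tau.sum()

# maximum of sigma(A) - tau(A) over all subsets A of V
subsets = chain.from_iterable(combinations(V, r) for r in range(6))
d_max = max(sigma[list(A)].sum() - tau[list(A)].sum() for A in subsets)

# half the L1 distance
d_l1 = 0.5 * np.abs(sigma - tau).sum()

assert np.isclose(d_max, d_l1)
```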

    The ε-mixing time is defined as the smallest t such that

    dvar(σt, π) ≤ ε

    for every starting distribution σ. It is easy to see that the worst starting distribution is

    concentrated on a single node.

    This definition makes sense only for non-bipartite graphs. To avoid this exception, we often consider the lazy walk: at each step, we flip a coin, stay put if it is Heads, and move according to the random walk if it is Tails. This can be described as the random walk on a graph where we add d(i) loops to each node i. (Recall that adding a loop increases the degree by 1 only.) Making the random walk lazy does not change the stationary distribution. It doubles all the previous “times”, since we are idle in half of the steps in expectation.


    The transition matrix P̃ of the lazy random walk can be expressed by the transition matrix P of the original walk very simply:

        P̃ = (1/2)(P + I).

    Since all eigenvalues of P are ≥ −1, all eigenvalues of P̃ are nonnegative. In other words, the symmetrized version of P̃ is positive semidefinite.

    Example 3.1 On a complete graph Kn, starting from a given node, we have

        dvar(σ1, π) = 1/n.

    More generally,

        dvar(σt, π) = 1/(n(n − 1)^{t−1})

    (exercise).

    The ε-mixing time is in general difficult to compute, even to estimate. We are going to

    discuss four different methods.

    3.1 Eigenvalues

    Theorem 3.2 Let G be a connected graph and let λ1 = 1 > λ2 ≥ · · · ≥ λn be the eigenvalues of its transition matrix. Set λ = max{λ2, |λn|} and πmin = min_i πi. Then

        |σt(S) − π(S)| ≤ √(π(S)/πmin) · λ^t.

    If we start at node i, then

        |p^t_ij − πj| ≤ √(πj/πi) · λ^t.

    Proof. Consider the spectral decomposition of P:

        P = Σ_{k=1}^n λk uk vk^T,                    (18)

    where uk and vk are the right and left eigenvectors of P, with u1 = 1 and v1 = π. So

        σt(S) = σ^T P^t 1S = Σ_{k=1}^n λk^t σ^T uk vk^T 1S = π(S) + Σ_{k=2}^n λk^t σ^T uk vk^T 1S,

    and hence

        |σt(S) − π(S)| ≤ λ^t Σ_{k=2}^n |σ^T uk| |vk^T 1S| ≤ λ^t Σ_{k=1}^n |σ^T uk| |vk^T 1S|.


    We can estimate this using Cauchy–Schwarz:

        Σ_{k=1}^n |σ^T uk| |vk^T 1S| ≤ (Σ_{k=1}^n |σ^T uk|²)^{1/2} (Σ_{k=1}^n |vk^T 1S|²)^{1/2}.

    Write uk = Π^{−1/2} wk and vk = Π^{1/2} wk, where (wk)_{k=1}^n is the orthonormal eigenbasis of the symmetrized transition matrix and Π = diag(π). Then

        Σ_{k=1}^n |σ^T uk|² = Σ_{k=1}^n |σ^T Π^{−1/2} wk|² = |Π^{−1/2} σ|² = Σ_{i=1}^n σi²/πi ≤ 1/πmin,     (19)

    and similarly

        Σ_{k=1}^n |vk^T 1S|² = |Π^{1/2} 1S|² = Σ_{i∈S} πi = π(S).

    Thus

        |σt(S) − π(S)| ≤ λ^t √(π(S)/πmin)

    as claimed. If σ is concentrated on node i, then the last step in (19) can be sharpened to Σ_k |σ^T uk|² = 1/πi, which gives the second bound. □
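    The bound for |p^t_ij − πj| can be confirmed numerically on any small non-bipartite graph. The code below is our own sketch (graph and t are arbitrary choices):

```python
import numpy as np

# connected non-bipartite graph: edges 01, 02, 03, 12, 23
A = np.array([[0, 1, 1, 1],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [1, 0, 1, 0]], dtype=float)
d = A.sum(axis=1)
P = A / d[:, None]
pi = d / d.sum()

# lambda = max(lambda_2, |lambda_n|) = second largest eigenvalue modulus
lam = np.sort(np.abs(np.linalg.eigvals(P)))[-2]

t = 8
Pt = np.linalg.matrix_power(P, t)
for i in range(4):
    for j in range(4):
        # |p^t_ij - pi_j| <= sqrt(pi_j/pi_i) * lambda^t  (small float slack)
        assert abs(Pt[i, j] - pi[j]) <= np.sqrt(pi[j] / pi[i]) * lam**t + 1e-12
```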

    Example 3.3 Consider the k-cube as above. For a ∈ {0, 1}^k, consider the vector v^a ∈ R^{Qk} defined by

        v^a_x = (−1)^{a^T x}.

    It is easy to check that this is an eigenvector of P with eigenvalue 1 − 2|a|₁/k, where |a|₁ = Σ_{i=1}^k ai. Since there are 2^k such vectors v^a, these are all the eigenvectors. So the eigenvalues of P are

        1, 1 − 2/k, 1 − 4/k, . . . , 1 − 2k/k = −1.

    So the eigenvalues of the lazy walk are

        1, 1 − 1/k, 1 − 2/k, . . . , 1 − k/k = 0.

    Hence by Theorem 3.2, we have

        dvar(σt, π) ≤ √(2^k) (1 − 1/k)^t < 2^{k/2} e^{−t/k}.

    So for t = k²/2 + Ck, we have dvar(σt, π) < (2/e)^{k/2} e^{−C}.


    3.2 Coupling

    A coupling of a Markov chain is a sequence of random pairs of nodes ((v0, u0), (v1, u1), . . . ) such that each of the sequences (v0, v1, . . . ) and (u0, u1, . . . ) is a walk of the chain.

    Lemma 3.4 (Coupling Lemma) Let (v0, v1, . . . ) be a random walk starting from distribution σ, and let (u0, u1, . . . ) be another random walk starting from the stationary distribution π. Then

        dvar(σt, π) ≤ P(vt ≠ ut).

    If we can couple the two random walks so that P(vt ≠ ut) < ε for some t, then we get a bound on the ε-mixing time.

    Proof. Clearly ut is stationary for every t ≥ 0. Hence for every S ⊆ V,

        P(vt ∈ S) − π(S) = P(vt ∈ S) − P(ut ∈ S)
            = P(vt ∈ S | ut = vt) P(ut = vt) + P(vt ∈ S | ut ≠ vt) P(ut ≠ vt)
              − P(ut ∈ S | ut = vt) P(ut = vt) − P(ut ∈ S | ut ≠ vt) P(ut ≠ vt)
            = (P(vt ∈ S | ut ≠ vt) − P(ut ∈ S | ut ≠ vt)) P(ut ≠ vt)
            ≤ P(ut ≠ vt).    □

    Example 3.5 Consider a random walk on the k-cube Qk. Since this graph is bipartite, we look at the lazy version. The vertices of the cube are sequences (x1, x2, . . . , xk), where xi ∈ {0, 1}. Let us assume that we start at vertex (0, . . . , 0). Then the lazy random walk can be described as follows: being at (x1, x2, . . . , xk), we choose a coordinate 1 ≤ i ≤ k uniformly, and flip it with probability 1/2. So a tour of length t is described by a sequence (i1, . . . , it) (is ∈ {1, . . . , k}) of coordinates and a sequence (ε1, . . . , εt) (εs ∈ {0, 1}) of coin flips. Step s consists of adding εs to x_{is} modulo 2.

    If we condition on the sequence (i1, . . . , it), then each xi with i ∈ {i1, . . . , it} is uniformly distributed over {0, 1}, and these coordinates are independent. So conditioning on {i1, . . . , it} = {1, . . . , k}, the distribution is uniform on all vertices of the cube. The probability that {i1, . . . , it} ≠ {1, . . . , k} can be estimated from the Coupon Collector's Problem: the expectation of the first t for which {i1, . . . , it} = {1, . . . , k} is k·har(k), and so the probability that in 2k·har(k) steps we have not seen all directions is at most 1/2 by Markov's Inequality. It follows that after 2Nk·har(k) steps, the probability that {i1, . . . , it} ≠ {1, . . . , k} is at most 2^{−N}. Hence the ε-mixing time is at most 2⌈log₂(1/ε)⌉ k·har(k).

    Compare this time bound with Exercise 2.26: the mean hitting time between antipodal

    vertices of Qk is approximately 2^k.

    Consider the lazy walk v0, v1, . . . on the k-cube Qk started from an arbitrary distribution σ0. We couple it with the lazy random walk u0, u1, . . . started from the stationary


    (uniform) distribution π. Recall that the random walk v0, v1, . . . is determined by two random sequences: a sequence (i1, i2, . . . ) (is ∈ {1, . . . , k}) of directions, and a sequence (ε1, ε2, . . . ) (εs ∈ {0, 1}) of go-or-wait decisions. Let us use the same sequence of directions for u0, u1, . . . , but choose (ε′1, ε′2, . . . ) as follows. At a given time t, let vt = (x1, . . . , xk) and ut = (y1, . . . , yk). If x_{it} = y_{it}, then let ε′t = εt; else, let ε′t = 1 − εt.

    After this step, the it-th coordinate of ut is equal to the it-th coordinate of vt, and they remain equal forever. So once every coordinate has been chosen, ut = vt. By the same computation as before, we see that after t = 2Ck ln k steps, we have

        dvar(σt, π) ≤ P(ut ≠ vt) ≤ 1/C.
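    The coordinate coupling described above is easy to simulate. The sketch below is our own (names, cube size, and step count are arbitrary); note that choosing ε′ so that the it-th bits agree is the same as simply copying v's new bit into u:

```python
import random

def coupled_lazy_cube_walks(k, t, seed=1):
    rng = random.Random(seed)
    v = [0] * k                                 # walk from the origin
    u = [rng.randint(0, 1) for _ in range(k)]   # walk from a uniform vertex
    for _ in range(t):
        i = rng.randrange(k)       # shared direction i_t
        eps = rng.randint(0, 1)    # go-or-wait bit for v; u uses eps' = eps
        v[i] ^= eps                # or 1 - eps so that, either way, its
        u[i] = v[i]                # i-th bit equals v's afterwards
    return u, v

# after 200 steps on the 6-cube, every direction has almost surely
# been chosen, so the two walks have coalesced
u, v = coupled_lazy_cube_walks(k=6, t=200)
assert u == v
```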

    Example 3.6 Consider the lazy walk v0, v1, . . . on the path of length n started from an arbitrary distribution σ0. We couple it with the lazy random walk u0, u1, . . . started from the stationary distribution π. A step of the random walk is generated by a coin flip interpreted as “Left” or “Right”, and another coin flip interpreted as “Go” or “Stay”. If we are at an endpoint of the path, then both “Left” and “Right” mean the same move.

    It will be convenient to assume that ut and vt start at an even distance from each other. This can be achieved by interpreting the first “Go” or “Stay” outcome differently if necessary. From here on, the two walks will “Go” and “Stay” at the same time. The coupling of the directions is also very simple: the two moves are independent until the two walks collide, and from then on, they stay together.

    Suppose that u0 is to the right of v0. Then by the time vt reaches the right end of the path, the two walks must have collided. This takes at most n² expected time. By Markov's Inequality, the probability that vt ≠ ut for t = 2n² is less than 1/2, and so after t = 2n² log₂(1/ε) steps,

        dvar(σt, π) ≤ P(ut ≠ vt) ≤ ε.

    Note that if we start from an endpoint of the path, and dvar(σt, π) < 1/4 for some t, then the walk must have had a fair chance of reaching the right half of the path. If T is the first time it reaches the midpoint (say, n is even), then E(T) = n²/4. With some work, one can deduce from this that dvar(σt, π) < 1/4 implies that t > n²/8.

    3.2.1 Random coloring of a graph

    We want to choose a uniformly random k-coloring of a graph G (such that adjacent nodes get different colors). The method we describe works when the maximum degree D of G satisfies k > 3D.

    We define a graph H whose nodes are the proper k-colorings of G, with two of them connected by an edge if they differ at a single node. The degrees of H are bounded by kn; we make it kn-regular by adding appropriately many loops at the nodes.


    While the graph H itself may be exponentially large, it is easy to generate a random walk on it. If we have a k-coloring α, we select a uniformly random node v of G and a uniformly random color i ∈ {1, . . . , k}; the pair (v, i) is the seed of the step. The new coloring α′ is defined by

        α′(u) = i if u = v, and α′(u) = α(u) otherwise,

    if this is a legal coloring; else, let α′ = α. This random walk on colorings is called “Glauber dynamics”.
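    One step of this dynamics is a few lines of code. The sketch below is our own (graph representation and names are illustrative); the invariant is that the coloring stays proper at every step:

```python
import random

def glauber_step(coloring, adj, k, rng):
    v = rng.randrange(len(coloring))        # uniform random node (the seed)
    i = rng.randrange(k)                    # uniform random color
    if all(coloring[w] != i for w in adj[v]):
        coloring = coloring[:]
        coloring[v] = i                     # recolor v if still proper
    return coloring                         # otherwise stay put

# 4-cycle (maximum degree D = 2), k = 7 > 3D colors
adj = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [0, 2]}
rng = random.Random(42)
col = [0, 1, 0, 1]                          # a fixed proper starting coloring
for _ in range(1000):
    col = glauber_step(col, adj, 7, rng)

# the walk never leaves the set of proper colorings
assert all(col[u] != col[w] for u in adj for w in adj[u])
```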

    Theorem 3.7 If k > 3D, then starting from any k-coloring α, we have

        dvar(αt, π) < n e^{−t/(kn)}.

    So if t = kn ln(n/ε), then dvar(αt, π) < ε.

    Proof. We use the Coupling Lemma 3.4. Starting from two colorings α0 and β0 (where β0 is uniformly random), we construct two random walks α0, α1, . . . and β0, β1, . . . by using the same random seed in both. We want to show that

        P(αt ≠ βt) ≤ n e^{−t/(kn)}.                    (20)

    Let Ut = {v ∈ V : αt(v) ≠ βt(v)}, Wt = V \ Ut and Xt = |Ut|. For v ∈ V, let a(v) denote the number of edges joining v to the other class of the partition {Ut, Wt}. We claim that

        E(Xt+1 | Xt) ≤ (1 − 1/(kn)) Xt.                (21)

    Indeed, let (v, i) be the seed at step t, and let us fix v. If v ∈ Ut, then Xt − 1 ≤ Xt+1 ≤ Xt, and Xt+1 = Xt − 1 if color i does not occur among the neighbors of v in either coloring αt or βt. The neighbors of v in Wt have the same color in both colorings, so at most 2d(v) − a(v) colors are excluded, and hence

        P(Xt+1 = Xt − 1 | v) ≥ (k − 2d(v) + a(v))/k.   (22)

    If v ∈ Wt, then Xt ≤ Xt+1 ≤ Xt + 1, and we lose only if color i occurs among the neighbors of v in exactly one of the colorings αt and βt. There are at most 2a(v) such colors, and so

        P(Xt+1 = Xt + 1 | v) ≤ 2a(v)/k.                (23)

    Thus

        E(Xt+1 | Xt) ≤ Xt − Σ_{i∈Ut} (k − 2d(i) + a(i))/(kn) + Σ_{i∈Wt} 2a(i)/(kn).


    Using that

        Σ_{i∈Ut} a(i) = Σ_{i∈Wt} a(i),

    we get

        E(Xt+1 | Xt) ≤ Xt − Σ_{i∈Ut} (k − 2d(i) − a(i))/(kn) ≤ Xt − Σ_{i∈Ut} (k − 3d(i))/(kn).

    Since k > 3D, the last sum is at least Xt/(kn), and so (21) follows.

    From (21) we get by induction

        E(Xt) ≤ (1 − 1/(kn))^t n,

    and hence by Markov's Inequality,

        P(αt ≠ βt) = P(Xt ≥ 1) ≤ (1 − 1/(kn))^t n < n e^{−t/(kn)}.    □

    3.3 Conductance

    The conductance of a Markov chain is defined by

        Φ = min_A  (Σ_{i∈A} Σ_{j∈V\A} πi pij) / (π(A) π(V \ A)),

    where the minimum is extended over all non-empty proper subsets A of V. The numerator is the frequency with which a very long random walk steps from A to V \ A. The denominator is the frequency with which a very long sequence of independent random nodes from the stationary distribution steps from A to V \ A. So the ratio measures how strongly non-independent consecutive nodes of the random walk are.

    Let (say) π(A) ≤ 1/2; then π(V \ A) ≥ 1/2 and

        Σ_{i∈A} Σ_{j∈V\A} πi pij ≤ Σ_{i∈A} Σ_{j∈V} πi pij = Σ_{i∈A} πi = π(A),

    and hence Φ ≤ 2.

    In the case of a random walk on a graph, we have πi pij = 1/(2m) for every edge ij. Let e(A, V \ A) denote the number of edges connecting A to V \ A; then

        Φ = min_A  e(A, V \ A) / (2m π(A) π(V \ A)).
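    For a small graph, the conductance can be computed by brute force directly from this formula. The function below is our own illustration (exponential in n, so only for tiny examples):

```python
import numpy as np
from itertools import combinations

def conductance(A):
    # brute-force conductance of the random walk on the graph with
    # adjacency matrix A, via Phi = min_A e(A, V\A) / (2m pi(A) pi(V\A))
    d = A.sum(axis=1)
    n = len(d)
    m = A.sum() / 2
    pi = d / (2 * m)
    best = float('inf')
    nodes = range(n)
    for r in range(1, n):                     # nonempty proper subsets
        for S in combinations(nodes, r):
            S = set(S)
            cut = sum(A[i, j] for i in S for j in nodes if j not in S)
            piS = sum(pi[i] for i in S)
            best = min(best, cut / (2 * m * piS * (1 - piS)))
    return best

# complete graph K4: every cut gives the same ratio 4/3
A = np.ones((4, 4)) - np.eye(4)
phi = conductance(A)
assert np.isclose(phi, 4 / 3)
```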

    We can use the conductance to bound the eigenvalue gap and through this, the ε-mixing

    time.


    Theorem 3.8

        Φ²/16 ≤ 1 − λ2 ≤ Φ.

    We need a couple of lemmas. The following formula for the eigenvalue gap in the case of a random walk on a graph follows from more general results in linear algebra, but we describe a direct proof.

    Lemma 3.9

        1 − λ2 = (1/(2m)) min { Σ_{ij∈E(G)} (xi − xj)² : Σ_i πi xi = 0, Σ_i πi xi² = 1 }.

    Proof. Let y be the unit eigenvector of the symmetrized transition matrix P̂ belonging to the eigenvalue λ2, and define xi = yi/√πi. The vector y is orthogonal to the eigenvector (√πi : i ∈ V) belonging to the eigenvalue 1, and hence

        Σ_i πi xi = Σ_i √πi yi = 0

    and

        Σ_i πi xi² = Σ_i yi² = 1.

    Furthermore,

        (1/(2m)) Σ_{ij∈E(G)} (xi − xj)² = (1/(2m)) Σ_{i∈V} d(i) xi² − (1/m) Σ_{ij∈E(G)} xi xj = Σ_{i∈V} πi xi² − y^T P̂ y = 1 − λ2.

    It follows by a similar computation that this choice of x minimizes the right hand side. □

    The second lemma can be considered as a linearized version of the theorem.

    Lemma 3.10 Let G be a graph with conductance Φ. Let y ∈ R^V, and assume that π({i : yi > 0}) ≤ 1/2, π({i : yi < 0}) ≤ 1/2, and Σ_i π(i)|yi| = 1. Then

        (1/m) Σ_{ij∈E} |yi − yj| ≥ Φ.

    Proof. Label the nodes by 1, . . . , n so that

        y1 ≤ · · · ≤ yt < 0 = yt+1 = · · · = ys < ys+1 ≤ · · · ≤ yn.

    Set Si = {1, . . . , i}, and let ∇(Si) denote the set of edges connecting Si to V \ Si. Substituting yj − yi = (yi+1 − yi) + · · · + (yj − yj−1), we get

        Σ_{ij∈E} |yi − yj| = Σ_{i=1}^{n−1} |∇(Si)| (yi+1 − yi) ≥ 2mΦ Σ_{i=1}^{n−1} (yi+1 − yi) π(Si) π(V \ Si).

    Using that π(Si) ≤ 1/2 for i ≤ t, that π(Si) ≥ 1/2 for i ≥ s + 1, and that yi+1 − yi = 0 for t < i < s, we obtain

        Σ_{ij∈E} |yi − yj| ≥ mΦ Σ_{i=1}^{t} (yi+1 − yi) π(Si) + mΦ Σ_{i=t+1}^{n−1} (yi+1 − yi) π(V \ Si)
                          = mΦ Σ_i π(i)|yi| = mΦ.    □

    Proof of Theorem 3.8. We prove the upper bound first. By Lemma 3.9, it suffices to exhibit a vector x ∈ R^V such that

        Σ_i πi xi = 0,   Σ_i πi xi² = 1                 (24)

    and

        (1/(2m)) Σ_{ij∈E(G)} (xi − xj)² = Φ.            (25)

    Let S be the minimizer in the definition of the conductance, and consider a vector of the type

        xi = a if i ∈ S,   xi = b if i ∈ V \ S.

    Then the conditions are

        π(S) a + π(V \ S) b = 0,   π(S) a² + π(V \ S) b² = 1.

    Solving these equations for a and b, we get

        a = √(π(V \ S)/π(S)),   b = −√(π(S)/π(V \ S)),

    and then straightforward substitution shows that (25) is satisfied as well.

    To prove the lower bound, we again invoke Lemma 3.9: we prove that for every vector x ∈ R^V satisfying (24), we have

        Σ_{ij∈E(G)} (xi − xj)² ≥ mΦ²/8.                 (26)

    Let x be any vector satisfying (24). We may assume that x1 ≥ x2 ≥ · · · ≥ xn. Let k (1 ≤ k ≤ n) be the index for which π({1, . . . , k − 1}) ≤ 1/2 and π({k + 1, . . . , n}) < 1/2. Setting zi = max{0, xi − xk} and choosing the sign of x appropriately, we may assume that

        Σ_i π(i) zi² ≥ (1/2) Σ_i π(i)(xi − xk)² = (1/2) Σ_i π(i) xi² − xk Σ_i π(i) xi + (1/2) xk² = 1/2 + xk²/2 ≥ 1/2.


    Now Lemma 3.10 can be applied to the numbers yi = zi²/Σ_i π(i)zi², and we obtain that

        Σ_{ij∈E} |zi² − zj²| ≥ mΦ Σ_i π(i) zi².

    On the other hand, using the Cauchy–Schwarz inequality,

        Σ_{ij∈E} |zi² − zj²| ≤ (Σ_{ij∈E} (zi − zj)²)^{1/2} (Σ_{ij∈E} (zi + zj)²)^{1/2}.

    It is easy to see that

        Σ_{ij∈E} (zi − zj)² ≤ Σ_{ij∈E} (xi − xj)²,

    while the second factor can be estimated as follows:

        Σ_{ij∈E} (zi + zj)² ≤ 2 Σ_{ij∈E} (zi² + zj²) = 2 Σ_{i∈V} d(i) zi² = 4m Σ_i π(i) zi².

    Combining these inequalities, we obtain

        Σ_{ij∈E} (xi − xj)² ≥ Σ_{ij∈E} (zi − zj)² ≥ (Σ_{ij∈E} |zi² − zj²|)² / Σ_{ij∈E} (zi + zj)²
            ≥ Φ²m² (Σ_i π(i)zi²)² / (4m Σ_i π(i)zi²) = (Φ²m/4) Σ_i π(i) zi² ≥ Φ²m/8.

    Dividing by 2m, the theorem follows. □

    Corollary 3.11 For any starting distribution σ and any t ≥ 0,

        dvar(σt, π) ≤ (1/√πmin) (1 − Φ²/16)^t.

    To estimate the conductance, the following lemma is often very useful.

    Lemma 3.12 Let F be a multiset of paths in G, and suppose that for every pair i ≠ j of nodes there are πi πj N paths in the family connecting them, for some N ≥ 1. Suppose that each edge of G belongs to at most K of these paths. Then Φ ≥ N/(2mK).

    Proof. Let A ⊂ V, A ≠ ∅. Let FA denote the subfamily of paths in F with one endpoint in A and one endpoint in V \ A. Then

        |FA| = Σ_{i∈A} Σ_{j∈V\A} πi πj N = π(A) π(V \ A) N.

    On the other hand, every path in FA contains at least one edge connecting A to V \ A. So if e(A, V \ A) denotes the number of such edges, then

        |FA| ≤ K e(A, V \ A).

    Hence

        e(A, V \ A)/(2m π(A) π(V \ A)) ≥ |FA|/(2mK π(A) π(V \ A)) = N/(2mK),

    and minimizing over A gives Φ ≥ N/(2mK). □

    Corollary 3.13 Let G be a graph with a node- and edge-transitive automorphism group and diameter d ≥ 1. Then Φ > 1/d.

    Proof. Let us select a shortest path Pij between every pair of nodes {i, j}. Every node has πi = 1/n. Let F consist of these paths and their images under all automorphisms. So |F| = g·(n choose 2), where g is the number of automorphisms, and every pair of nodes is connected by g paths; thus we may apply Lemma 3.12 with N = gn² (so that πi πj N = g). The total length of these paths is at most dg·(n choose 2). By symmetry, every edge is covered by the same number of paths, which is at most K = dg·(n choose 2)/m. Hence

        Φ ≥ N/(2mK) = gn² / (2dg·(n choose 2)) = n/(d(n − 1)) > 1/d.    □

    Example 3.14 For the k-cube Qk, the last corollary gives that Φ > 1/k, and so the eigenvalue gap is between 1/(16k²) and 1/k. We know that the eigenvalue gap is in fact 1/k.

    Exercise 3.15 Prove that dvar(σ, τ) satisfies the triangle inequality.

    Exercise 3.16 Prove that dvar(σ, τ) is convex in both arguments.

    Exercise 3.17 Prove that the ε-mixing time for the random walk on a k × k grid is at least ck² and at most Ck², where c and C are constants depending on ε only.

    Exercise 3.18 A coupling of two random walks u0, u1, . . . and v0, v1, . . . is called Markovian if the sequence of pairs (ui, vi) forms a Markov chain, i.e., given ui and vi, the distribution of the pair (ui+1, vi+1) does not depend on the “prehistory”. (All couplings we constructed in the lecture had this property.) Let T be the (random) first time when uT = vT, and let s = E(T). Prove that Var(T) ≤ 8s².

    Exercise 3.19 Let us start a lazy random walk on a graph from node v, and let σt denote the distribution after t steps. Prove that σt(v) is monotone decreasing as a function of t.


  • 4 Stopping rules

    A stopping rule Γ is a map V* → [0, 1], where Γ(v0, . . . , vt) is interpreted as the probability of continuing, given that (v0, . . . , vt) is the walk observed so far. Each such stop-or-go decision is made independently. (It suffices to define Γ whenever (v0, . . . , vt) has positive probability and Γ(v0, . . . , vr) > 0 for 0 ≤ r < t.)

    We usually assume that the walk is stopped by Γ after a finite number of steps with probability 1. In this case, Γ can be regarded as a random variable with values in {0, 1, . . . }, so that we stop at vΓ. The condition on Γ is that it is independent of the later part of the walk, i.e., of vΓ+1, vΓ+2, . . . In this disguise, stopping rules are usually called stopping times.

    The condition EΓ

    We start with the following basic identity.

    Lemma 4.2 The exit frequencies of any stopping rule Γ from distribution σ to τ satisfy

        Σ_i pij xi(Γ) − xj(Γ) = τj − σj       (j ∈ V).

    Proof. This follows by counting the number of times the node j is entered and exited. □

    Introducing the vector x(Γ) = (xi(Γ) : i ∈ V), we can write this identity as

        (P^T − I) x = τ − σ.                  (27)

    The exit frequencies are almost determined by the starting and ending distributions:

    Lemma 4.3 For any distributions σ and τ, for any two stopping rules Γ and Γ′ from σ to τ, and for any i ∈ V,

        xi(Γ) − xi(Γ′) = πi (EΓ − EΓ′).

    Proof. If x′ is the vector of exit frequencies of Γ′, then by (27) we have (P^T − I)(x − x′) = 0. Since the null space of P^T − I consists of the multiples of π, it follows that x − x′ = Kπ for some real number K. Summing all entries, we see that K = EΓ − EΓ′. □

    If, in particular, σ = τ and Γ′ is the rule “don't move”, then we get that for any stopping rule Γ with the same starting and ending distribution,

        xi(Γ) = πi EΓ.                        (28)

    As an application of this formula, start a walk from a node i, and use the stopping rule Γ: “stop when you return to i”. The expected length is the return time: EΓ = Ri. Equation (28) gives that xi(Γ) = πi Ri. But clearly xi(Γ) = 1 (we exit the starting node exactly once), and hence Ri = 1/πi. We also get that the expected number of times node j is visited before returning to i is xj(Γ) = πj Ri = πj/πi.
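    Both consequences of (28) can be checked by linear algebra on a small example. The sketch below is our own (graph and variable names are arbitrary): killing the walk at i gives the expected-visits matrix (I − Q)^{-1}, from which the visit counts πj/πi and the return time 1/πi follow.

```python
import numpy as np

# small connected graph: edges 01, 02, 12, 13, 23
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 1],
              [1, 1, 0, 1],
              [0, 1, 1, 0]], dtype=float)
d = A.sum(axis=1)
P = A / d[:, None]
pi = d / d.sum()

i = 0
idx = [k for k in range(len(d)) if k != i]
Q = P[np.ix_(idx, idx)]                      # walk killed on hitting i
N = np.linalg.inv(np.eye(len(idx)) - Q)      # expected-visits matrix
visits = P[i, idx] @ N                       # visits to j before return to i

# expected visits to j before returning to i equal pi_j / pi_i ...
assert np.allclose(visits, pi[idx] / pi[i])
# ... and the expected return time is 1/pi_i
assert np.isclose(1 + visits.sum(), 1 / pi[i])
```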

    Lemma 4.4 The exit frequencies for the naive stopping rule from state i to state j are given by

        xk(i, j) = πk (H(i, j) + H(j, k) − H(i, k)).

    Proof. Clearly xj(i, j) = 0 for every i and j.

    Starting from node i, consider the stopping rule “stop when you reach i after having seen j”. By (28),

        xk(i, j) + xk(j, i) = πk (H(i, j) + H(j, i)).

    In particular,

        xk(k, i) = πk (H(i, k) + H(k, i)) − xk(i, k) = πk (H(i, k) + H(k, i)).

    By a similar argument,

        xk(i, j) + xk(j, k) + xk(k, i) = πk (H(i, j) + H(j, k) + H(k, i)),

    and so

        xk(i, j) = πk (H(i, j) + H(j, k) + H(k, i)) − xk(j, k) − xk(k, i)
                 = πk (H(i, j) + H(j, k) + H(k, i)) − πk (H(i, k) + H(k, i))
                 = πk (H(i, j) + H(j, k) − H(i, k)).    □

    It follows easily that the exit frequencies for the naive stopping rule Ωσ,τ from σ to τ are given by

        xk(Ωσ,τ) = πk Σ_{i,j} σi τj (H(i, j) + H(j, k) − H(i, k)).       (29)

    A state j for which xj(Γ) = 0 is called a halting state for the rule Γ.

    Theorem 4.5 A stopping rule is optimal for starting distribution σ and target distribution

    τ if and only if it has a halting state.

    Proof. The “if” direction is easy. Let Γ be a rule from σ to τ that has a halting state j, and let Γ′ be any other rule from σ to τ. By Lemma 4.3,

        xi(Γ′) − xi(Γ) = πi (EΓ′ − EΓ).

    In particular,

        πj (EΓ′ − EΓ) = xj(Γ′) ≥ 0,

    showing that EΓ′ ≥ EΓ. Since this holds for every Γ′, the rule Γ is optimal.

    To prove the converse, we need to construct a stopping rule from σ to τ with a halting state. Starting the chain from distribution σ, let αU be the distribution of the first node of U that the walk visits, for every nonempty subset U of V. For example, αV = σ. We consider αU as a probability distribution on V, even though it is concentrated on U.

    We construct a decomposition

        τ = Σ_{i=1}^k γi αSi,                    (30)


    where S1 ⊃ S2 ⊃ · · · ⊃ Sk are nonempty subsets of V, and γi > 0.

    Let S1 be the support of τ, and γ1 = max{c : c αS1 ≤ τ}. Then the support S2 of τ − γ1 αS1 is a proper subset of S1. Assuming that S1 ⊃ S2 ⊃ · · · ⊃ Sr and γ1, . . . , γr have been chosen, we define Sr+1 as the support of τ − γ1 αS1 − · · · − γr αSr. If Sr+1 = ∅, we stop. Else, let

        γr+1 = max{c : c αSr+1 ≤ τ − γ1 αS1 − · · · − γr αSr}.

    It is easy to see that this gives a decomposition (30). Summing over V, we get that Σ_i γi = 1, so the coefficients γi give a probability distribution on {1, . . . , k}.

    This leads to a simple stopping rule: we choose i randomly from the distribution γ, and walk until we hit Si. The probability that we stop at node v is

        Σ_{i=1}^k γi (αSi)v = τv.

    Furthermore, any node v ∈ Sk is a halting state, since v ∈ Si for every choice of i. □

Remark 4.6 The stopping rule in the proof can be called the “chain rule”, referring to the chain of subsets. A rather neat way to think of this rule is to assign a “price” a_v = ∑{γ_i : S_i ∋ v} to each node v. The rule is then implemented by choosing a random “budget” b uniformly from [0, 1] and walking until a state j with a_j ≤ b (an item that we can buy) is reached.
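The greedy construction in the proof is easy to carry out numerically on a small chain. Below is a sketch (assuming numpy; the path graph and the endpoint start are arbitrary illustrative choices): the first-hit distributions α_U are obtained by solving the absorbing-chain linear system, and then the γ_i and S_i are peeled off greedily as in the proof. On the path started at an endpoint, the final set S_k turns out to be the opposite endpoint, matching Example 4.7.

```python
import numpy as np

def path_walk(n):
    # Simple random walk on a path with n nodes (uniform step to a neighbor).
    P = np.zeros((n, n))
    for v in range(n):
        nbrs = [w for w in (v - 1, v + 1) if 0 <= w < n]
        for w in nbrs:
            P[v, w] = 1.0 / len(nbrs)
    return P

def first_hit_distribution(P, sigma, U):
    # alpha_U(u): probability that the first node of U visited is u,
    # when the chain starts from distribution sigma.
    n = P.shape[0]
    T = [v for v in range(n) if v not in U]   # states outside U
    H = np.zeros((n, len(U)))
    for k, u in enumerate(U):
        H[u, k] = 1.0                         # starting inside U, we are done
    if T:
        A = np.eye(len(T)) - P[np.ix_(T, T)]
        H[np.ix_(T, list(range(len(U))))] = np.linalg.solve(A, P[np.ix_(T, U)])
    alpha = np.zeros(n)
    alpha[U] = sigma @ H
    return alpha

def chain_rule(P, sigma, tau, tol=1e-12):
    # Greedy decomposition tau = sum_i gamma_i * alpha_{S_i} from the proof.
    rest, pieces = tau.copy(), []
    while rest.sum() > tol:
        S = [v for v in range(len(tau)) if rest[v] > tol]   # current support
        alpha = first_hit_distribution(P, sigma, S)
        mask = alpha > tol
        gamma = np.min(rest[mask] / alpha[mask])   # max c with c*alpha <= rest
        pieces.append((gamma, S, alpha))
        rest = rest - gamma * alpha
    return pieces

n = 5
P = path_walk(n)
sigma = np.zeros(n); sigma[0] = 1.0                  # start at one endpoint
deg = np.array([1] + [2] * (n - 2) + [1], float)
tau = deg / deg.sum()                                # stationary distribution
pieces = chain_rule(P, sigma, tau)
print("sum of gammas:", sum(g for g, _, _ in pieces))
print("halting set S_k:", pieces[-1][1])
```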

    Example 4.7 If we start from an endpoint on a path, then the naive rule for reaching the

    stationary distribution is optimal. Indeed, the other endpoint is a halting state.

Example 4.8 The analysis in Example 3.5 suggests the following stopping rule Γ for the lazy random walk on the k-cube Q_k: Starting from a vertex u, we stop when (with the notation there) {i_1, . . . , i_t} = {1, . . . , k}. We have seen that the distribution of v_Γ is uniform on all vertices of the cube, so we have a stopping rule from u to π. If we ever reach the vertex u′ of the cube opposite to u, then we have flipped every coordinate, so we must stop. This means that u′ is a halting state for this rule, and so it is optimal. It follows that H(u, π) = k·har(k).

It follows from Lemma 4.3 that the exit frequencies of any optimal stopping rule from σ to τ are the same. We denote them by x_i(σ, τ). We denote by H(σ, τ) the expected length of any optimal stopping rule. If σ is concentrated on a node i and τ is concentrated on a node j, then H(σ, τ) = H(i, j).

Theorem 4.9 The expected number of steps of an optimal stopping rule is given by

    H(σ, τ) = max_j (H(σ, j) − H(τ, j)),

and the exit frequencies are given by

    x_k(σ, τ) = π_k (H(τ, k) − H(σ, k) + H(σ, τ)).


The proof will show that the maximum is achieved precisely when j is a halting state of some optimal rule which stops at τ when started at σ.

Proof. Recall that the exit frequencies of the naive stopping rule are given by

    x_k(Ω_{σ,τ}) = π_k ∑_{i,j} σ_i τ_j (H(i, j) + H(j, k) − H(i, k)).

To get the exit frequencies of an optimal stopping rule we have to subtract a multiple of π to make the minimal exit frequency equal to zero. This means that

    x_k(σ, τ) = π_k (∑_{i,j} σ_i τ_j (H(i, j) + H(j, k) − H(i, k)) − min_m ∑_{i,j} σ_i τ_j (H(i, j) + H(j, m) − H(i, m))).

After cancellation, we get

    x_k(σ, τ) = π_k ((H(τ, k) − H(σ, k)) + max_m (H(σ, m) − H(τ, m))).

Summing over k,

    H(σ, τ) = ∑_k π_k H(τ, k) − ∑_k π_k H(σ, k) + max_m (H(σ, m) − H(τ, m)).

The first two terms cancel by the Random Target Lemma. This proves the first formula. Using it, we get

    x_k(σ, τ) = π_k ((H(τ, k) − H(σ, k)) + H(σ, τ))

as claimed. □
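Theorem 4.9 can be sanity-checked numerically. The sketch below (numpy; the small test graph and the choice σ = e_0, τ = π are arbitrary illustrations) computes hitting times by solving the usual linear system, forms x_k(σ, τ) from the theorem, and verifies that the exit frequencies are nonnegative with a zero entry (a halting state) and satisfy the conservation equation ∑_i p_{ij} x_i − x_j = τ_j − σ_j that appears in Exercise 4.14.

```python
import numpy as np

def hitting_times(P):
    # H[i, j]: expected number of steps to reach j from i, from the system
    # H[i, j] = 1 + sum_w P[i, w] H[w, j] for i != j, and H[j, j] = 0.
    n = P.shape[0]
    H = np.zeros((n, n))
    for j in range(n):
        idx = [i for i in range(n) if i != j]
        A = np.eye(n - 1) - P[np.ix_(idx, idx)]
        H[idx, j] = np.linalg.solve(A, np.ones(n - 1))
    return H

# Random walk on a 4-cycle with a chord (an arbitrary small test graph).
n = 4
adj = np.zeros((n, n))
for u, v in [(0, 1), (1, 2), (2, 3), (3, 0), (0, 2)]:
    adj[u, v] = adj[v, u] = 1
deg = adj.sum(axis=1)
P = adj / deg[:, None]
pi = deg / deg.sum()

H = hitting_times(P)
sigma = np.array([1.0, 0, 0, 0])      # start concentrated on node 0
tau = pi.copy()                       # target: the stationary distribution

Hst = sigma @ H                       # H(sigma, j) for every j
Htt = tau @ H                         # H(tau, j) for every j
Hsigtau = np.max(Hst - Htt)           # first formula of Theorem 4.9
x = pi * (Htt - Hst + Hsigtau)        # optimal exit frequencies

print("H(sigma,tau) =", Hsigtau)
print("exit frequencies:", x, "(min is 0: a halting state)")
print("conservation residual:", np.abs(P.T @ x - x - (tau - sigma)).max())
```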

    4.2 Mixing and ε-mixing

    In practice, it could be difficult to follow a complicated stopping rule, but we can follow a

    simple rule and not lose much.

Theorem 4.10 Let Y be chosen uniformly from {0, . . . , t − 1}. Then for any starting distribution σ and integer t ≥ 1,

    d_var(σ^Y, π) ≤ (1/t) H(σ, π).

Proof. Let Ψ be an optimal stopping rule from σ to π. Consider the following rule: follow Ψ until it stops at v_Ψ, then generate Z ∈ {0, . . . , t − 1} uniformly (and independently from the previous walk), and walk Z more steps. Since Ψ stops with a node from the stationary distribution, σ^{Ψ+t} is also stationary for every t ≥ 0 and hence so is σ^{Ψ+Z}.


On the other hand, let Y = Ψ + Z (mod t); then Y is uniformly distributed over {0, . . . , t − 1}, and so

    σ^Y_i = P(v_Y = i) ≥ P(v_{Ψ+Z} = i, Ψ + Z ≤ t − 1)

    ≥ P(v_{Ψ+Z} = i) − P(v_{Ψ+Z} = i, Ψ + Z ≥ t)

    = π_i − P(v_{Ψ+Z} = i, Ψ + Z ≥ t).

Hence for every set A of nodes,

    σ^Y(A) ≥ π(A) − P(v_{Ψ+Z} ∈ A, Ψ + Z ≥ t) ≥ π(A) − P(Ψ + Z ≥ t).

For any fixed value of Ψ, the probability that Ψ + Z ≥ t is at most Ψ/t, and hence

    P(Ψ + Z ≥ t) ≤ E(Ψ/t) = (1/t) H(σ, π).

So π(A) − σ^Y(A) ≤ H(σ, π)/t, proving the theorem. □
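The bound of Theorem 4.10 can be checked exactly on a small chain, since σ^Y is just the average of the first t step distributions. A sketch (numpy; the lazy walk on a 6-node path, started at an endpoint, is an arbitrary test case), with H(σ, π) computed via the first formula of Theorem 4.9:

```python
import numpy as np

def hitting_times(P):
    # Standard linear system for expected hitting times H[i, j].
    n = P.shape[0]
    H = np.zeros((n, n))
    for j in range(n):
        idx = [i for i in range(n) if i != j]
        A = np.eye(n - 1) - P[np.ix_(idx, idx)]
        H[idx, j] = np.linalg.solve(A, np.ones(n - 1))
    return H

# Lazy random walk on a path with 6 nodes.
n = 6
P = np.zeros((n, n))
for v in range(n):
    nbrs = [w for w in (v - 1, v + 1) if 0 <= w < n]
    for w in nbrs:
        P[v, w] = 0.5 / len(nbrs)
    P[v, v] = 0.5
pi = np.array([1] + [2] * (n - 2) + [1], float)
pi /= pi.sum()                        # stationary (proportional to degrees)

H = hitting_times(P)
sigma = np.zeros(n); sigma[0] = 1.0   # start at an endpoint
Hsigpi = np.max(sigma @ H - pi @ H)   # H(sigma, pi) via Theorem 4.9

results = []
for t in (10, 50, 200):
    dist, avg = sigma.copy(), np.zeros(n)
    for _ in range(t):                # sigma^Y: average of first t step dists
        avg += dist / t
        dist = dist @ P
    dvar = 0.5 * np.abs(avg - pi).sum()
    results.append((t, dvar, Hsigpi / t))
    print(t, dvar, "<=", Hsigpi / t)
```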

Exercise 4.11 Prove that H(σ, τ) satisfies the triangle inequality: H(σ, ρ) ≤ H(σ, τ) + H(τ, ρ).

Exercise 4.12 Prove that H(σ, τ) is convex in both arguments.

Exercise 4.13 Let α, β, γ, δ be four distributions such that α − β = c(γ − δ). Then H(α, β) = cH(γ, δ).

Exercise 4.14 Consider the following “local” stopping rule Λ, given distributions σ and τ on V : Compute numbers x_i (i ∈ V ) such that min_i x_i = 0, and ∑_i p_{ij} x_i − x_j = τ_j − σ_j for all j ∈ V. If we are at node i, we stop with probability τ_i/(x_i + τ_i) (if x_i = τ_i = 0, the stopping probability can be set to 0).

(a) Prove that there exist numbers x_i satisfying the above conditions.

(b) Prove that the rule stops with probability 1.

(c) Prove that σ^Λ = τ.

(d) Prove that Λ is an optimal rule from σ to τ.

Exercise 4.15 Prove that for the random walk on a graph,

    H(i, π) = max_j H(j, i) − H(π, i).

Exercise 4.16 Let Φ : σ → τ and Ψ : σ → ρ be two stopping rules such that Φ stops not later than Ψ for any walk starting from σ. Prove that

    H(τ, ρ) ≤ EΨ − EΦ.

Exercise 4.17 Call a node s of a graph G pessimal if it maximizes H(s, π). Let i be a pessimal node and let j be a halting node from i to π. Then j is also pessimal, and i is a halting state for H(j, π). In particular, every graph with at least two nodes has at least two pessimal nodes.

Exercise 4.18 Assume that there is a stopping rule from σ to τ that never makes more than t steps (t ≥ 0).

(a) Prove that H(σ, τ) + H(τ, σ^t) ≤ t.

(b) Choose a random integer Y uniformly from the interval [t, 2t − 1], and walk for Y steps, starting from a random node from σ. Then the probability of being at node i is at most 2π_i.


5 Applications

    5.1 Volume computation

    Computing the volume of high-dimensional convex bodies, and related tasks like numerical

    integration in high dimension, have profound applications from geometry to statistics to

    theoretical physics. The volume of certain special convex bodies can be expressed explicitly.

We will need that the volume of a ball B ⊆ R^n with radius 1 is

    vol(B) = π^k / k!    if n = 2k,

    vol(B) = 2^{2k+1} k! π^k / (2k + 1)!    if n = 2k + 1,

and using Stirling’s Formula,

    vol(B) ∼ (1/√(πn)) (2eπ/n)^{n/2}    (n → ∞).

    Computing the volume of a general n-dimensional convex body, even of a polytope, is NP-

    hard. For general convex bodies, every estimate of the volume that can be computed in

    polynomial time will err on some inputs with a factor that grows exponentially with the

    dimension.

    But if we allow randomized algorithms (using a random number generator), one can

    compute an estimate of the volume such that the probability that the relative error is larger

    than a prescribed ε > 0 is arbitrarily small.

    The basic tool is generating an approximately uniformly distributed random point in

    a convex body. The random point in the convex body is generated using Markov chains.

    The mixing rate of the Markov chain can be estimated using conductance. This in itself is

    an interesting geometric problem, using an “isoperimetric inequality” concerning subsets of

    convex bodies.

The algorithm sketched in this course has a running time of about n^{10}, which is polynomial in the dimension n, but rather large. Further improvements (quite elaborate) lead to a running time of O(n^4).

    5.1.1 What is a convex body?

A convex body is a compact and full-dimensional convex set in R^n. For algorithmic purposes, special classes of convex bodies occur with quite different descriptions in different situations.

Polyhedra are usually described as solution sets of systems of linear inequalities; but

    polytopes (bounded convex polyhedra) can also be specified by listing their vertices. These

    two descriptions are equivalent in the sense of classical mathematics, but from an algorithmic

    point of view, it may make a lot of difference to have one description or the other as input:

    for example, to maximize a linear objective function over the polytope is trivial if an explicit


list of the vertices is given, but quite difficult if the polytope is described by linear inequalities

    (it is just the task of solving linear programs).

    One meets many other forms of descriptions of convex bodies. In analysis, convex sets

    often arise as level sets or epigraphs of convex functions, or unit balls of norms. We want our

    results to be as independent from the specific description as possible. A way to achieve that

    is to describe a convex body K ⊆ Rn (as an input to an algorithm) by a membership oracle:a subroutine (oracle) deciding whether a point y ∈ Rn belongs to K or not. (We could allowthe oracle to err near the boundary of K.)

    In addition, we shall assume throughout that we know the center and radius r of some

    ball contained in K, and the radius R of another concentric ball containing K. One can

    apply an appropriate affine transformation (this is nontrivial, but outside the topic of this

course) after which both balls are centered at the origin, r = 1, and R = n^{3/2}.

    5.2 Lower bounds on the complexity

Theorem 5.1 Let A be a polynomial algorithm that computes a number V(K) for every convex body K given by a membership oracle, such that V(K) ≥ vol(K). Then for every large enough dimension n, there is a convex body K in R^n for which V(K) > 2^{0.99n} vol(K).

The proof shows that there is a trade-off between time and precision: if we want (say) a relative error less than 2, then we need exponential time. The error of 2^{0.99n} in the theorem can be replaced (with a more involved proof) by n^{0.99n}. This means that up to the coefficient of n in the exponent, the very rough estimate vol(B) ≤ vol(K) ≤ n^{3n/2} vol(B) cannot be improved in general.

Lemma 5.2 Let B be the unit ball in R^n, and let P ⊆ B be a convex polytope with p vertices. Then

    vol(P) ≤ (p/2^n) vol(B).

Proof. For every vertex v of the polytope P, consider the ball B_v for which the segment connecting v to the origin is a diameter (the “Thales ball” over the segment). We claim that these balls cover the whole polytope. Indeed, let u ∈ P; then the closed halfspace u^T x ≥ u^T u contains at least one point of the polytope (namely u), and hence it contains a vertex v of P. But then the angle 0uv is obtuse, and hence u ∈ B_v.

Since the diameter of B_v is at most half the diameter of B, we have vol(B_v) ≤ 2^{−n} vol(B). Using this, it is easy to estimate the volume of P:

    vol(P) ≤ ∑_v vol(B_v) ≤ p·2^{−n} vol(B).

This proves the Lemma. □


Proof of the theorem. Apply the algorithm A with the ball B′ = n^{3/2}B as its input. It returns a number V(B′) which satisfies V(B′) ≥ vol(B′).

Next, let S be the set of points which were asked from the oracle and which it declared to be in the ball. Let Q = {n^{3/2}e_1, . . . , n^{3/2}e_n, −n^{3/2}e_1, . . . , −n^{3/2}e_n}, and let P be the convex hull of S ∪ Q. (Throwing in the points of Q is a little technicality, which is needed since we must guarantee that the unit ball is contained in the convex hull of S.) Let p be the number of vertices of P.

If we apply the algorithm A with input P, then comparing its run with the previous run step-by-step, we see that it asks the same points from the oracle, and to these questions the oracle gives the same answers. Hence the final result must be the same, and so V(P) = V(B′). By the Lemma,

    V(P) = V(B′) ≥ vol(B′) ≥ (2^n/p) vol(P).

Since p is bounded by a polynomial in n, the theorem follows. □

    5.2.1 Monte-Carlo algorithms

    It is a very interesting fact that once we allow randomized algorithms, the situation changes

dramatically. Randomization brings Monte-Carlo algorithms to mind. We have already made the assumption that K is contained in the ball B′ = n^{3/2}B. Let us generate many random points in B′ (this is not difficult), and count how often we hit K. This gives us an estimate on the ratio of the volumes of K and B′, and we know the volume of B′. Are we done?

The problem is not generating a random point in the ball; this can be accomplished as follows. Let X_1, . . . , X_n be independent random variables from the standard normal distribution, and let Y be uniformly distributed in [0, 1]. Let

    Z_i = n^{3/2} Y^{1/n} X_i / √(X_1^2 + · · · + X_n^2);

then Z = (Z_1, . . . , Z_n) is uniformly distributed over B′.
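This recipe is straightforward to implement; a sketch (numpy), with a quick sanity check that the radial distribution comes out right:

```python
import numpy as np

def uniform_in_ball(n, radius, rng):
    # Direction: a normalized standard Gaussian vector (uniform on the sphere);
    # length: radius * Y^(1/n) with Y uniform in [0,1].  The result is a
    # uniform point in the ball of the given radius around the origin.
    x = rng.standard_normal(n)
    return radius * rng.uniform() ** (1.0 / n) * x / np.linalg.norm(x)

rng = np.random.default_rng(0)
n = 10
R = n ** 1.5                          # radius of B' = n^(3/2) B
pts = np.array([uniform_in_ball(n, R, rng) for _ in range(2000)])
norms = np.linalg.norm(pts, axis=1)
print("max norm:", norms.max(), "(must be at most", R, ")")
# Sanity check: the fraction of points with norm <= 0.9 R should be 0.9^n.
print("empirical:", (norms <= 0.9 * R).mean(), " expected:", 0.9 ** n)
```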

The problem is that the volume of K may be smaller than the volume of B′ by an exponential factor (in n). For example, if K is the cube whose vertices are all ±1-vectors, then its volume is 2^n, while the smallest ball containing it has volume ∼ (2eπ)^{n/2}/√(πn). Hence the first exponentially many random points will miss the body K. This method can be applied to estimate the ratio of the volumes of two convex bodies (one including the other) only if this ratio is not too small.

This suggests the trick: “connect” B and K by a sequence of convex bodies B = K_0 ⊆ K_1 ⊆ . . . ⊆ K_m = K, so that vol(K_i)/vol(K_{i+1}) ≥ 1/2. Then these ratios can be estimated by the Monte-Carlo method, and their product gives an estimate on the ratio vol(B)/vol(K).

Such a sequence is easily constructed: we can take e.g. K_i = K ∩ 2^{i/n}B. Trivially K_0 ⊆ K_1 ⊆ . . . . Since B ⊆ K, it follows that K_0 = B and K_m = K for m = ⌈(3/2)n log n⌉. Furthermore, since

    2^{1/n}K_i = 2^{1/n}K ∩ 2^{(i+1)/n}B ⊇ K ∩ 2^{(i+1)/n}B = K_{i+1},


we have

    vol(K_{i+1}) ≤ vol(2^{1/n}K_i) = 2 vol(K_i).
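The multiphase scheme can be illustrated in a toy dimension, where a uniform sample from each K_{i+1} can even be obtained by rejection from the surrounding ball (this defeats the purpose in high dimension, but shows the telescoping product). The cube [−1, 1]² as K, the seed, and the sample sizes are arbitrary choices of this sketch:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 2                                  # toy dimension
R = n ** 1.5                           # B' = R * B contains K

def in_K(x):                           # membership oracle for K = [-1,1]^n
    return np.all(np.abs(x) <= 1.0, axis=-1)

def uniform_in_disk(radius, size):
    x = rng.standard_normal((size, n))
    y = rng.uniform(size=size) ** (1.0 / n)
    return radius * y[:, None] * x / np.linalg.norm(x, axis=1)[:, None]

m = int(np.ceil(1.5 * n * np.log2(n)))          # number of phases
radii = [min(2 ** (i / n), R) for i in range(m + 1)]

vol = np.pi * radii[0] ** 2            # vol(K_0) = vol(B), the unit disk
for i in range(m):
    # Estimate vol(K_i)/vol(K_{i+1}) with K_i = K ∩ radii[i]*B, sampling
    # K_{i+1} by rejection from the disk of radius radii[i+1].
    pts = uniform_in_disk(radii[i + 1], 200000)
    hits = in_K(pts)
    small = hits & (np.linalg.norm(pts, axis=1) <= radii[i])
    vol /= small.sum() / hits.sum()    # vol(K_{i+1}) = vol(K_i) / ratio
print("estimated vol(K):", vol, "  true value: 4.0")
```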

However, estimating vol(K_i)/vol(K_{i+1}) by the Monte-Carlo method is not so easy; the key question in this algorithm is: how to generate a uniformly distributed random point in a convex body?

    5.2.2 Measurable Markov chains

As can be expected, we use Markov chains. But K is an infinite set, and we have studied

    finite Markov chains so far. One possibility is to take a sufficiently fine grid, and do random

    walk on the part of the grid inside K. This is doable, but not very efficient. It is better to

    generalize the theory of mixing time to Markov chains with an infinite underlying set.

Let (Ω, A) be a measurable space: a set Ω with a σ-algebra A of its subsets. To define a Markov chain with this underlying space, we specify, for every u ∈ Ω, a probability measure P_u on (Ω, A). We assume that for every A ∈ A the value P_u(A) is measurable as a function of u. Selecting w^0 from any starting distribution σ^0 on Ω, we can generate a walk, i.e., a sequence of random points w^0, w^1, w^2, . . . from Ω such that w^{i+1} is chosen from distribution P_{w^i} (independently of the values of w^0, . . . , w^{i−1}).

We say that a probability distribution π on (Ω, A) is stationary if, selecting one point from π and doing one step, the resulting point has distribution π. Explicitly,

    ∫_Ω P_u(A) dπ(u) = π(A)    for every A ∈ A.

Such a distribution may not exist, but we will only need Markov chains where it does.

Several notions and results can be extended to these kinds of Markov chains with little difficulty. Lazy walks can be defined as before. The variation distance of two probability distributions α and β is

    d_var(α, β) = sup_{A∈A} (α(A) − β(A)).

The quantity π_i p_{ij}, which came up often for finite chains, can be replaced by

    Q(A, B) = P(u ∈ A, u′ ∈ B) = ∫_A P_x(B) dπ(x)

(here A, B ∈ A, u is a random point from π, and u′ is obtained by making a step from u). Define

    Φ(A) = Q(A, Ω \ A) / (π(A) π(Ω \ A)).

The conductance of the chain is

    Φ = inf{Φ(A) : A ∈ A, 0 < π(A) < 1}.

It is not hard to see that Q(A, Ω \ A) = Q(Ω \ A, A), and this implies that

    P(u′ ∈ Ω \ A | u ∈ A) P(u ∈ A) + P(u′ ∈ A | u ∈ Ω \ A) P(u ∈ Ω \ A) = 2Φ(A) π(A) π(Ω \ A).

So we could define Φ(A) as half of the probability of crossing over between A and Ω \ A from a random point from the stationary distribution, divided by π(A) π(Ω \ A).

To use conductance in order to estimate mixing times, we have to generalize Corollary 3.11. The quantity π_min causes difficulty, since it cannot be directly generalized to the infinite case. The theorem below shows a way out (one needs a different proof from what we gave above).

Theorem 5.3 Suppose that a starting distribution satisfies σ(A) ≤ K π(A) for all A ∈ A. Then the lazy walk satisfies

    d_var(σ^t, π) ≤ √K (1 − Φ^2/16)^t

for all t ≥ 0.

    5.2.3 Isoperimetry

    The other auxiliary result we need is from geometry.

Theorem 5.4 Let K be a convex body in R^n with diameter d. Let K = K_1 ∪ K_2 ∪ K_3 be a partition of K into three measurable sets such that the distance of K_1 and K_2 is at least ε. Then

    vol(K_3) ≥ (ε/d) min{vol(K_1), vol(K_2)}.

    A long narrow cylinder shows that the bound is essentially sharp.

    5.2.4 The ball walk

We generate a random point in K by using a random walk with steps in a small ball. Let rB denote the ball centered at 0 with radius r. (For us the best choice will be r ≈ 1/√n, but for a while it is better to leave this as a parameter.) Let v^0 be drawn from some initial distribution σ on K. Given v^k, we generate a point u from the uniform distribution over the ball v^k + rB (centered at v^k and with radius r). If u ∈ K, we move to v^{k+1} = u; else, we let v^{k+1} = v^k. This defines a time-reversible Markov chain, called the ball walk in K, and the uniform distribution on K is its stationary distribution.
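The ball walk needs nothing beyond the membership oracle. A sketch (numpy; the cube stands in for K, and the seed and step count are arbitrary):

```python
import numpy as np

def ball_walk(in_K, x0, r, steps, rng):
    # Propose a uniform point in x + rB; move there if it lies in K,
    # otherwise stay put.  Returns the whole trajectory.
    n = len(x0)
    x, traj = np.array(x0, float), []
    for _ in range(steps):
        d = rng.standard_normal(n)
        u = x + r * rng.uniform() ** (1.0 / n) * d / np.linalg.norm(d)
        if in_K(u):
            x = u
        traj.append(x.copy())
    return np.array(traj)

n = 5
in_cube = lambda x: bool(np.all(np.abs(x) <= 1.0))   # toy body K = [-1,1]^n
rng = np.random.default_rng(2)
traj = ball_walk(in_cube, np.zeros(n), r=1 / np.sqrt(n), steps=20000, rng=rng)
print("all iterates in K:", all(in_cube(x) for x in traj))
print("empirical mean (should be near 0):", traj.mean(axis=0).round(2))
```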

We need to estimate the conductance of the ball walk. For i = 1, 2, let

    K′_i = {x ∈ K_i : vol((x + rB) ∩ K_i) ≥ (9/10) vol(rB)},

and K′_3 = K \ (K′_1 ∪ K′_2).

The key observation is that the distance of K′_1 and K′_2 is at least r/√n. In fact, consider two points x_1 ∈ K′_1 and x_2 ∈ K′_2, and assume (by way of contradiction) that |x_1 − x_2| < r/√n. Then


vol((x_1 + rB) ∩ (x_2 + rB)) > vol(rB)/5 (this is a nontrivial exercise in integration), and so either K_1 or K_2 meets (x_1 + rB) ∩ (x_2 + rB) in a set with volume at least vol(rB)/10. But this contradicts the definition of the K′_i.

So we can apply Theorem 5.4 to the decomposition K = K′_1 ∪ K′_2 ∪ K′_3. Using that the diameter of the bodies we deal with is bounded by n^{3/2}, we get

    vol(K′_3) ≥ (r/n^2) min{vol(K′_1), vol(K′_2)} ≥ (r/n^2) min{vol(K_1) − vol(K′_3), vol(K_2) − vol(K′_3)}.    (31)

Say the minimum on the right is vol(K_1) − vol(K′_3); rearranging, and using that r ≤ n^2,

    vol(K′_3) ≥ r·vol(K_1)/(r + n^2) ≥ r·vol(K_1)/(2n^2).

Thus we get

    vol(K′_3) ≥ (r/(2n^2)) min{vol(K_1), vol(K_2)}.    (32)

To estimate the conductance, notice that if we are at a point u ∈ K_i ∩ K′_3 and generate a random point u′ ∈ u + rB, then with probability at least 1/10, u′ will not be in K_i. Unfortunately, this can mean that u′ ∈ K_{3−i} or that u′ ∉ K; in the latter case, we stay in K_i by the rule of the walk.

For the (incomplete) analysis that follows, we are going to ignore the second possibility; it is small if the step-size r is small enough. With this simplification, we see that from (almost) every point u ∈ K′_3, we step to the other side of the partition {K_1, K_2} with probability at least 1/20. It follows that

    Q(K_1, K_2) ≥ vol(K′_3)/(20 vol(K)) ≥ (r/(40n^2)) · min{vol(K_1), vol(K_2)}/vol(K)

    = (r/(40n^2)) min{π(K_1), π(K_2)} ≥ (r/(40n^2)) π(K_1) π(K_2).

So (subject to the simplification of ignoring the steps leaving K) the conductance is at least r/(40n^2).

Careful estimation of the case that was ignored above and optimization of the parameters leads to the choice r = 1/√n, which gives a bound on the conductance of the order of n^{−5/2}. This gives a polynomial bound of n^5 on the mixing time.

    5.3 Random spanning tree

    Theorem 5.5 Consider a random walk on a graph G starting at node u, and mark, for each

    node different from u, the edge through which the node was first entered. The edges marked

    form a subtree T of G, which is spanning with probability 1, and every spanning tree occurs

    with the same probability.
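The procedure in Theorem 5.5 (known as the Aldous–Broder algorithm) is a few lines of code. The sketch below also tallies the trees produced on a small test graph of our choosing — a 4-cycle with one chord, which by the matrix-tree theorem has 8 spanning trees — to illustrate uniformity empirically:

```python
import random
from collections import Counter

def random_spanning_tree(adj, u, rng):
    # Walk from u; whenever a node is entered for the first time, record
    # the edge through which it was entered.  By Theorem 5.5 the recorded
    # edges form a uniformly random spanning tree.
    visited, tree, v = {u}, [], u
    while len(visited) < len(adj):
        w = rng.choice(adj[v])
        if w not in visited:
            visited.add(w)
            tree.append((v, w))
        v = w
    return frozenset(frozenset(e) for e in tree)

# A 4-cycle with one chord (0-2); it has 8 spanning trees.
adj = {0: [1, 2, 3], 1: [0, 2], 2: [0, 1, 3], 3: [0, 2]}
rng = random.Random(3)
counts = Counter(random_spanning_tree(adj, 0, rng) for _ in range(8000))
print("distinct trees found:", len(counts))
for tree, c in sorted(counts.items(), key=lambda kv: -kv[1]):
    print(sorted(tuple(sorted(e)) for e in tree), round(c / 8000, 3))
```

Each of the 8 trees should appear with empirical frequency close to 1/8.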


Proof. The proof uses a method called “coupling from the past”.

    It is easy to see that the first entrances to each node form a spanning tree.

We consider a directed graph H whose nodes are pairs (T, u), where T is a spanning tree of G and u is a node of G, which we call the root. From (T, u), draw a directed edge to (T′, v) if uv ∈ E(G) and T′ arises from T by deleting the first edge on the path from v to u and adding the edge uv. Let H denote the resulting digraph. Clearly each tree with root v has indegree and outdegree d(v) in H, and hence in the stationary distribution of a random walk on H, the probability of a spanning tree with a given root is proportional to the degree of the root (in G). If we draw a random spanning tree from this distribution, and then forget about the root, we get every spanning tree with the same probability.

A random walk on G induces a random walk on H as follows. Assume that we are at a node v of G, and at a node (T, v) in H, where T is a spanning tree. If we move along an edge vw in G, then we move to the node (T′, w) in H obtained by removing the first edge of the path from w to v and adding the edge vw to the current spanning tree. We can follow this random walk on H backwards as well, following the random walk on G backwards. We may assume that both random walks are infinite both to the past and to the future.

Consider a particular time, and let T and v denote the current tree and the current root.

Claim 1 The last exits from each node other than v (as oriented edges) form a spanning tree, oriented towards v.

Claim 2 The tree of last exits is T.

It follows that the last exits from each node other than the current node form a uniformly distributed random spanning tree at every time. Reversing time, the same follows for the first entrances. □

