
Introduction to Discrete MCMC for Redistricting

Daryl DeFord

May 15, 2019

1 Introduction

This is intended as a friendly introduction to the ideas and background of Markov Chain Monte Carlo (MCMC) sampling methods, particularly for discrete state spaces. Applications to political districting have created a renewed interest in these methods among political scientists and legal scholars and this introduction aims to present the underlying mathematical material in an interactive and intuitive fashion.

We will start with a review of some basic terminology from probability, before discussing Monte Carlo methods and Markov chains separately. Next, we will see how these methodologies combine to form one of the most useful algorithms ever invented. Finally, we will conclude by exploring the application of these methods to generating large ensembles of political districting plans, an approach that has been successful both in court challenges to gerrymandered maps and in reform advocacy.

1.1 Sage Examples

Throughout this document there are many links to interactive tools for experimenting with the concepts that are introduced. A web page organizing all of the interactive elements can be found here and all of the source code is available on GitHub. Each tool consists of an interactive terminal that allows you to vary the input values and provides many different types of plots and visualizations. This is a great opportunity to experiment with these concepts and build some extra intuition around the ideas.

2 Probability Background

This section provides a brief introduction to some terminology and ideas from probability that we will use throughout the rest of the piece. If this feels like familiar material, you should feel free to skip ahead to Section 3 below. More detailed background information can be found in Grinstead and Snell's Introduction to Probability (.pdf link) if this inspires you to read more.

2.1 Distributions and Random Variables

A probability distribution is a function that assigns a probability or likelihood to each of a collection of possible outcomes or states. An example is the result of rolling a die, where each face has an equal chance of being on top when the die stops moving. This is an example of a uniform distribution, where each outcome has exactly the same probability of occurring. An example of a non-uniform distribution is drawing a Scrabble tile from the bag - there are 12 'E' tiles and only 3 'G' tiles, so we are 4 times as likely to draw an 'E' as a 'G'.

We will use the terminology state to refer to a possible outcome of our random variable and state space to refer to the full collection of states. Thus, for the example of choosing a Scrabble tile¹, the state space is the set of alphabet tiles ('A' – 'Z' plus the blank tile ' '). A random variable is a function that maps elements of the state space to their probabilities. We will summarize these visually using histograms, where the x–axis represents the individual elements of the state space and the height of the bars represents the probability of each state occurring. Figure 1 shows the histograms corresponding to rolling a die and drawing a Scrabble tile.

¹ If you are unfamiliar with Scrabble, see Section 2 below.


(a) Rolling a die (b) Drawing a scrabble tile

Figure 1: Example Histograms. All the faces of the die are equally likely, so the random variable has a uniform distribution over all the states and all the bars are the same height. On the other hand, there are more 'A' tiles than 'Z' tiles in Scrabble, so the probability of drawing an 'A' is higher than drawing a 'Z'.

Example 1 (Non–uniform Dice). The die we were considering above is a little boring since each face just appears once. We might instead consider what would happen if we made dice with more faces or with repeated faces. The interact module here will let you experiment with the distributions that you can generate by designing your own face values. Figure 2 shows the histograms for some non–standard dice.

(a) 1,2,2,3,3,3 (b) 1,2,3,4,4,5,6,6,6,7,8,8,8,9,9

Figure 2: Histograms for Non–standard Dice. If you vary the number of faces or the values on the faces of a die, this changes the probability distribution of the values that you will see. Compared to the standard die in Figure 1(a), the first die (a) still has six faces but only three values with different frequencies, while the die represented by plot (b) has 15 faces, again with varying frequencies.

2.2 Expected Value

The expected value of a random variable is a weighted average of the values of the state space, where the weights are given by the probabilities of the individual states. This means that for each state, we multiply its value by its probability of occurring and then add them all up. For rolling a die, the expected value is 3.5 since each state occurs with equal probability:

\[ \tfrac{1}{6}\cdot 1 + \tfrac{1}{6}\cdot 2 + \tfrac{1}{6}\cdot 3 + \tfrac{1}{6}\cdot 4 + \tfrac{1}{6}\cdot 5 + \tfrac{1}{6}\cdot 6 = \tfrac{21}{6} = 3.5 \]


Notice that although we will never actually roll a 3.5 on a 6 sided die, it does represent a type of average value if we rolled the die many times and kept track of the results. To formalize this notion, Figure 3 below shows the result of exactly this process by simulating rolling a die 500 times and then measuring both the actual probabilities of each value that occurred and the average value across all the rolls over time. Notice that as we continue to roll, the average gets closer and closer to the expected value. Another way to express this is that the error, or difference between the empirical average and theoretical expected value, is heading towards zero the more times we roll the die.

The same calculation approach applies when the probabilities are not equal, it just changes the weights on the values. For example, if we roll a die whose faces are 1,2,2,3,3,3 like in Example 1, the expected value of a roll is:

\[ \tfrac{1}{6}\cdot 1 + \tfrac{2}{6}\cdot 2 + \tfrac{3}{6}\cdot 3 = \tfrac{14}{6} \approx 2.33 \]

To try this out by computing simulations of the expected values for your own dice examples, you can use the code here. Try to compute these values for the non–standard dice introduced in Example 1. The other experiment to try is to see how the results change as the number of rolls increases. We will discuss this idea of using more attempts to get better accuracy more fully in the next section, but it is good to think about how many steps it takes to get the empirical distribution to look like the theoretical distribution.

(a) Individual Rolls (b) Empirical Distribution

(c) Average Roll Value (d) Error

Figure 3: Expected value of 500 rolls of a regular die. Part (a) shows the individual roll values and (b) shows the histogram of those rolls. Notice that even though the theoretical distribution is uniform there are still differences between the number of times the values occurred in our experiment. Plot (c) shows how the average value of the rolls changes as the experiment progressed, with a red line showing the theoretical expected value of 3.5, while (d) shows that the difference between the experimental value and the theoretical value goes to zero.
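The experiment in Figure 3 is easy to reproduce. Here is a minimal sketch in Python (standard library only); the face values are a parameter, so the non–standard dice from Example 1 can be tried as well, and the function name and defaults are my own, not part of the linked widgets.

```python
import random

def running_average_of_rolls(faces, num_rolls=500, seed=0):
    """Roll a die with the given face values and track the running average."""
    rng = random.Random(seed)
    averages, total = [], 0
    for n in range(1, num_rolls + 1):
        total += rng.choice(faces)        # one roll, uniform over the listed faces
        averages.append(total / n)        # empirical average after n rolls
    return averages

print(running_average_of_rolls([1, 2, 3, 4, 5, 6])[-1])     # close to 3.5
print(running_average_of_rolls([1, 2, 2, 3, 3, 3])[-1])     # close to 14/6 ≈ 2.33
```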


Example 2 (Scrabble Tiles). The game of Scrabble uses 100 square tiles that are drawn from a bag. Each tile is labelled with a letter (or a space) and a number, which represents the score of the tile. The number of tiles and the score of each letter are displayed in the table below:

Letter      A  B  C  D  E   F  G  H  I  J  K  L  M  N
Frequency   9  2  2  4  12  2  3  2  9  1  1  4  2  6
Score       1  3  3  2  1   4  2  4  1  8  5  1  3  1

Letter      O  P  Q  R  S  T  U  V  W  X  Y  Z  ' '
Frequency   8  2  1  6  4  6  4  2  2  1  2  1  2
Score       1  3  10 1  1  1  1  4  4  8  4  10 0

Table 1: Frequencies and point values of Scrabble tiles.

The score of a word in Scrabble is just the sum of the scores of the corresponding letters. For example, the word “RANDOM” has score:

1 + 1 + 1 + 2 + 1 + 3 = 9.

Since we are mathematicians, we will usually use the word “word” in a more general sense to mean any string of letters (and spaces). So for us, the string ‘AHC PF VD’ will be a 9 letter word with score:

1 + 4 + 3 + 0 + 3 + 4 + 0 + 4 + 2 = 21.

This fact that the tiles have scores allows us to compute the expected value (score) of a randomly drawn tile:

\[
\tfrac{9}{100}\cdot 1 + \tfrac{2}{100}\cdot 3 + \tfrac{2}{100}\cdot 3 + \tfrac{4}{100}\cdot 2 + \tfrac{12}{100}\cdot 1 + \tfrac{2}{100}\cdot 4 + \tfrac{3}{100}\cdot 2 + \tfrac{2}{100}\cdot 4 + \tfrac{9}{100}\cdot 1 +
\tfrac{1}{100}\cdot 8 + \tfrac{1}{100}\cdot 5 + \tfrac{4}{100}\cdot 1 + \tfrac{2}{100}\cdot 3 + \tfrac{6}{100}\cdot 1 + \tfrac{8}{100}\cdot 1 + \tfrac{2}{100}\cdot 3 + \tfrac{1}{100}\cdot 10 + \tfrac{6}{100}\cdot 1 +
\tfrac{4}{100}\cdot 1 + \tfrac{6}{100}\cdot 1 + \tfrac{4}{100}\cdot 1 + \tfrac{2}{100}\cdot 4 + \tfrac{2}{100}\cdot 4 + \tfrac{1}{100}\cdot 8 + \tfrac{2}{100}\cdot 4 + \tfrac{1}{100}\cdot 10 + \tfrac{2}{100}\cdot 0
= \tfrac{187}{100} = 1.87
\]

A Scrabble version of the expected value experiment, using the tile values for scores, is presented below in Figure 4. Comparing the table to the histogram, we can check that there are many more tiles with score 1 in the Scrabble bag than any other value, which is reflected in our experiment. Notice that again after about 500 steps the empirical average has converged to very near the expected value of 1.87. You can design your own experiments using Scrabble tiles with the code here.

(a) Empirical Distribution (b) Expected Value

Figure 4: Results of simulating drawing 1,000 Scrabble tiles (with replacement). The distributions observed in this experiment are quite close to those calculated theoretically.
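As a companion to Figure 4, here is a small Python sketch (the variable names and layout are my own) that computes the exact expected score from Table 1 and then estimates it by drawing tiles with replacement.

```python
import random

# Frequencies and point values from Table 1 (' ' is the blank tile).
frequency = {'A': 9, 'B': 2, 'C': 2, 'D': 4, 'E': 12, 'F': 2, 'G': 3, 'H': 2,
             'I': 9, 'J': 1, 'K': 1, 'L': 4, 'M': 2, 'N': 6, 'O': 8, 'P': 2,
             'Q': 1, 'R': 6, 'S': 4, 'T': 6, 'U': 4, 'V': 2, 'W': 2, 'X': 1,
             'Y': 2, 'Z': 1, ' ': 2}
score = {'A': 1, 'B': 3, 'C': 3, 'D': 2, 'E': 1, 'F': 4, 'G': 2, 'H': 4,
         'I': 1, 'J': 8, 'K': 5, 'L': 1, 'M': 3, 'N': 1, 'O': 1, 'P': 3,
         'Q': 10, 'R': 1, 'S': 1, 'T': 1, 'U': 1, 'V': 4, 'W': 4, 'X': 8,
         'Y': 4, 'Z': 10, ' ': 0}

# Exact expected value: sum of score times probability over all 100 tiles.
exact = sum(frequency[t] * score[t] for t in frequency) / 100
print(exact)  # 1.87

# Monte Carlo estimate: draw tiles with replacement and average their scores.
bag = [t for t, count in frequency.items() for _ in range(count)]
rng = random.Random(0)
draws = [score[rng.choice(bag)] for _ in range(1000)]
print(sum(draws) / len(draws))  # close to 1.87
```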


3 Monte Carlo Sampling

Monte Carlo methods are built to take advantage of the fact that some things are hard to compute exactly but easy to evaluate for individual examples. This idea was formalized as a result of Stan Ulam's work on the Manhattan Project, but examples of these approaches for estimating computationally intensive values go back hundreds of years. Ulam was originally wondering about the probability of winning a particular type of non–deterministic solitaire game² and realized that he could estimate the probability by using a computer to shuffle the deck and evaluate whether or not the game was winnable.

To imagine a simple version of this, consider a “game” where you are presented with a shuffled deck of cards and win if the top three cards are in increasing order and lose otherwise. What is the probability that you win with a randomly shuffled deck? In principle, we could try to compute all the ways to form a 3-card increasing sequence and divide by all the ways to shuffle the deck:

52! = 80658175170943878571660636856403766975289505440883277824000000000000

While this seems hard in general, for any given shuffle it is easy to check whether or not you won – just look at the top three cards. This suggests a possible method for estimating the actual win rate: shuffle the deck many times and just compute whether or not the top three cards increase. Figure 5 shows an example of this experiment and you can try it yourself here. Notice that this is similar to what we saw in the previous section, where rolling more dice (or drawing more tiles) made our empirical estimate of the expected value better.

(a) 100 games (b) 1000 games (c) 100 games (d) 1000 games

Figure 5: Estimating the win rate of a simple, non–deterministic solitaire game. Plots (a) and (b) show whether or not each individual game was won, while (c) and (d) show the average win rate across all of the games. As the number of steps increases, the win rate appears to converge to a final value of approximately 0.165.

The same general outline that we applied to solitaire is common to most examples of Monte Carlo methods. Extracting the important steps, we get an algorithm that looks something like:

1. Draw an (independent) sample from the set of all possibilities

2. Compute some measure for each sample

3. Repeat many times

4. Average/aggregate the derived data

Notice that this is exactly the approach that we used to compute expected values for dice rolls and Scrabble draws in Section 2.2. In both cases, it is easy to repeatedly draw the samples and measure the values, and the final estimate converged rapidly to the correct theoretical value that we calculated. The fact that it is usually simple to apply this approach to get good estimates of empirical values has led to it becoming one of the most common techniques in the numerical analysis tool kit. The examples below explore a couple of other specific applications, but every field that has a numerical component makes use of this methodology at some point.
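The following is a minimal Python sketch of those four steps applied to the solitaire game above. It assumes one particular reading of the rules, namely that every card in the deck carries a distinct value so that ties cannot occur, which is consistent with the roughly 1/6 win rate seen in Figure 5; the function names are my own.

```python
import random

def increasing_top_three(deck):
    """Step 2: the measurement -- did we win this shuffle?"""
    a, b, c = deck[0], deck[1], deck[2]
    return a < b < c

def estimate_win_rate(num_games=1000, seed=0):
    rng = random.Random(seed)
    deck = list(range(52))          # each card gets a distinct value
    wins = 0
    for _ in range(num_games):      # step 3: repeat many times
        rng.shuffle(deck)           # step 1: draw an independent sample
        wins += increasing_top_three(deck)
    return wins / num_games         # step 4: aggregate the results

print(estimate_win_rate())  # close to 1/6 ≈ 0.167
```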

² A non–deterministic game is one in which the player cannot make any choices that impact the outcome of the game. Whether or not these should actually be called games is a philosophical question, not a mathematical one.


Example 3 (Distances in a cube). Although these steps seem abstract we can apply them to a seemingly simple example: What is the expected distance between two points randomly drawn in a unit cube? Although this problem has a mathematical formulation

\[ \int_0^1\!\int_0^1\!\int_0^1\!\int_0^1\!\int_0^1\!\int_0^1 \sqrt{(x_1 - y_1)^2 + (x_2 - y_2)^2 + (x_3 - y_3)^2}\, dx_1\, dx_2\, dx_3\, dy_1\, dy_2\, dy_3 \]

and a mysterious looking exact solution

\[ \frac{4 + 17\sqrt{2} - 6\sqrt{3} + 21\log(1+\sqrt{2}) + 42\log(2+\sqrt{3}) - 7\pi}{105} \]

this is a perfect problem to try out the Monte Carlo method³. Trying to solve the problem directly would require us to consider how to compare arbitrary pairs of points, but it is simple for us to instead select pairs of points uniformly in the cube – this is our method for sampling from the possible inputs (1). Then, it is easy to measure the distance between each pair of points that we select – this is our measurement of the value of our sample (2). Finally, we repeat this 1,000 times (3) and compute the average across all of our trials (4).

These steps are summarized in Figure 6. In (A) we see 1000 pairs of points connected by lines. Part (B) shows the lengths of each line individually; these look randomly scattered in space but part (C) shows the underlying structure of the average of the points. Although the distances between individual pairs can be quite far from the expected value, the average converges quite rapidly to ∼ 0.66233348..., which is just a hair larger than the actual value of ∼ 0.661707.... If we continued to select pairs of points this average would continue to get closer and closer to the true value.

(a) n pairs of points (b) Distances between pairs (c) Average distance between pairs

Figure 6: This figure shows the steps of the Monte Carlo process for estimating the average distance between arbitrary pairs of points in a cube. Plot (A) shows 1000 randomly chosen pairs of points, plot (B) shows the distances between those pairs and plot (C) shows the cumulative average of the distances. Notice that even though the pairwise distances seem to be random, the average converges rapidly to the correct value.
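A minimal Python sketch of this experiment, assuming the standard-library random module is an acceptable stand-in for the interactive widget; the function name and defaults are my own.

```python
import math
import random

def estimate_mean_cube_distance(num_pairs=1000, seed=0):
    """Average distance between pairs of uniform points in the unit cube."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_pairs):
        p = [rng.random() for _ in range(3)]   # step 1: sample a pair of points
        q = [rng.random() for _ in range(3)]
        total += math.dist(p, q)               # step 2: measure the distance
    return total / num_pairs                   # steps 3-4: repeat and average

print(estimate_mean_cube_distance())  # should land near 0.66
```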

Example 4 (Estimating π). Another example⁴ is estimating the value of π, using the area of a circle. In this case, we can select individual points uniformly in the unit square (1) and for each point it is easy to tell whether it is inside or outside the circle (2). Again, we repeat this 1,000 times (3) and divide the number of points that landed inside the circle by the total (4). Since the part of the circle that lies inside the unit square is a quarter circle with area π/4, multiplying this proportion by 4 gives an estimate of π. This example is shown in Figure 7. Part (A) shows 1000 random points in the unit square while (B) shows the resulting running estimate converging to π.

³ An interactive widget for exploring this problem is here.
⁴ An interactive widget for exploring this problem is here.


(a) 1000 uniform points in the unit square (b) Proportion of points inside circle

Figure 7: This figure shows a Monte Carlo experiment for estimating π. We start by drawing 1,000 random points in the plane (A) and then calculate the proportion of those that lie within the circle (B).
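A sketch of this experiment in Python; the factor of 4 converts the quarter-circle proportion into an estimate of π, and the function name is my own.

```python
import random

def estimate_pi(num_points=1000, seed=0):
    """Estimate pi from the fraction of uniform points inside the quarter circle."""
    rng = random.Random(seed)
    inside = 0
    for _ in range(num_points):
        x, y = rng.random(), rng.random()      # uniform point in the unit square
        inside += (x * x + y * y <= 1.0)       # inside the quarter circle?
    return 4 * inside / num_points             # the quarter circle has area pi/4

print(estimate_pi())  # roughly 3.14
```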

4 Markov Chains

A Markov chain is a special sequence of random variables, where the distribution of each variable only depends on the outcome of the previous variable, not any of the earlier outcomes. Examples include children's games such as Chutes and Ladders and Monopoly, where the probability of landing on a particular square on your turn is completely determined by your current square, as well as more complex processes such as Google's original PageRank algorithm, which models the behavior of a random web–surfer moving from page to page by clicking hyperlinks.

The simplest case is when each variable is drawn from the same distribution, i.e. there is no dependence on the previous values at all. This is the case for our dice and Scrabble examples - as long as we roll the same die or replace the tile that we drew at the previous step, the distribution is the same for every experiment. In this case, every random variable is the same, so it definitely doesn't depend on the output of the previous process and hence satisfies the Markov condition.

For an example where the variables do depend on the previous outcomes, consider an ant walking on a keyboard. Our state space will be the individual letters and the space bar. At each step, the ant can move from its current key to an adjacent one, chosen uniformly from those neighbors. For example, if the ant is on the ‘E’ key it can move to any of ‘W’, ‘S’, ‘D’, or ‘R’ with probability 1/4 for each, while if it is on ‘H’ it could move to any of ‘Y’, ‘G’, ‘B’, ‘N’, ‘J’, or ‘U’ with probability 1/6. Figure 8 shows the letters and their connections and an animation of the ant walking can be viewed here.

(a) Keyboard Adjacency (b) Transition Probabilities

Figure 8: The adjacency structure of a standard keyboard. We can define a Markov chain by moving uniformly between adjacent keys with probabilities shown in (B).


To interpret this as a Markov chain, note that each key corresponds to a particular distribution over its neighbors that only depends on where the ant is currently standing, not how it got to that key. Thus, the ant walk satisfies the Markov property, since the possible next steps only depend on the key it is currently standing on. Additionally, for any pair of keys α and β, we can compute the probability that the ant steps from α to β by:

\[ \mathbb{P}(\alpha \mapsto \beta) = \begin{cases} \dfrac{1}{\#\{\text{neighbors of } \alpha\}} & \text{if } \alpha \text{ is next to } \beta \\ 0 & \text{otherwise.} \end{cases} \]

These transition probabilities are enough to determine the full structure of the Markov chain, since for each state, we know exactly what the likelihood is that we will end up in any other state.

All Markov chains (on a discrete state space) can be formulated in a similar fashion. We begin by specifying the set of possible objects, like the squares on the Chutes and Ladders board or the symbols appearing in a text, which are referred to as “states”. Then for each state we describe the probability of transitioning to each other state. This is all the information that is needed to describe a Markov chain. Frequently, these are represented by a directed graph, where each node represents a single state and the directed edges represent the probability of transitioning between the states.

It is common to relate the Markov chain to a “random walk” on this type of digraph. We imagine a walker beginning on an arbitrary vertex and walking along the edges of the graph, visiting one node at each time step. To choose the next state to visit, the walker chooses from among the directed edges leaving the current node, proportional to their weights. To see that this process satisfies our definition of a Markov chain, note that the walker only uses information about the edges leaving the current node to determine his possible transitions, so the random variable describing the walker's path only depends on the current state. We will return to this terminology several times throughout the remainder of this article.
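A minimal Python sketch of such a random walk. The adjacency dictionary below is only a small, hypothetical fragment of the keyboard graph in Figure 8, written out by hand; it is enough to see the Markov property in action.

```python
import random

# A small fragment of the keyboard adjacency structure (hypothetical subset;
# the full graph in Figure 8 has every letter key plus the space bar).
neighbors = {
    'q': ['w', 'a'],
    'w': ['q', 'e', 'a', 's'],
    'e': ['w', 'r', 's', 'd'],
    'r': ['e', 't', 'd', 'f'],
    'a': ['q', 'w', 's'],
    's': ['a', 'w', 'e', 'd'],
    'd': ['s', 'e', 'r', 'f'],
    'f': ['d', 'r', 't'],
    't': ['r', 'f'],
}

def random_walk(start, num_steps, seed=0):
    """Simulate the ant: each step moves uniformly to an adjacent key."""
    rng = random.Random(seed)
    state, path = start, [start]
    for _ in range(num_steps):
        state = rng.choice(neighbors[state])   # Markov property: only the
        path.append(state)                     # current key matters
    return path

print(''.join(random_walk('a', 20)))
```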

Example 5 (Alphabet Paths and Cycles). Again using the alphabet as our state space, we will consider building a couple of particularly simple Markov chains. The first, which we will call the alphabet path, will form a graph where each letter is connected to its neighbors in the alphabet with the space ‘ ’ occurring after the ‘z’. For example, ‘a’ will only be connected to ‘b’, but ‘g’ will be connected to ‘f’ and ‘h’. The second, which we will call the alphabet cycle, will have a similar structure except we will connect the ‘a’ and ‘ ’ states so that the graph forms a ring. Figure 9 shows the two underlying graphs for our Markov chains.

(a) Path (b) Cycle

Figure 9: The alphabet path and cycle graphs.

The transition probabilities for these Markov chains are easy to express. For the cycle, the probability of transitioning to any neighbor in the alphabet is exactly 1/2 and 0 to transition to any non–neighbor. For the path, this is true for every state except for ‘a’ and ‘ ’, for which the transition probabilities are 1 to ‘b’ and ‘z’ respectively and 0 for anything else. We will make use of these examples again in the following section when we consider how Markov chains can be used to study distributions over state spaces.


Example 6 (Text Generation). One of the first applications of Markov chains was the analysis of text passages, trying to predict the next letter that would appear in a book written by a given author. Similar methods are used for auto–complete functions in today's cellphones. Symbols in text are not distributed uniformly, as q is almost always followed by u, periods are followed by spaces, and the letter “e” is most commonly found at the end of a word. Given a long passage of text, we can compute how often each symbol follows each other symbol and use these proportions to generate new text. In this case, a state is the current letter and the probability for choosing the next letter only depends on the current letter.

While using single characters does not return recognizable words, we can extend our states to include several sequential characters to get more interpretable results. Below are some examples generated using the story “Aladdin and the Magic Lamp” from the Arabian Nights using single characters, pairs of characters, triples of characters, and quadruples of characters. The first line is 50 characters chosen uniformly, for comparison. The single character transition probabilities are shown in Figure 10(a) below while Figure 10(b) shows the digraph for the symbol transitions in Aladdin.

(Uniform) ,Kni;;.RgkY:f;;.?ACKKDFtjaBD-vjaIAezAFO-hOzOe?NAm

(1) y mpo fewathe he m, main, wime touliance handddd

(2) If ho rembeautil wind was nearsell ith sins. He don the whimsels hed his

the my mign for atim, but

(3) but powerful not half-circle he great the say woman, and carriage, she sup

window." He said feast father; "I am riding that him the laden, while

(4) as he cried the palace displeasant stone came to him that would not said:

"Where which was very day carry him a rocs egg, and horseback."

Then the might fetched a napkin, which were hunting in the

(a) Transition Probabilities (b) Digraph

Figure 10: Plot (A) shows the heatmap of the letter to letter transition probabilities in Aladdin and the Magic Lamp. Notice that there is almost no probability of transitioning from a capital letter to another capital and that the most common symbol after “A” is “l” from the title character's name. Plot (B) is the associated digraph with edge widths proportional to importance.

There turns out to be a significant amount of interesting structure in this type of analysis, as simply looking at the transition matrices can frequently be enough to distinguish authors from each other or poetry from prose. Some cleaned collections of copyright free texts, along with some MATLAB code for extracting the transition probabilities and analyzing the corresponding Markov chains, can be downloaded here. A guide to using the code is here and an essay prompt making use of this data that I have assigned in classes on mathematical modeling can be seen here.
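A small Python sketch of the basic idea (the tools linked above do much more): count how often each symbol follows each length-k state and then sample new text from those counts. The toy training string, function names, and the choice of k are all my own.

```python
import random
from collections import Counter, defaultdict

def build_chain(text, order=1):
    """Count how often each symbol follows each length-`order` state."""
    counts = defaultdict(Counter)
    for i in range(len(text) - order):
        state, nxt = text[i:i + order], text[i + order]
        counts[state][nxt] += 1
    return counts

def generate(counts, start, length=50, seed=0):
    """Sample new text by repeatedly drawing the next symbol from the chain.
    `start` must have the same length as the `order` used in build_chain."""
    rng = random.Random(seed)
    out = start
    for _ in range(length):
        state = out[-len(start):]
        symbols, weights = zip(*counts[state].items())
        out += rng.choices(symbols, weights=weights)[0]
    return out

sample_text = "the quick brown fox jumps over the lazy dog. " * 20
chain = build_chain(sample_text, order=2)
print(generate(chain, "th", length=60))
```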


4.1 Markov Distributions

In our introduction to Markov chains above, we mostly focused on the trajectories of individual samples from the chain, like the actual keys stepped on by an ant or the specific letters drawn from a text. To connect these chains to our discussion of probability distributions in Section 2 we first observe that we can use the transition probabilities to compute the distribution over the state space at each step of the Markov chain. That is, while previously we imagined the ant on the ‘q’ key flipping a coin and stepping to ‘w’ if the coin was heads and ‘a’ if the coin was tails, instead we simply note that at the next step, the ant is at state ‘a’ with probability 1/2 and at ‘w’ with probability 1/2.

Thinking about Markov chains in this probabilistic way allows us to start to answer questions about where we might expect the ant to be after a hundred or a thousand steps, by assigning a probability to each possible outcome in the state space. Thinking back to Section 3, we might simply suggest running the experiment thousands of times, placing the ant on the ‘q’ and letting it take a hundred steps and recording the final position. Figure 11 shows an example of this experiment on the keyboard, path, and cycle graphs and you can try it out yourself here.

(a) Expected Path (b) 100 Path Steps (c) 1000 Path Steps

(d) Expected Cycle (e) 100 Cycle Steps (f) 1000 Cycle Steps

(g) Expected Keyboard (h) 100 Keyboard Steps (i) 1000 Keyboard Steps

Figure 11: Example walks on the keyboard, path, and cycle graphs. For each experiment we started at the letter ‘a’ and took 100 or 1000 steps of the Markov chain, recording the final state reached. For each walk, we repeated the experiment 1000 times and plotted the resulting frequencies. The left column shows the long term expected behavior of the steady state distribution.

A potential issue with this experimental method is that instead of trying to estimate a single number like the expected value, we are trying to estimate a probability for each element of the state space. This gets us right back into the problem that we were trying to avoid by sampling. Luckily, for our small examples, the alphabet only has 26 letters, so we might feel that we are getting reasonable results after a few thousand steps. However, if we are instead considering all possible 26^7 = 8,031,810,176 seven letter words, we should feel less confident. Additionally, the fact that we only observe “even–numbered” letters in the path walk suggests that the chain has a special property we should investigate further. We will turn to this in Section 4.1.1 but next we will discuss a technique to determine these long-term probabilities without sampling.

In order to probe this question more generally, imagine that our state space is some set of n items


S = {s_1, s_2, . . . , s_n} and we have computed the transition probabilities T_{i,j} between each pair of states. Then, given a probability vector P whose entries P_i tell us the current probability that we are in any particular state i, we can compute the probability that we are at state j in the next step of the Markov chain as:

\[ Q_j = \sum_{i=1}^{n} P_i T_{i,j}. \]

That is, summing up over all states i, the probability that we were at i and then transitioned to j. If we record the initial state/distribution of our Markov chain as a vector P^0, then we can compute the probability distributions at each successive step recursively with the same operation:

\[ P^{k+1}_j = \sum_{i=1}^{n} P^{k}_i T_{i,j}. \]

This mathematical formalism will help us understand the general properties of Markov chains and provide us with some useful notation. In particular, these probabilities are exactly what we were trying to estimate in Figure 11. For those of you with some exposure to linear algebra, note that this is just the matrix–vector product of the transition matrix with the probability vector.

We can explore this a little more closely using the example chains we introduced in the previous section. For the path, cycle, and keyboard walks, we again begin at ‘a’ but instead of sampling individual states, we instead use the equation above to compute the exact probabilities of arriving at each other state. These experiments are summarized in Figure 12 below and show the slow diffusion of probabilities across the letters. You can experiment with these examples yourself here. It is particularly enlightening to compare different starting states across the various Markov chains.
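For readers who want to see the matrix–vector form concretely, here is a short sketch using numpy on a tiny three-state chain of my own invention (the alphabet chains work the same way, just with a 27-by-27 matrix).

```python
import numpy as np

# A toy 3-state transition matrix T: row i gives the probabilities of moving
# from state i to each state j (each row sums to 1).
T = np.array([[0.0, 0.5, 0.5],
              [0.5, 0.0, 0.5],
              [1.0, 0.0, 0.0]])

P = np.array([1.0, 0.0, 0.0])    # start with all probability on state 0

for k in range(50):
    P = P @ T                    # P^{k+1}_j = sum_i P^k_i T_{i,j}

print(P)                         # approaches the stationary distribution
```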

(a) 10 Path Steps (b) 20 Path Steps (c) 50 Path Steps (d) 1000 Path Steps

(e) 10 Cycle Steps (f) 20 Cycle Steps (g) 50 Cycle Steps (h) 1000 Cycle Steps

(i) 10 Keyboard Steps (j) 20 Keyboard Steps (k) 50 Keyboard Steps (l) 1000 Keyboard Steps

Figure 12: Probabilistic versions of Figure 11. We compute the probabilities of arriving at each letter in a walk of length k (y–axis) starting at ‘a’. Unlike the previous approach, these plots represent the exact probabilities.


These figures highlight a fascinating property of many Markov chains: No matter what initial distribution is used to start the chain, after a sufficiently large number of steps, the distribution will converge to a fixed, stationary distribution and continue to remain at that distribution no matter how many further steps are taken. In order to describe the types of chains that have this property, we need to discuss some properties of Markov chains that make them amenable to analysis.

4.1.1 Periodicity

The first adjective we will consider is periodicity. The period of a Markov chain is the greatest common divisor of all possible lengths of cycles that occur in the graph associated to the chain. A chain is said to be aperiodic if the period is one. Looking at our example chains, we can see that the keyboard is aperiodic since ‘q’-‘w’-‘q’ is a length 2 cycle and ‘q’-‘w’-‘a’-‘q’ is a length 3 cycle. Similarly, the cycle chain is aperiodic since ‘a’-‘b’-‘a’ is a length 2 cycle and following the entire cycle around is a length 27 cycle. However, the path chain has period 2, since if we number every letter with a=1, b=2, etc. then every step of the chain takes us from an even number to an odd number or vice versa and hence every cycle has even length.

This property explains the “oddness” of the plots in Figure 11 (b) and (c): since we sampled 100 and 1000 step paths, we could only get outputs that were even numbered letters. In order to convert a periodic chain to an aperiodic one we can add a small probability of remaining in place to each step of the walk, since adding length one cycles forces the gcd to be one. A walk with such a probability of not moving at each step is frequently called a “lazy” walk and adding these “waiting probabilities” is a common trick in this setting.

4.1.2 Reducibility

In addition to periodicity, an important feature of Markov chains is irreducibility. A Markov chain is irreducible if each state can be reached from any other state in a finite number of steps. All of the examples that we have encountered so far have this property. For example, on the keyboard, an ant starting at any key can reach any other key in at most 9 steps, while on the path, it takes 26 steps to get from ‘a’ to ‘ ’.

4.2 Ergodicity

Markov chains that are both aperiodic and irreducible are called ergodic. These chains have the property that there is a unique stationary distribution that is the limit of the probability distributions formed by iterating the Markov chain. A stationary distribution P satisfies the property that:

\[ P_j = \sum_{i=1}^{n} P_i T_{i,j} \]

for each state j. In words, this says that applying the Markov chain (equivalently, the transition probabilities) to the steady state probabilities leaves those probabilities unchanged.

For chains that are formed by simple random walks on graphs, like our path, cycle, and keyboard examples, the probabilities in this distribution are proportional to the number of neighbors of each state in the graph. This explains the left column of Figure 11: the states of the keyboard with the most neighbors, and correspondingly the largest probabilities, are s, d, f, g, h, j in the center row, while the keys with the fewest neighbors and smallest probabilities are ‘p’ and ‘q’ in the corners. On the other hand, in the cycle, every letter has two neighbors, so the distribution is uniform.

Even better than the existence and uniqueness of stationary distributions, given a function defined on our state space that we are interested in evaluating, if we draw samples according to the Markov chain, the average of the function values evaluated on the samples converges to the expected value over the stationary distribution. We saw some trivial examples of this already with our experiments in Section 2.2, recalling that repeated draws from a fixed distribution form a simple Markov chain.
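Here is a quick numerical check of the degree-proportionality claim, in Python with numpy. The graph is a toy five-node path (a stand-in for the alphabet path), and the walk is made lazy, as in Section 4.1.1, so that the iterated distribution actually settles down; all the names here are my own.

```python
import numpy as np

# Simple random walk on a small path graph a-b-c-d-e.
n = 5
A = np.zeros((n, n))
for i in range(n - 1):
    A[i, i + 1] = A[i + 1, i] = 1          # adjacency matrix of the path

degrees = A.sum(axis=1)
T = A / degrees[:, None]                   # uniform step to each neighbor

# Make the walk lazy (stay put with probability 1/2) so it is aperiodic,
# then iterate the distribution until it stops changing.
T_lazy = 0.5 * np.eye(n) + 0.5 * T
P = np.full(n, 1 / n)
for _ in range(1000):
    P = P @ T_lazy

print(P)   # close to degrees / degrees.sum(), i.e. (1, 2, 2, 2, 1) / 8
```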


We now define score functions to evaluate on our simple Markov chains:

• Uniform: We assign each letter a score of 1.

• Scrabble Points: We assign each letter the score that is on the tile, as in Section 2.

• Scrabble Count: We assign each letter the number of tiles with that letter that appear in the Scrabble bag, as in Section 2.

• Alphabetical: We assign each letter a score based on its position in the alphabet ‘a’=1, ‘b’=2, ..., ‘z’=26, and ‘ ’=27.

• Vowels: We assign 1 to each consonant, 100 to each vowel, and 50 to ‘y’.

Using these score functions, we can use our simple Markov chains to compute the expected values of the scores with respect to the stationary distribution of the chain. Figure 13 shows these results for the three chains and four non–uniform score functions. The theoretical expected values are shown in red and the empirical values do appear to converge in each of our experiments, verifying the theorem statement referenced above. You can try this out on your own here; it is particularly instructive to adjust the length of the different chains and see how that impacts the accuracy of the final estimate.

(a) Path –Points (b) Path – Count (c) Path – Alphabet (d) Path – Vowel

(e) Cycle – Points (f) Cycle – Count (g) Cycle – Alphabet (h) Cycle – Vowel

(i) Keyboard – Points (j) Keyboard – Count (k) Keyboard – Alphabet (l) Keyboard – Vowel

Figure 13: Estimating expected values with three walks and four scores. For each of our three simple Markov chains on letters and the Scrabble Points, Scrabble Count, Alphabetical, and Vowel score functions, we use the Markov chain to estimate the expected value of the score across the steady state distribution.

While it is true that the values in each experiment eventually become a good estimate for the actual value, it is not true that each provides the same accuracy given a fixed number of steps. Table 2 below compares the theoretical expected values to the estimates obtained from each Markov chain by increasing the number of samples. These experiments can be reproduced here, including the actual values at each step in the chain, as in Figure 3. Note that we should also not expect perfect accuracy over these lengths due to correlation between steps and the effects of the initial samples drawn before reaching the steady state.


Walk      Score     Actual Value   2k steps        10k steps       50k steps       100k steps

Keyboard  Points    3.075          3.129 (1.7%)    3.067 (0.3%)    3.089 (0.4%)    3.072 (0.1%)
Keyboard  Count     3.633          3.733 (2.7%)    3.587 (1.3%)    3.644 (0.3%)    3.638 (0.1%)
Keyboard  Alphabet  13.292         13.12 (1.2%)    13.34 (0.4%)    13.32 (0.2%)    13.30 (0.05%)
Keyboard  Vowels    19.13          21.02 (9.9%)    19.36 (1.2%)    19.53 (2.1%)    18.91 (1.2%)

Cycle     Points    3.22           3.17 (1.7%)     3.18 (1.2%)     3.19 (0.9%)     3.20 (0.7%)
Cycle     Count     3.70           4.00 (8%)       3.88 (4.9%)     3.71 (0.3%)     3.71 (0.2%)
Cycle     Alphabet  14             14.5 (3.5%)     14.32 (2.3%)    14.1 (0.7%)     13.88 (0.8%)
Cycle     Vowels    21.15          21.36 (1.0%)    21.76 (2.9%)    20.97 (0.8%)    21.22 (0.4%)

Path      Points    3.327          3.67 (10.3%)    3.18 (4.5%)     3.33 (0.02%)    3.31 (0.4%)
Path      Count     3.635          3.85 (5.9%)     3.68 (1.2%)     3.69 (1.6%)     3.59 (1.0%)
Path      Alphabet  14             15.75 (12.5%)   13.99 (0.07%)   14.04 (0.3%)    14.07 (0.5%)
Path      Vowels    20.02          18.69 (6.6%)    19.70 (1.6%)    19.29 (3.6%)    20.03 (0.07%)

Table 2: Convergence time comparison for expected value experiments. Percent error in parentheses.

Looking over this data suggests that it takes many more samples from the path chain to get a good estimate for the expected value than for the other walks. The speed at which the Markov chain converges to its stationary distribution from an arbitrary starting point is known as the mixing time of the chain. If you experiment with the code provided here you will see that not only does the path require more steps to achieve a particular level of accuracy on average but it also displays a higher variance. Mixing times are of great practical importance because they determine how many samples must be drawn to guarantee good convergence of our estimators.

4.2.1 Reversibility

A Markov chain is reversible if its steady state distribution satisfies a symmetry condition known as detailed balance. This condition states that in the steady state, the probability of being at state i and transitioning to state j is equivalent to the probability of being at state j and transitioning to state i. In mathematical notation, if P represents the steady state vector and T the transition matrix, then

\[ P_i T_{i,j} = P_j T_{j,i}. \]

Reversible chains have many nice properties and this symmetry condition means that the steady state distributions are particularly easy to analyze. Additionally, simple random walks on graphs are automatically reversible. Thus, instead of having to check that a Markov chain satisfies the detailed balance condition above, we can instead construct a graph that has that chain as its simple random walk. The Aladdin text chain in Example 6 is an example of a non–reversible chain.

5 Markov Chain Monte Carlo

It is finally time to put everything together. The key step of MCMC is to create an irreducible, aperiodic Markov chain whose steady state distribution P is a specific distribution we are trying to sample from. The same property that made problems tractable for Monte Carlo analysis (ease of evaluating the properties of a sample) also turns out to be useful for constructing Markov chains to draw from a specific distribution. In our Monte Carlo methods we required that the samples from our space were drawn uniformly but this is not always easy or desirable. Using a Markov chain to select our samples gives us a way to sample from a desired distribution (the steady state of the Markov chain) without having to know the specific probabilities associated to that distribution, just the transition probabilities.

This was the key idea that was exploited by Metropolis and coauthors⁵ in 1953 to combine these methods to form what we now call MCMC. As with the original Monte Carlo method, this was yet another collaborative success of the Manhattan Project, applied to the statistical mechanics of atomic particles.

⁵ This was another success story from the Manhattan Project.


These ideas were further developed by Hastings and others and have come to be one of the most fundamental computational tools in all of computer science and statistics. In 2000, the IEEE described MCMC sampling as one of the top 10 most important algorithms of the 20th century.

In most cases, the transition probabilities are determined by a multiple of the distribution that we are trying to sample from. Although this seems like an odd condition – “How could we ever know something proportional to a distribution without knowing the distribution itself?” – this turns out to be a common situation in many examples in physics as well as Bayesian statistics. For our purposes, this arises when we have a score function or a ranking on our state space and want to draw proportionally to these scores, i.e. to prioritize states that score “better” under our metrics of interest. This is particularly useful in settings, such as all words of length n, where it would be difficult or impossible to compute the probabilities directly.

To place this in more mathematical language, imagine that we have a score function s on our state space X, like the Scrabble values on tiles discussed in the previous section, so that s : X → R. We then want to sample from the distribution where the states appear proportional to s. That is, element y ∈ X should appear with probability

\[ \mathbb{P}(y) = \frac{s(y)}{\sum_{x \in X} s(x)}. \]

Figure 14 shows these distributions for our four score functions. When X is very large, there is no way for us to compute this denominator directly. However, notice that we can compute ratios of probabilities, since the denominators cancel:

\[ \frac{\mathbb{P}(z)}{\mathbb{P}(y)} = \frac{s(z)/\sum_{x \in X} s(x)}{s(y)/\sum_{x \in X} s(x)} = \frac{s(z)}{s(y)}. \]

This is the trick that turns out to allow us to draw samples according to s without having to compute the denominator directly. Even more interestingly, Metropolis–Hastings MCMC works by transforming samples drawn using a known Markov chain into samples from a chain whose stationary distribution is proportional to our score function.

(a) Points (b) Counts

(c) Alphabet (d) Vowels

Figure 14: Distributions over letters that are proportional to the Scrabble Points, Scrabble Counts, Alphabet, and Vowels score functions.


In order to perform the Metropolis–Hastings procedure we need a random proposal function Q, which defines a distribution over the states given the current state, and a score function s. We will usually take the proposal function to be a pre–defined Markov chain on the states or, equivalently, the transition probabilities at each state. That is, Q_{x,y} = T_{x,y}. At each step of the MCMC process we will use Q to generate a new proposed state and compare the score of the proposal to the current state. If the score of the proposal is higher, we proceed to the new state; otherwise we move to the proposed state with probability given by the ratio of the scores and remain at the previous state the rest of the time. It is this possibility of remaining in place that transforms the stationary distribution to our desired values.

More formally, at each step of the Metropolis–Hastings chain X_1, X_2, . . . we follow this sequence of steps, assuming that we are currently at state X_k = y.

1. Generate a proposed state y′ according to Q_{y,y′}.

2. Compute the acceptance probability:

\[ \alpha = \min\left(1,\; \frac{s(y')}{s(y)} \cdot \frac{Q_{y',y}}{Q_{y,y'}}\right) \]

3. Pick a number β uniformly on [0, 1].

4. Set

\[ X_{k+1} = \begin{cases} y' & \text{if } \beta < \alpha \\ y & \text{otherwise.} \end{cases} \]

This new Markov chain is ergodic and reversible with steady state distribution proportional to s. This means that we can use it to analyze properties of this new distribution, even though we might not be able to compute any of the probabilities directly. The software here allows you to vary the proposal distribution, score distribution, starting state, and length of the chain to gain some intuition for the MCMC transformation.
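Here is a minimal Python sketch of the Metropolis–Hastings loop above. It uses a small, hypothetical fragment of the keyboard graph as the proposal chain Q and the Scrabble point values as the score function s; the dictionaries and function name are mine, and the widgets linked above are far more flexible.

```python
import random

# A small fragment of the keyboard graph (hypothetical subset) used as the
# proposal chain Q, together with an assumed score function s on those states.
neighbors = {'q': ['w', 'a'], 'w': ['q', 'e', 'a', 's'], 'e': ['w', 's'],
             'a': ['q', 'w', 's'], 's': ['a', 'w', 'e']}
score = {'q': 10, 'w': 4, 'e': 1, 'a': 1, 's': 1}   # Scrabble points

def metropolis_hastings(start, num_steps, seed=0):
    rng = random.Random(seed)
    state, samples = start, [start]
    for _ in range(num_steps):
        proposal = rng.choice(neighbors[state])               # step 1: propose
        q_forward = 1 / len(neighbors[state])                 # Q_{state, proposal}
        q_backward = 1 / len(neighbors[proposal])             # Q_{proposal, state}
        alpha = min(1.0, (score[proposal] / score[state]) *   # step 2: acceptance
                         (q_backward / q_forward))
        if rng.random() < alpha:                              # steps 3-4: accept
            state = proposal                                  # or remain in place
        samples.append(state)
    return samples

chain = metropolis_hastings('a', 10000)
print({k: chain.count(k) / len(chain) for k in score})  # roughly proportional to s
```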

Figure 15 shows a first example of MCMC. Here the proposal distribution is drawing a tile from the Scrabble bag and we want to sample proportionally to the Alphabetical score. Notice that although there are only 2 ‘ ’ tiles and 1 ‘z’ tile, the MCMC distribution sampled those states many more times than the ‘a’ tile, even though the ‘a’ tile was drawn much more frequently. This is an example of how the Metropolis–Hastings procedure of rejecting transitions and remaining in place changes the final distribution of samples. Notice that we will never be forced to remain at the ‘a’ tile, since its score is the worst of all of the states. As in our expected value experiments above, increasing the length of the chain increases the accuracy of the distribution.

(a) Proposed (b) 10k Accepted (c) 100k Accepted

Figure 15: The results of a length 10,000 MCMC run on letters. The proposals were generated according to the Scrabble tile distribution and the score function is alphabetical. In order to achieve this transformation, many transitions away from the high scoring tiles had to be rejected while we rarely remain at lower scoring tiles. Taking increasingly long chains (c) makes the output distribution closer to our desired distribution.


Having seen an example in action, we will step through some of the computations to see the numerical impacts of weighting, using the keyboard walk and the Scrabble score function. We begin by imagining that our current state is ‘a’. Following our algorithmic outline above:

1. We uniformly pick a key on the keyboard next to ‘a’: ‘q’.

2. We next need to compute some numbers in order to compute the acceptance probability:

• s(a) = 1
• s(q) = 10
• Q_{q,a} = 1/2 since ‘q’ has two neighbors
• Q_{a,q} = 1/3 since ‘a’ has three neighbors

These let us compute:

\[ \alpha = \min\left(1,\; \frac{10}{1} \cdot \frac{1/2}{1/3}\right) = \min(1, 15) = 1 \]

3. Uniformly pick β = 0.188256.

4. Set the next state to be ‘q’ since β < α.

This makes sense with our interpretation: since ‘q’ has a higher score than ‘a’ we should always accept this transition. Now we will try to take another step from ‘q’:

1. We uniformly pick a neighbor of ‘q’ and get ‘w’.

2. We again need to compute some numbers:

• s(q) = 10
• s(w) = 4
• Q_{q,w} = 1/2 since ‘q’ has two neighbors
• Q_{w,q} = 1/4 since ‘w’ has four neighbors

These let us compute:

\[ \alpha = \min\left(1,\; \frac{4}{10} \cdot \frac{1/4}{1/2}\right) = \min(1, 0.2) = 0.2 \]

3. Uniformly pick β = 0.7593544.

4. Set the next state to be ‘q’ since β > α.

This time, we proposed a state with a lower score and the move was rejected, so our current MCMC walk so far has visited states ‘a’, ‘q’, and ‘q’. However, there was a 20% chance that we would have accepted the proposal and moved to ‘w’ in order to more fully explore the space. Figure 16 displays the score values sampled from a similar walk, which displays the characteristic behavior of MCMC samples, exploring high values but also returning to low ones.

Figure 16: Traces of MCMC walks with the Alphabetical score function.


There are several natural cases in which the computation of the acceptance probability α can be simplified. The first is when we are trying to sample from a uniform distribution, in which case the score values are equal for all states and α is just equal to the ratio of the probabilities of transitioning between adjacent states. It is an instructive exercise to calculate the α values for all states in the path Markov chain when the score function is uniform. Figure 17 shows a helpful hint for this problem. The other simplification is when the two way transition probabilities are always equal, as is the case in the cycle walk, since each state has exactly two neighbors. In this case, the second term cancels and we are simply left with α as the ratio of the two scores, which is usually simple to compute.

(a) Proposed (b) Accepted

Figure 17: Converting the path Markov chain to uniform. Plot (a) shows the proposed steps and plot (b) shows the accepted samples after 100,000 transitions. How does MCMC change the weights on the original chain?

Recall that in this setting, our main goal is to draw samples from a predefined distribution. Thus, our metric of success is how closely the observed collection of samples matches the distribution proportional to s. One way to measure this is the total variation distance between the two measures. Given two distribution vectors P and Q, the total variation distance between them is

\[ \sum_{i=1}^{n} |P_i - Q_i|. \]

When this value is zero the two distributions are exactly the same. To see that our MCMC method is converging our samples to the proper distribution, we can measure the total variation distance between our empirical draws with MCMC and the theoretical expected distribution that is exactly proportional to our score function. Figure 18 below shows some of these results. The green dots represent the distance between the MCMC distribution and the score distribution, while the blue dots represent the distance between the MCMC distribution and the steady state of the proposal distribution. As expected, the green dots go to zero as the number of steps increases, while the blue dots are bounded away from zero as the chain converges.

Figure 18: MCMC convergence in total variation. These plots show the total variation distance between the MCMC distribution and the score distribution going to zero as the number of steps increases.
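Computing this diagnostic is straightforward. Here is a short Python sketch using the distance exactly as defined above (without the conventional factor of 1/2 that some texts include); the example numbers are placeholders of my own, not measured results.

```python
def total_variation(p, q):
    """Sum of |P_i - Q_i| over all states, as defined above."""
    states = set(p) | set(q)
    return sum(abs(p.get(s, 0.0) - q.get(s, 0.0)) for s in states)

# Hypothetical example: empirical frequencies from a short MCMC run versus
# the target distribution proportional to a score function.
empirical = {'a': 0.07, 'q': 0.55, 'w': 0.22, 'e': 0.08, 's': 0.08}
score = {'a': 1, 'q': 10, 'w': 4, 'e': 1, 's': 1}
target = {k: v / sum(score.values()) for k, v in score.items()}

print(total_variation(empirical, target))  # shrinks as the chain runs longer
```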


6 Lifted Walks

The MCMC techniques discussed above show one way to take samples from a Markov chain and transform them to reach a new stationary distribution. Another similar idea occurs in the setting of lifted Markov chains, where we will use an auxiliary graph to form a new chain on our original state space that converges more rapidly to the desired distribution. The key idea is that in well-behaved graphs (e.g. those with lots of symmetry) we can construct faster mixing chains by lifting to a larger graph where we already understand fast mixing walks. Although this is unintuitive at first glance, as we are moving to a setting with more nodes to try to move faster, this technique has been successfully applied in many combinatorial settings.

The purpose of this section is to explore Example 1.1 presented in the paper Lifting Markov Chains to Speed up Mixing by Chen, Lovász, and Pak. A GitHub repo with the source code for the corresponding Sage widget and some colorful schematic diagrams is here. Before considering the specific example, we give a few words about the lifting procedure for walks on graphs. Given a graph G, our current interest is in drawing samples from the stationary distribution associated to a specified walk on the graph. However, the chain may be slow mixing (i.e. it takes many steps/samples to converge to stationary) and we desire a more rapid procedure for sampling. The plan is to find a different graph, H, along with a projection π from the nodes of the new graph to those of the old. We then form a new chain by lifting from G to H using π⁻¹ and then taking a step on a faster mixing chain M on H before projecting back to G using π. This procedure is summarized in the following diagram:

\[
\begin{array}{ccc}
H & \xrightarrow{\;M\;} & H \\
\uparrow {\scriptstyle \pi^{-1}} & & \downarrow {\scriptstyle \pi} \\
G & \xrightarrow{\;\hat{M}\;} & G
\end{array}
\]

This procedure defines a Markov chain M̂ = π M π⁻¹ on G, and if we choose H, π, and M appropriately we can guarantee both that the stationary distribution of M̂ is equal to that of our desired walk on G and favorably bound the mixing time.

For the specific example considered here G will be the path graph on n nodes, P_n, and H is the cycle graph on 2n − 2 nodes, C_{2n−2}. The simple random walk on the path has steady state proportional to the vector (1, 2, 2, . . . , 2, 1) while the stationary distribution on the cycle is uniform as long as each node has the same probability of stepping to the right⁶. We can choose any projection π : C_{2n−2} → P_n that maps exactly two nodes of the cycle to the interior nodes of the path and exactly one node each to the endpoints of P_n. Notice that given such a π, a sample drawn from the uniform distribution projects to a sample from the stationary distribution on P_n. Figure 19 shows two possible cycle projections where the nodes from the cycle map to the correspondingly colored nodes in the path. Note that the purple and red colors only occur once on each cycle, while each other color appears twice, so that the stationary distribution is hit by the projection.

(a) P6 (b) C10 (c) C10

Figure 19: Permissible projections of the 10 cycle to the 6 path. The nodes on the cycles project to the correspondingly colored node on the graph.

⁶ and hence to the left as well


More formally, we form the lifted Markov chain by repeating the following sequence of steps, assuming that we are currently at node i in the path at step n, so X_n = i.

1. Select a node p in the cycle uniformly from π⁻¹(i)

2. Draw a uniform number z on [0, 1]

3. Move to a neighbor q of p, where σ is the fixed probability of stepping clockwise around the cycle:

\[ q = \begin{cases} p + 1 \pmod{2n-2} & \text{if } z < \sigma \\ p - 1 \pmod{2n-2} & \text{otherwise} \end{cases} \]

4. Project back to the path X_{n+1} = π(q)

5. Repeat with X_{n+1} = π(q)

The schematic in Figure 20 shows an example step of this procedure with G = P_6 and H = C_10 using the projection defined in Figure 19(c).

Figure 20: A sample step in the lifted Markov chain. We are at the green node of the path, so we begin by selecting one of the green nodes in the cycle at random. To complete the step, we move to an adjacent node on the cycle and then project back to the path.

The claim that we actually want to evaluate is that this new walk is faster, even though the cycle graph has approximately twice as many nodes as the path. To empirically demonstrate this, we can run both chains in parallel and compare their convergence times to the stationary distribution using the TV distance introduced above. The figure below shows an example of this, comparing the simple random walk on P_50 to the lifted walk through C_98. As expected, the lifted walk converges to stationary much more rapidly than the original walk. You can try varying the parameters yourself here.
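Here is a Python sketch that follows the five steps above literally, with σ left as a parameter (1/2 by default). The projection map and function name are my own, it only records visit frequencies rather than TV distances, and whether a lifted chain actually mixes faster depends on the choices analyzed by Chen, Lovász, and Pak.

```python
import random

def lifted_path_walk(n, num_steps, sigma=0.5, seed=0):
    """Follow the lifted-walk steps: lift to the cycle C_{2n-2}, take one cycle
    step (clockwise with probability sigma), and project back to P_n."""
    rng = random.Random(seed)
    m = 2 * n - 2                                      # number of cycle nodes

    def project(j):                                    # pi : cycle node -> path node
        return j if j < n else m - j

    preimage = {i: [j for j in range(m) if project(j) == i] for i in range(n)}

    state, visits = 0, [0] * n                         # start at path node 0
    for _ in range(num_steps):
        p = rng.choice(preimage[state])                # 1. lift uniformly
        z = rng.random()                               # 2. uniform z on [0, 1]
        q = (p + 1) % m if z < sigma else (p - 1) % m  # 3. step on the cycle
        state = project(q)                             # 4-5. project and repeat
        visits[state] += 1
    return [v / num_steps for v in visits]

# The stationary distribution on P_50 is proportional to (1, 2, 2, ..., 2, 1).
print(lifted_path_walk(50, 100_000)[:5])
```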

(a) Path Walk (b) Lifted Walk (c) Mixing Time
