Markov Chain Monte Carlo (MCMC): Non-Technical Overview in Plain English

Sampling > Markov Chain Monte Carlo

What is Markov Chain Monte Carlo?

Markov Chain Monte Carlo is a way to sample from a complicated distribution. What is a “complicated distribution”? One that’s difficult, or near impossible to find probabilities for. These types of problems are called “high-dimensional” problems. One example of a high-dimensional problem is calculating the relative importance of pages on the World Wide Web. Not only are their billions of pages, but they are in a constant state of change. Markov Chain Monte Carlo can solve these types of problems in a reasonable amount of time. Markov Chain Monte Carlo isn’t a specific technique: it’s actually a large class of sampling algorithms that includes:

What is “High Dimensional”?

The key to understanding MCMC is to understand high dimensional space. It’s an entire chapter in some computer science books, so I won’t go into more than just a brief overview here. According to Ramon van Handel, a mathematics professor at Princeton, there is no good answer to the question “What is high-dimensional” [1]? In a sense, high-dimensional could be taken as a synonym for “very complicated,” and includes many problems found in numerous areas of math and science, such as

- High-dimensional data in machine learning, which has a large number of features or attributes.

- Large random structures such as random matrices and random graphs.

- Models of disordered systems in statistical physics, such as Gibbs measures, percolation, and spin glasses.

- Probability in Banach spaces.

- Random codes in information theory.

Randomized algorithms in computer science.

Basically, high-dimensional space is anything beyond the 2-D and 3-D world we’re used to dealing with. High-dimensional space doesn’t behave the way we think it should. For example, a 4-dimensional cube, the hypercube, collapses in on itself as it rotates. Similarly, a high-dimensional unit-radius sphere has a volume of zero; Cubature (numerical integration) is needed to find these volumes. A high dimensional math problem has data that lies on high-dimensional spaces. Therefore, the data doesn’t “behave” the way we think it should. High-dimensions are difficult to comprehend (it’s said that even Einstein had trouble visualizing hypercubes, although he did perform the math behind four and five dimensions) and their space increases exponentially with each additional dimension–making it practically impossible to add any numerical meaning to the results. It follows then, that finding probabilities for high-dimensional data is extremely difficult. The usual probability rules don’t apply, so new ones are needed, like MCMC. Note: If you’re interested in finding out more about high-dimensional space, I recommend reading the first couple of chapters in Foundations of Data Science (downloadable pdf — free).

Markov Chains

Markov chains are used to model probabilities, where all of the information needed to predict future events can be encoded in the current state. You can find a simple, step-by-step example, using two dice to make predictions, in the Hidden Markov Model article. Basically, two colored die represent the current state (red or black) and a colored ball chosen from a bag gives the probability of transitioning to the next state. In real life, things get a little more complicated than two dice and a box of balls. For instance, in speech analysis, the last k syllables spoken can determine the probability of the next word. In speech, the next word depends only upon the current state (i.e. ,what’s being said at the moment). For example, if the current state is your mother saying “Today I went for a…,” you can predict the next word with a reasonably high probability. However, the reality is that in language, the number of possible states for the last k-syllables could reach into the millions, which would be impossible to analyze by hand and difficult for standard computer algorithms to compute. Instead, the information is organized in matrices and probability vectors. Through numerous iterations, a collection of probability vectors forms the Markov chains. More technically, this information can be placed into a matrix and a vector(a column matrix) called probability vectors. Many, many, many iterations later and you have a collection of probability vectors…making up the Markov chains.

A practical example of Markov Chain Monte Carlo

The purpose of Markov Chain Monte Carlo is to sample an extensive sample space, such as the World Wide Web, which contains an immense amount of data. This analysis of a complicated distribution is a key component for search engine algorithms, like Google, to determine the importance of webpages. The process involves a bot embarking on a random walk of the web, following links without any predetermined path. These random walks are logged in matrices, alongside the probabilities of reaching specific pages. Picking a name from a hat or using simple random sampling is not feasible in this scenario. Defining the population in an exponentially growing environment like the current World Wide Web is challenging: even if we manage to define the existing web pages, the continuous generation of new content would mean that the new content has zero probability of being selected — which is statistically unacceptable. Markov Chain Monte Carlo offer an alternative solution.

Why Markov Chain Monte Carlo?

OK, so we have a complicated distribution — like the world wide web. MCMC is a way to efficiently sample from this complicated distribution. Simple random sampling or similar methods (like picking a name from a hat) won’t work, because you won’t be able to easily define your population because it’s growing exponentially. Even if you did find a way to define the pages on the current WWW, the millions of pieces of new content generated in one minute from now will have a zero probability of being chosen. From a statistical standpoint, this is unacceptable. Enter Markov Chain Monte Carlo. The “Monte Carlo” part of MCMC is just a moniker for the type of sampling we’re doing — which is sampling based on random walks. Monte Carlo is a gambling resort in Monaco, so it could just as easily have been called “Markov Chain Las Vegas” or just “Markov Chain Sampling.” The name could just as easily have been “Markov Chain Las Vegas” or “Markov Chain Sampling.” The choice of the name is not important, as long as the method is understood.

MCMC and Limiting Distributions

MCMC uses random walks (essentially a huge random number generator) to estimate probabilities with the same limiting distribution as the random variable you’re trying to simulate. As time n > ∞, a Markov chain has a limiting distribution π = (π_j)_j∈s. A value for π can be found by solving a series of linear equations.

Watch this video on YouTube

References

Van Handel, R. (2016). Probability in High Dimension.