Markov Chain Monte Carlo
Markov Chain Monte Carlo (MCMC) is the silver bullet of probabilistic modeling. This post explores what MCMC is, introduces how to exploit the specifics of a problem to speed up MCMC, and discusses its limitations.
1. Monte Carlo Estimation.
What is the Monte Carlo estimation method? Monte Carlo estimation is a method in which an expected value is estimated by sampling: draw samples from the distribution p(x), evaluate the function f(x) at each sample, and average the results. The number of samples should be large enough for this average to be a good estimate of the expectation.
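As a minimal sketch (with an assumed example, not taken from the slides), here is how such an estimate looks in NumPy, estimating E[f(x)] for f(x) = x^2 with x drawn from a standard normal, where the true value is 1:

import numpy as np

# Minimal sketch: estimate E[f(x)] by averaging f over samples from p(x).
rng = np.random.default_rng(0)

def monte_carlo_expectation(f, sampler, n_samples=100_000):
    """Average f over samples drawn from the distribution p(x)."""
    samples = sampler(n_samples)
    return np.mean(f(samples))

# f(x) = x**2 and p(x) = N(0, 1); the exact expectation is the variance, 1.
estimate = monte_carlo_expectation(lambda x: x**2, rng.standard_normal)
print(estimate)  # close to 1.0 for a large enough number of samples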
Why do we need to estimate expected values? One example is the M-step of the EM algorithm; another is full Bayesian inference.
2. Sampling from a 1D distribution.
The Monte Carlo estimate requires sampling. Let's look at the discrete case first, taking the discrete distribution in slide 1 as an example. Since the probabilities sum to 1, we can partition the interval from 0 to 1 into segments whose lengths are the probabilities of the discrete values a1, a2, and a3. Drawing a sample from a uniform distribution on [0, 1] and reading off which segment it falls into then yields a sample from the discrete distribution. A 1D discrete distribution with a finite number of values is easy to sample from, provided the number of values is not too large (say, fewer than 100,000).
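A minimal sketch of this idea in NumPy, using made-up values and probabilities for a1, a2, a3 (the inverse-CDF lookup is what the segment-reading step above amounts to):

import numpy as np

# Minimal sketch of sampling a discrete distribution via the cumulative
# probabilities. The values and probabilities are placeholders.
rng = np.random.default_rng(0)

values = np.array([1.0, 2.0, 3.0])   # a1, a2, a3
probs = np.array([0.2, 0.5, 0.3])    # must sum to 1
cdf = np.cumsum(probs)               # [0.2, 0.7, 1.0]

def sample_discrete(n):
    # A uniform draw u in [0, 1) falls into exactly one CDF segment;
    # searchsorted finds the index of that segment.
    u = rng.random(n)
    return values[np.searchsorted(cdf, u)]

print(sample_discrete(10))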
For continuous 1D distributions, let's look at an example where we want to sample from a Gaussian distribution. One can use the central limit theorem, which states that when many independent random variables are added and normalized, their sum tends towards a normal distribution. For instance, summing 12 samples from a uniform distribution on [0, 1] and subtracting 6 gives an approximately standard normal sample; 6 is subtracted because 0.5 (the mean of the uniform distribution) times 12 (the number of samples) equals 6.
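A short sketch of this trick (a classic approximation, not an exact Gaussian sampler):

import numpy as np

# Minimal sketch: approximate a standard normal sample by summing 12
# uniform draws and subtracting 6, as described above.
rng = np.random.default_rng(0)

def approx_standard_normal(n):
    # Each row is 12 independent U(0, 1) draws; their sum has mean 6 and
    # variance 12 * (1/12) = 1, so subtracting 6 roughly standardizes it.
    return rng.random((n, 12)).sum(axis=1) - 6.0

samples = approx_standard_normal(100_000)
print(samples.mean(), samples.std())  # close to 0 and 1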
What if the continuous distribution we want to sample from looks like the one in slide 3? Then we first construct an upper bound on it, for example a Gaussian scaled by a constant (such as 2) so that the scaled Gaussian lies above p(x) everywhere. We sample a point from this Gaussian, draw a uniform height under the scaled envelope at that point, and accept the sample only if the height falls below p(x), the original distribution we wanted to sample from; all other points are rejected. This method is called rejection sampling, and its pros and cons are listed in the last slide.
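Here is a minimal sketch of rejection sampling. The bimodal target p(x) below is a made-up mixture of two Gaussians, not the density from the slides, and the envelope constant M is an assumption chosen so that M * q(x) >= p(x) everywhere:

import numpy as np

rng = np.random.default_rng(0)

def gaussian_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def p(x):  # target density (hypothetical bimodal example)
    return 0.5 * gaussian_pdf(x, -2.0, 0.7) + 0.5 * gaussian_pdf(x, 2.0, 0.7)

def q_pdf(x):  # proposal density: a wide Gaussian, N(0, 3^2)
    return gaussian_pdf(x, 0.0, 3.0)

M = 3.0  # envelope constant; must satisfy M * q(x) >= p(x) for all x

def rejection_sample(n):
    samples = []
    while len(samples) < n:
        x = rng.normal(0.0, 3.0)             # propose from q
        u = rng.uniform(0.0, M * q_pdf(x))   # uniform height under the envelope
        if u < p(x):                         # accept only if below the target
            samples.append(x)
    return np.array(samples)

print(rejection_sample(5))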
3. Markov Chains.
A Markov chain is a stochastic model describing a sequence of possible events in which the probability of each event depends only on the state attained in the previous event. A process satisfies the Markov property if one can make predictions for the future of the process based solely on its present state just as well as one could knowing the process's full history; i.e., conditional on the present state of the system, its future and past states are independent. An example can be found in slide 2, where a frog's position is a random variable taking one of two values: the left lotus or the right lotus.
How can we sample using a Markov chain? To sample from a distribution p(x), one can build a Markov chain whose stationary distribution is p(x). Starting from any initial state x(0), we run the chain for k steps so that eventually x(k) looks like a sample from p(x). A sufficient condition for the chain to converge is that none of the transition probabilities from one state to another is zero; the chain can then reach every state from every other state and does not get stuck in cycles.
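A minimal sketch of the two-state frog example in NumPy; the transition probabilities are made up, but every entry is strictly positive, so the empirical distribution of visited states converges to the unique stationary distribution regardless of the starting state:

import numpy as np

rng = np.random.default_rng(0)

# P[i, j] = probability of jumping from state i to state j
P = np.array([[0.7, 0.3],   # from "left lotus"
              [0.4, 0.6]])  # from "right lotus"

def run_chain(n_steps, x0=0):
    x, states = x0, []
    for _ in range(n_steps):
        x = rng.choice(2, p=P[x])  # next state depends only on the current one
        states.append(x)
    return np.array(states)

states = run_chain(50_000)
burn_in = 1_000
print(np.bincount(states[burn_in:]) / len(states[burn_in:]))
# For this transition matrix the exact stationary distribution is
# pi = [4/7, 3/7] ≈ [0.571, 0.429].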
4. Gibbs Sampling.
In statistics, Gibbs sampling or a Gibbs sampler is an MCMC algorithm for obtaining a sequence of observations approximately drawn from a specified multivariate probability distribution, when direct sampling is difficult. This sequence can be used to approximate the joint distribution (e.g., to generate a histogram of the distribution); to approximate the marginal distribution of one of the variables, or some subset of the variables (for example, the unknown parameters or latent variables); or to compute an integral (such as the expected value of one of the variables). Typically, some of the variables correspond to observations whose values are known and hence do not need to be sampled. See the slide below, which presents the Gibbs sampling algorithm for a 3D joint distribution.
As with other MCMC algorithms, Gibbs sampling generates a Markov chain of samples, each of which is correlated with nearby samples. As a result, care must be taken if independent samples are desired. Generally, samples from the beginning of the chain (the burn-in period) may not accurately represent the desired distribution and are usually discarded. It has been shown, however, that using a longer chain instead (e.g., a chain that is n times as long as the initially considered chain, using a thinning factor of n) leads to better estimates of the true posterior. Thus, thinning should only be applied when time or computer memory is restricted.
For a demonstration of Gibbs sampling, see the slide below. For simplicity, we consider a 2D Gaussian distribution. First, randomly initialize x1 and x2 (slide 1). Then we sample x1 from the conditional distribution of x1 given x2. Next, in the following substep of the same step, we sample x2 from the conditional distribution of x2 given x1. Repeating this for k steps, we eventually obtain samples from the joint distribution.
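A minimal sketch of this demonstration, assuming the 2D Gaussian has zero mean, unit variances, and correlation rho (a made-up choice); for this target the conditionals are known in closed form, x1 | x2 ~ N(rho * x2, 1 - rho^2) and symmetrically for x2 | x1:

import numpy as np

rng = np.random.default_rng(0)
rho = 0.8  # assumed correlation of the 2D Gaussian target

def gibbs_2d_gaussian(n_steps, burn_in=500):
    x1, x2 = 0.0, 0.0  # arbitrary initialization
    samples = []
    for _ in range(n_steps):
        x1 = rng.normal(rho * x2, np.sqrt(1 - rho ** 2))  # sample x1 | x2
        x2 = rng.normal(rho * x1, np.sqrt(1 - rho ** 2))  # sample x2 | x1
        samples.append((x1, x2))
    return np.array(samples[burn_in:])

samples = gibbs_2d_gaussian(20_000)
print(np.corrcoef(samples.T)[0, 1])  # close to rho = 0.8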
5. Metropolis–Hastings.
The Metropolis–Hastings algorithm is an MCMC method for obtaining a sequence of random samples from a probability distribution for which direct sampling is difficult. This sequence can be used to approximate the distribution or to compute an expected value. Metropolis–Hastings and other MCMC algorithms are generally used for sampling from multi-dimensional distributions, especially when the number of dimensions is high. The formal algorithm is shown below.
The demonstration can be found in the slide below, which is self-explanatory.
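To make the algorithm concrete, here is a minimal sketch of random-walk Metropolis–Hastings on the same made-up bimodal target used earlier; only an unnormalized density is needed, and because the Gaussian random-walk proposal is symmetric, the Hastings correction q(x|x')/q(x'|x) cancels:

import numpy as np

rng = np.random.default_rng(0)

def gaussian_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def p_unnorm(x):  # unnormalized target density (hypothetical example)
    return 0.5 * gaussian_pdf(x, -2.0, 0.7) + 0.5 * gaussian_pdf(x, 2.0, 0.7)

def metropolis_hastings(n_steps, step_size=1.0, x0=0.0, burn_in=1_000):
    x, samples = x0, []
    for _ in range(n_steps):
        x_new = x + rng.normal(0.0, step_size)               # propose a local move
        accept_prob = min(1.0, p_unnorm(x_new) / p_unnorm(x))  # acceptance ratio
        if rng.random() < accept_prob:                        # accept or stay put
            x = x_new
        samples.append(x)
    return np.array(samples[burn_in:])

samples = metropolis_hastings(50_000)
print(samples.mean(), samples.std())  # mean near 0, std near sqrt(4.49) ≈ 2.12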