Expectation Maximization Algorithm


This post discusses the general form of the EM algorithm, which can be used to train any latent variable model.

1. Jensen's inequality and Kullback-Leibler divergence.

The mathematical definition of a concave function is given in slide 1. This definition leads to Jensen's inequality, shown in slide 2. Jensen's inequality relates the value of a convex function of an integral to the integral of the convex function, and it generalizes to probability theory: for a concave function f(x), Jensen's inequality states that f(E[t]) is greater than or equal to E[f(t)] (see slide 3).
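As a quick sanity check, here is a minimal sketch in Python (assuming the concave function log and a synthetic positive random variable t drawn from an exponential distribution) that verifies f(E[t]) >= E[f(t)] numerically:

import numpy as np

# Synthetic positive random variable t (any positive samples would do here).
rng = np.random.default_rng(0)
t = rng.exponential(scale=2.0, size=100_000)

# For the concave function f(x) = log(x), Jensen's inequality says f(E[t]) >= E[f(t)].
lhs = np.log(t.mean())        # f(E[t])
rhs = np.log(t).mean()        # E[f(t)]
print(lhs, rhs, lhs >= rhs)   # the inequality should hold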




The Kullback-Leibler (KL) divergence measures how close two probability density functions are. KL(q||p) is the expected value, under q(x), of the logarithm of their ratio; in other words, we average log(q(x)/p(x)) over q(x). Mathematically, however, it is not a proper distance, since KL(q||p) is not equal to KL(p||q). Its other properties are that KL(q||q) = 0 and KL(q||p) is always greater than or equal to zero, which follows from the concavity of the logarithm and Jensen's inequality (see slide 4 for the proof).
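To make these properties concrete, here is a small sketch (the two discrete distributions q and p are made up for illustration) that shows the divergence is non-negative, asymmetric, and zero only for identical distributions:

import numpy as np

def kl(q, p):
    # KL(q||p) = sum_x q(x) * log(q(x) / p(x)) for discrete distributions.
    q, p = np.asarray(q, float), np.asarray(p, float)
    return np.sum(q * np.log(q / p))

q = [0.1, 0.4, 0.5]   # example distributions (chosen arbitrarily)
p = [0.3, 0.3, 0.4]

print(kl(q, p))   # >= 0
print(kl(p, q))   # >= 0, but generally different from kl(q, p)
print(kl(q, q))   # exactly 0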

2. General Form of the Expectation Maximization Algorithm.

The problem statement is as follows: we would like to find the maximum likelihood estimate of the parameters theta in a model with a latent variable t. The likelihood in this case is obtained by marginalizing over t. Instead of maximizing this marginal likelihood directly, we use Jensen's inequality to define a lower bound L(theta, q), called the variational lower bound, and maximize L(theta, q).

The optimization alternates between the E-step (finding the q that maximizes the variational lower bound for fixed theta) and the M-step (finding the theta that maximizes the variational lower bound for fixed q), as sketched below.
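In code, the general scheme looks roughly as follows. This is only a sketch: e_step, m_step, and lower_bound are placeholders for the model-specific computations, not functions from any particular library.

def em(X, theta_init, e_step, m_step, lower_bound, tol=1e-6, max_iter=100):
    """Generic EM loop: alternate E-step and M-step until the
    variational lower bound stops improving."""
    theta = theta_init
    prev = -float("inf")
    for _ in range(max_iter):
        q = e_step(X, theta)      # E-step: set q(t) to p(t | X, theta)
        theta = m_step(X, q)      # M-step: argmax_theta E_q[log p(X, t | theta)]
        current = lower_bound(X, q, theta)
        if current - prev < tol:  # the bound is non-decreasing, so this terminates
            break
        prev = current
    return theta, q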



3. E-step and M-step.

Again, the E-step fixes theta and finds the q that minimizes the gap between the log of the marginal likelihood and the variational lower bound, which is equivalent to maximizing the variational lower bound with respect to q. It can be derived that this gap equals the KL divergence between q(t), the distribution over the latent variable t, and the posterior of t given the data and the parameters theta: log p(X | theta) - L(theta, q) = KL(q(t) || p(t | X, theta)). Since the KL divergence is zero exactly when the two distributions coincide, the gap is minimized (and equals zero) when q(t) equals the posterior of t given the data and the parameters theta.
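For a one-dimensional Gaussian mixture model, for instance, the E-step reduces to computing the posterior "responsibilities" p(t = c | x_i, theta). A minimal sketch, assuming mixture weights pi, means mu, and variances var (these names are illustrative, not from the slides):

import numpy as np
from scipy.stats import norm

def e_step(X, pi, mu, var):
    # q(t_i = c) = pi_c * N(x_i | mu_c, var_c) / sum_k pi_k * N(x_i | mu_k, var_k)
    # Returns an array of shape (n_points, n_components).
    likelihood = np.stack(
        [norm.pdf(X, mu[c], np.sqrt(var[c])) for c in range(len(pi))], axis=1
    )
    weighted = likelihood * pi
    return weighted / weighted.sum(axis=1, keepdims=True)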

The M-step fixes the variational distribution q and finds the theta that minimizes the gap between the log of the marginal likelihood and the variational lower bound, which is equivalent to maximizing the variational lower bound with respect to theta. Mathematically, the variational lower bound is the expectation, over the variational distribution q, of the log of the joint probability distribution of the data X and the latent variable T given the parameters, plus a constant that does not depend on theta. This function is usually concave with respect to theta, so the optimization becomes easy.
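Continuing the one-dimensional Gaussian mixture sketch above, the M-step has a closed form: each parameter update is a responsibility-weighted average (again, q, pi, mu, and var are illustrative names):

import numpy as np

def m_step(X, q):
    # q has shape (n_points, n_components); each column holds the
    # responsibilities of one component.
    n_c = q.sum(axis=0)                                   # effective number of points per component
    pi = n_c / len(X)                                     # updated mixture weights
    mu = (q * X[:, None]).sum(axis=0) / n_c               # responsibility-weighted means
    var = (q * (X[:, None] - mu) ** 2).sum(axis=0) / n_c  # responsibility-weighted variances
    return pi, mu, var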


Note that convergence of the EM algorithm is guaranteed by the definitions of the E-step and M-step: the objective function never decreases, although the algorithm may converge to a local maximum or a saddle point.

The EM algorithm is a method for training latent variable models, even with complicated constraints (where stochastic gradient descent may not work well, as discussed before), and it handles missing data. EM involves a sequence of simple tasks (E-steps and M-steps) instead of one single hard task (direct maximization of the marginal likelihood). Because the underlying problem is NP-hard, the main drawback of the EM algorithm is that it may converge only to a local optimum.
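A common practical remedy for local optima is to run EM from several random initializations and keep the best fit; for example, scikit-learn's GaussianMixture exposes this through its n_init argument. A usage sketch on synthetic data (not part of the derivation above):

import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = np.concatenate([rng.normal(-2, 1, 500), rng.normal(3, 1, 500)]).reshape(-1, 1)

# n_init=10 restarts EM from 10 random initializations and keeps the best solution.
gmm = GaussianMixture(n_components=2, n_init=10, random_state=0).fit(X)
print(gmm.means_.ravel(), gmm.weights_)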

4. YouTube tutorials on EM.

1. Expectation Maximization Algorithm:
https://www.youtube.com/watch?v=AnbiNaVp3eQ&list=PLD0F06AA0D2E8FFBA&index=116 

2. Why EM makes sense? 
https://www.youtube.com/watch?v=6JZ-PKpx5Kc&list=PLD0F06AA0D2E8FFBA&index=117 
https://www.youtube.com/watch?v=X9yF2djExhY&index=118&list=PLD0F06AA0D2E8FFBA 

There are also other videos on the EM algorithm from the mathematicalmonk channel.

 
