Applications and Examples of the EM Algorithm


1. General EM for GMM

Previously, we introduced the EM algorithm intuitively for the GMM. Now that we have the general form of the EM algorithm, how do the two coincide? To revisit the GMM see slide 1, and then examine slide 2: the procedures are exactly the same! (One can derive analytically that the M-step amounts to updating each Gaussian's parameters to fit the points assigned to it.)
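To make the correspondence concrete, here is a minimal NumPy sketch of the general EM procedure applied to a GMM (the function and variable names are mine, not from the course): the E-step computes the posterior responsibilities q(t_i = c), and the M-step re-fits the weights, means and covariances to the softly assigned points.

import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, n_clusters, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    n, d = X.shape
    # Initialize: uniform weights, means at random data points, identity covariances.
    pi = np.full(n_clusters, 1.0 / n_clusters)
    mu = X[rng.choice(n, n_clusters, replace=False)].astype(float)
    sigma = np.stack([np.eye(d) for _ in range(n_clusters)])

    for _ in range(n_iter):
        # E-step: responsibilities q(t_i = c) = p(t_i = c | x_i, theta).
        resp = np.stack([pi[c] * multivariate_normal.pdf(X, mean=mu[c], cov=sigma[c])
                         for c in range(n_clusters)], axis=1)
        resp /= resp.sum(axis=1, keepdims=True)

        # M-step: re-fit weights, means and covariances to the softly assigned points.
        nk = resp.sum(axis=0)                      # effective number of points per cluster
        pi = nk / n
        mu = (resp.T @ X) / nk[:, None]
        for c in range(n_clusters):
            diff = X - mu[c]
            sigma[c] = (resp[:, c, None] * diff).T @ diff / nk[c] + 1e-6 * np.eye(d)

    return pi, mu, sigma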



Follow these links for more detail on how the general form of the EM algorithm is applied to the GMM.
Link 1: https://www.youtube.com/watch?v=Rkl30Fr2S38&list=PLD0F06AA0D2E8FFBA&index=119
Link 2: https://www.youtube.com/watch?v=WaKNSBeDLTw&index=120&list=PLD0F06AA0D2E8FFBA
Link 3: https://www.youtube.com/watch?v=pOBXsUec0JA&index=121&list=PLD0F06AA0D2E8FFBA
Link 4: https://www.youtube.com/watch?v=jv2tfR7tyi0&index=122&list=PLD0F06AA0D2E8FFBA 
Link 5: https://www.youtube.com/watch?v=xEShcq1qE7Y&list=PLD0F06AA0D2E8FFBA&index=123

2. K-means from a probabilistic perspective.

See the K-means algorithm in slide 1, which we will connect to the general form of the EM algorithm. K-means solves the hard clustering problem by (1) initializing the parameters randomly (just the locations of the clusters) and (2) repeating two steps until convergence: first, find for each data point the closest cluster (by Euclidean distance); second, update each cluster's location to the centroid of the data points assigned to it.
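A compact Python sketch of this procedure (again, the names are illustrative, not from the course):

import numpy as np

def kmeans(X, n_clusters, n_iter=100, seed=0):
    # Initialize cluster centers at randomly chosen data points.
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), n_clusters, replace=False)].astype(float)
    for _ in range(n_iter):
        # Assignment step: index of the closest center for every point.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each center moves to the centroid of its assigned points.
        for c in range(n_clusters):
            if np.any(labels == c):
                centers[c] = X[labels == c].mean(axis=0)
    return centers, labels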

To see K-means from a probabilistic perspective, we argue that K-means is just a special case of the EM algorithm applied to a GMM (or even some other model). As shown in slide 2, assume that all covariance matrices are identity matrices (every Gaussian is a uniform circle with a fixed radius) and that the prior weights pi are uniform. Then mu is the only parameter of our GMM, and the density of each data point, given that we know its cluster, is some normalization constant times the exponent of minus one half times the squared Euclidean distance between x_i and mu_c, where mu_c is the center of Gaussian number c. One can verify that this is exactly the formula for a multivariate Gaussian with identity covariance.

What about the E and M steps for this restricted GMM? On the E-step, the EM algorithm tells us to minimize the KL divergence between the variational distribution q and the posterior distribution of t_i given the data and the parameters. To make the connection to hard clustering, we search not over all possible q but only over a restricted family of simple q's, namely delta functions. The optimal delta function puts all of its probability mass on the class with the highest posterior probability, so by restricting the set of possible q's on the E-step we are approximating the true posterior with a delta function: the optimal q assigns probability one to t_i being equal to some value c_i and probability zero to everything else.

To find the c_i that maximizes the posterior probability of the latent variable t_i given the data and the parameters theta, recall that the posterior is proportional to the joint distribution of x_i and t_i given the parameters; the normalization constant does not matter, because we maximize with respect to the value of t_i and the constant does not change the arg max. The joint equals the likelihood of x_i given t_i = c, which is the exponent of minus one half times the squared Euclidean distance, times the prior pi_c, which we agreed is uniform and therefore does not depend on c either. So we can throw away the normalization 1/Z and pi_c without changing anything, and maximizing the exponent of minus something is the same as minimizing that something. We are left with minimizing one half of the squared Euclidean distance, so the optimal q from the E-step is a delta function centered on the arg min of the Euclidean distance: the closest cluster center gets probability one, and every other cluster gets probability zero.
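Written out in formulas (with identity covariances and uniform priors pi_c, as assumed above), the argument is:

\[
q(t_i = c) \;=\; p(t_i = c \mid x_i, \mu)
\;\propto\; \pi_c\,\mathcal{N}(x_i \mid \mu_c, I)
\;\propto\; \exp\!\Big(-\tfrac{1}{2}\lVert x_i - \mu_c\rVert^2\Big),
\]

so restricting q to delta functions gives

\[
c_i \;=\; \arg\max_c \; p(t_i = c \mid x_i, \mu) \;=\; \arg\min_c \; \lVert x_i - \mu_c\rVert^2,
\]

i.e. each point goes to its nearest cluster center with probability one.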



Similarly, we can work through the M-step to show that K-means is the EM algorithm applied to a restricted GMM. On the M-step of the Expectation Maximization algorithm for the restricted Gaussian mixture model, we have to maximize the expression in slide 1: the sum over the training objects of the expected value, with respect to the variational distribution q(t_i), of the joint log-likelihood of x_i and t_i given the parameters, which here are just the means mu_1 through mu_K (K being the number of clusters). We can maximize this analytically by setting the derivative to zero, which yields exactly the centroid update, so the M-step of the EM algorithm applied to the restricted GMM is the same as the update step of K-means.
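In formulas, using the delta-function q(t_i) from the E-step (so the expectation just picks out t_i = c_i), the M-step objective and its solution are:

\[
\max_{\mu_1,\dots,\mu_K}\; \sum_i \mathbb{E}_{q(t_i)} \log p(x_i, t_i \mid \mu)
\;=\; \max_{\mu_1,\dots,\mu_K}\; \sum_i \Big(-\tfrac{1}{2}\lVert x_i - \mu_{c_i}\rVert^2 + \text{const}\Big),
\]

and setting the derivative with respect to each \mu_c to zero gives

\[
\mu_c \;=\; \frac{1}{\lvert\{i : c_i = c\}\rvert} \sum_{i\,:\,c_i = c} x_i,
\]

which is exactly the centroid update of K-means.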


In summary, K-means is a special case of applying the EM algorithm to a Gaussian mixture model, but this Gaussian mixture model has to be a restricted one, where we assume the covariance matrices to be identity and the prior over clusters to be uniform. Note that K-means is faster (fewer parameters) but less flexible (also fewer parameters) than the full GMM.

3. Probabilistic PCA and EM.

What is PCA? Sometimes two random variables are so strongly related that just one of them suffices to describe both (see slide 1 for an example with temperature and ice cream price; there, the distances from the blue points to the orange line are small, so little information is lost by projecting onto the line). This is the idea of dimensionality reduction, and Principal Component Analysis is one of the most popular techniques: it tries to find the best linear transformation that projects the multidimensional data into a lower dimension while keeping as much information as possible.
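As a concrete illustration of plain PCA, the projection can be computed from the SVD of the centered data; the two-dimensional toy data below is hypothetical and only mimics the temperature / ice-cream-price example from the slide.

import numpy as np

def pca_project(X, n_components):
    # Center the data and take its SVD; rows of Vt are the principal directions.
    mean = X.mean(axis=0)
    Xc = X - mean
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    W = Vt[:n_components].T          # (d, n_components) basis of the subspace
    Z = Xc @ W                       # low-dimensional coordinates
    X_hat = Z @ W.T + mean           # reconstruction back in the original space
    return Z, X_hat

# Hypothetical 2-D data (temperature, ice cream price) projected onto 1-D.
rng = np.random.default_rng(0)
temp = rng.uniform(10, 35, size=100)
price = 0.1 * temp + rng.normal(scale=0.2, size=100)
Z, X_hat = pca_project(np.column_stack([temp, price]), n_components=1)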

What is probabilistic PCA, and why do we need it? Probabilistic PCA is proposed to handle missing values (it is far more robust than plain PCA in this respect). The idea, depicted in slide 3, is to formulate a latent variable t that generates x with some noise (the noise represents the information lost in the projection). Assuming a prior on t, the theory of conjugate priors can be used to compute the posterior and the likelihood in closed form, which gives the probabilistic interpretation of PCA analytically. Yet we still need the EM algorithm to maximize the likelihood, because as soon as the model changes there is no longer an analytical solution.
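Concretely, in the standard formulation of probabilistic PCA (Tipping and Bishop), the model and the conjugate-prior computations mentioned above are:

\[
p(t) = \mathcal{N}(t \mid 0, I), \qquad
p(x \mid t) = \mathcal{N}(x \mid Wt + \mu, \sigma^2 I),
\]

which by conjugacy gives a Gaussian marginal likelihood and a Gaussian posterior,

\[
p(x) = \mathcal{N}(x \mid \mu, WW^{\top} + \sigma^2 I), \qquad
p(t \mid x) = \mathcal{N}\!\big(t \mid M^{-1}W^{\top}(x - \mu),\; \sigma^2 M^{-1}\big),
\quad M = W^{\top}W + \sigma^2 I.
\]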




See the slides below for how the EM algorithm solves probabilistic PCA; they are largely self-explanatory.
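Since the slides themselves are not reproduced in this post, here is a minimal NumPy sketch of the standard closed-form EM updates for probabilistic PCA (the E- and M-steps of the Tipping and Bishop formulation; the names em_ppca, Et, etc. are mine):

import numpy as np

def em_ppca(X, n_components, n_iter=100, seed=0):
    # Model: x = W t + mu + noise, with t ~ N(0, I) and noise ~ N(0, sigma2 * I).
    rng = np.random.default_rng(seed)
    n, d = X.shape
    mu = X.mean(axis=0)
    Xc = X - mu
    W = rng.normal(size=(d, n_components))
    sigma2 = 1.0

    for _ in range(n_iter):
        # E-step: posterior moments of the latent t_i for every data point.
        M = W.T @ W + sigma2 * np.eye(n_components)
        M_inv = np.linalg.inv(M)
        Et = Xc @ W @ M_inv                         # E[t_i], shape (n, n_components)
        sum_EttT = n * sigma2 * M_inv + Et.T @ Et   # sum over i of E[t_i t_i^T]

        # M-step: closed-form updates for W and the noise variance sigma2.
        W_new = (Xc.T @ Et) @ np.linalg.inv(sum_EttT)
        sigma2 = (np.sum(Xc ** 2)
                  - 2.0 * np.sum(Et * (Xc @ W_new))
                  + np.trace(sum_EttT @ W_new.T @ W_new)) / (n * d)
        W = W_new

    return W, mu, sigma2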


 
The probabilistic formulation of PCA allows for missing values, provides a straightforward iterative scheme for large dimensionalities (the EM algorithm!), supports mixtures of PPCA, and allows hyperparameter tuning. (The takeaway lessons!)

In summary, the EM algorithm, which maximizes the likelihood iteratively using a probabilistic interpretation of the model, can be used for GMMs, K-means and probabilistic PCA.
