Latent Variable Models
1. Latent Variable Models.
A latent variable (from the Latin for "hidden") is a variable that is never observed directly but is instead inferred, through a mathematical model, from the observed variables. The observed variables are also called manifest variables.
Latent variable models are statistical models that relate a set of manifest variables to a set of latent variables. Probabilistic models that are hard to evaluate can often be reformulated as latent variable models that have fewer edges (a simpler model), fewer parameters, and a meaningful latent variable (for example, intelligence can be a latent variable underlying IQ, GPA, and performance in an interview).
2. Probabilistic Clustering.
Let's introduce probabilistic clustering, which provides important examples of latent variable models. Probabilistic models for clustering are also called soft clustering, as opposed to hard clustering, as demonstrated in the slide below.
The advantages of the probabilistic approach to clustering are (1) hyperparameter tuning (a GMM provides a principled way to tune the hyperparameter "number of clusters", which, for example, K-means does not), and (2) a generative model of the data. A sketch of point (1) is given below.
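As a concrete sketch of point (1) (not from the slides; it assumes scikit-learn and toy data), one common recipe is to fit GMMs with different numbers of components and compare them with an information criterion such as BIC:

```python
# Sketch: choosing the number of clusters for a GMM via BIC (assumes scikit-learn).
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Toy data: two well-separated blobs in 2D.
X = np.vstack([rng.normal(0.0, 1.0, size=(200, 2)),
               rng.normal(5.0, 1.0, size=(200, 2))])

bic = {}
for k in range(1, 6):
    gmm = GaussianMixture(n_components=k, random_state=0).fit(X)
    bic[k] = gmm.bic(X)  # lower BIC is better

best_k = min(bic, key=bic.get)
print(bic, "-> chosen number of clusters:", best_k)
```

K-means offers no comparable likelihood-based criterion, which is the sense in which point (1) is an advantage of the probabilistic approach.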
3. Gaussian Mixture Models.
How do we model a cluster with probabilities? Let's try the Gaussian distribution.
Note that a single Gaussian distribution by nature has only one peak, so it is not appropriate for the data shown in the slide. Instead, one can use several Gaussian distributions with different means and covariances. Along this line, the Gaussian Mixture Model (GMM) is a probabilistic model that assumes all data points are generated from a mixture of a finite number of Gaussian distributions with unknown parameters. The GMM is very flexible and powerful, but the number of parameters grows quickly, which can be infeasible in practice; the density and the parameter count are written out below.
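Written out (in my own notation, since the slide itself is not reproduced here), the mixture density is a weighted sum of Gaussian components:

$$ p(x \mid \theta) = \sum_{c=1}^{C} \pi_c \, \mathcal{N}(x \mid \mu_c, \Sigma_c), \qquad \theta = \{\pi_c, \mu_c, \Sigma_c\}_{c=1}^{C}, \qquad \pi_c \ge 0, \;\; \sum_{c=1}^{C} \pi_c = 1. $$

For d-dimensional data and C components this already means C(1 + d + d(d+1)/2) - 1 free parameters (one weight, d mean entries, and d(d+1)/2 covariance entries per component, minus one for the weight constraint), which is why the parameter count becomes a practical concern.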
Then, how do we train a Gaussian Mixture Model? We can formulate maximum likelihood estimation as an optimization problem, with constraints that make sure the definition of a probability distribution is met. See the definition of the GMM and the formulation of the optimization problem in the slide below. Optimization algorithms such as stochastic gradient descent turn out to be ill suited to enforcing the constraint on the covariance matrices (positive definiteness). The expectation-maximization (EM) algorithm, on the other hand, works better for these problems.
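For reference, the maximum-likelihood problem has roughly the following form (my reconstruction of the notation; the slide contains the exact formulation):

$$ \max_{\theta} \;\; \sum_{i=1}^{N} \log \sum_{c=1}^{C} \pi_c \, \mathcal{N}(x_i \mid \mu_c, \Sigma_c) \quad \text{s.t.} \quad \pi_c \ge 0, \quad \sum_{c=1}^{C} \pi_c = 1, \quad \Sigma_c \succ 0. $$

The positive-definiteness constraint on each covariance matrix is exactly what makes plain gradient methods awkward here, whereas EM satisfies it by construction.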
4. Training GMM.
Training a GMM with the EM algorithm requires introducing a latent variable. Again, in a GMM the probability density of the data x is a weighted sum of Gaussian densities. As shown in the slide below, assume there exists a latent variable t connected to the data x. Assume the conditional probability of t taking a specific value c, given the parameters, equals the corresponding mixture weight, and assume the likelihood of the data x given t = c and the parameters is the Gaussian density with the parameters of component c. Then the probabilistic model does not change after marginalizing over t, as spelled out below.
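Concretely, with the assumptions above the marginalization recovers exactly the same mixture density (again in my notation):

$$ p(x \mid \theta) = \sum_{c=1}^{C} p(t = c \mid \theta)\, p(x \mid t = c, \theta) = \sum_{c=1}^{C} \pi_c \, \mathcal{N}(x \mid \mu_c, \Sigma_c). $$

So introducing t does not change the model of x; it only makes the generative story explicit.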
In general, we can say that each data point was generated using some information from a latent variable t. We have one latent variable t for each data point x, and it causes x: it explains something about x. The reasonable assumption here is that t takes one of three values, 1, 2, or 3, telling us from which Gaussian this particular data point came. With this interpretation, the assumptions made above are clearly valid.
The intuition for the EM algorithm is as follows. Since the latent variable t is not observable, we face a chicken-and-egg problem: to infer the latent variables t we need to know the Gaussian parameters, and to estimate the Gaussian parameters we need to know the latent variables. Therefore, in EM we start from random Gaussian parameters and iterate until convergence, fitting the data. The last slide summarizes the EM algorithm.
5. Example of GMM Training.
The EM algorithm is illustrated, and its problem of getting stuck in local maxima is demonstrated, in the slides below. The EM algorithm starts with random Gaussian parameters (slide 1). At iteration 1, for every data point we compute the posterior probability of t taking each value c given the data point and the current Gaussian parameters. This gives, for each data point, the probability that it belongs to each Gaussian. We then update each Gaussian's parameters to fit the data points in proportion to these probabilities, so points assigned to a component with high probability influence its parameters the most. These steps are repeated, as demonstrated in the slides, and sketched in code below.
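As a minimal sketch of these two steps in code (my own NumPy/SciPy implementation, not the course's; the function name em_gmm and the variable resp for the posteriors p(t = c | x) are mine):

```python
# Minimal EM sketch for a GMM (assumes NumPy and SciPy).
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, C, n_iters=50, seed=0):
    rng = np.random.default_rng(seed)
    N, d = X.shape
    # Random initialization of the Gaussian parameters.
    pi = np.full(C, 1.0 / C)
    mu = X[rng.choice(N, size=C, replace=False)]
    sigma = np.array([np.eye(d) for _ in range(C)])

    for _ in range(n_iters):
        # E-step: posterior p(t = c | x_i, theta) for every point and component.
        resp = np.column_stack([
            pi[c] * multivariate_normal.pdf(X, mean=mu[c], cov=sigma[c])
            for c in range(C)
        ])
        resp /= resp.sum(axis=1, keepdims=True)

        # M-step: re-fit weights, means and covariances to the soft assignments.
        Nc = resp.sum(axis=0)                 # effective number of points per component
        pi = Nc / N
        mu = (resp.T @ X) / Nc[:, None]
        for c in range(C):
            diff = X - mu[c]
            sigma[c] = (resp[:, c, None] * diff).T @ diff / Nc[c]
            sigma[c] += 1e-6 * np.eye(d)      # small ridge keeps covariances well-conditioned

    return pi, mu, sigma
```

A fuller implementation would also monitor the log-likelihood and stop when it no longer increases; the fixed iteration count here is only for brevity.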
Still, the EM algorithm can get stuck in local maxima, as shown in the slides. Typically, many random initializations are tried, and the run that reaches the best likelihood after convergence is kept.
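In practice the restarts can be automated; for instance, scikit-learn's GaussianMixture takes an n_init argument that runs EM from several random initializations and keeps the best result (a usage sketch, assuming X is an (N, d) data matrix such as the one in the earlier example):

```python
# Sketch: EM with multiple random restarts via scikit-learn.
from sklearn.mixture import GaussianMixture

gmm = GaussianMixture(n_components=3, n_init=10, random_state=0)
gmm.fit(X)               # X: (N, d) data matrix, assumed to be defined already
print(gmm.lower_bound_)  # log-likelihood lower bound of the best of the 10 runs
```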