Latent Dirichlet Allocation


From Wikipedia: "In natural language processing, Latent Dirichlet allocation (LDA) is a generative statistical model that allows sets of observations to be explained by unobserved groups that explain why some parts of the data are similar. For example, if observations are words collected into documents, it posits that each document is a mixture of a small number of topics and that each word's creation is attributable to one of the document's topics."

1. Topic modelling.

Let's say we would like to build a book recommender system. We want the algorithm to recommend whether we should read Sherlock Holmes, Murder on the Orient Express or The Murder at the Vicarage. Let's also say we can extract features, or topics, from these documents. For example, we can say Sherlock Holmes is 60% detective, 30% adventure and 10% horror. In this way, a document is a distribution over topics. Similarly, in this world of modelling, a topic is a distribution over words! For example, a sports topic can be 20% football, 10% hockey and so on.

In this world of modelling we can compute a similarity measure between two documents based on their topic distributions. The similarity measure can be the Euclidean distance, the cosine similarity, or something else. If two documents are close under such a measure, we can recommend one given the other!
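As a minimal sketch of those two measures in Python (the topic proportions below are just the made-up numbers from above, not output of a real model):

```python
import numpy as np

# Hypothetical topic proportions (detective, adventure, horror) for two books.
sherlock = np.array([0.6, 0.3, 0.1])
orient_express = np.array([0.7, 0.1, 0.2])  # made-up numbers for illustration

def cosine_similarity(a, b):
    # Cosine of the angle between the two topic vectors (1 = same direction).
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

def euclidean_distance(a, b):
    # Straight-line distance between the two topic vectors (0 = identical).
    return np.linalg.norm(a - b)

print(cosine_similarity(sherlock, orient_express))
print(euclidean_distance(sherlock, orient_express))
```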

Now the goals of the recommender system are (1) to construct topics automatically, in an unsupervised way, from a collection of documents, and (2) to assign topics to documents (i.e. map an arbitrary book to a distribution over the topics). The given data would be words (for example, things a person googled on the internet), and we want to use the probabilistic model to predict or recommend a book this person would like to read, given those words.



2. Dirichlet distribution.

Let's find out about the Dirichlet distribution before tackling topic modelling via Latent Dirichlet Allocation. See slide 1 for the form of the Dirichlet distribution, which is a distribution over theta. The vector theta is non-negative and sums to 1, so it lies on the simplex. The parameter alpha is positive and determines the shape of the distribution. As shown in slide 2, an alpha of (0.1, 0.1, 0.1) results in a distribution whose mass is concentrated at the corners of the simplex. On the other hand, an alpha of (10, 10, 10) results in a distribution with its peak at the centre of the simplex. Slide 3 also shows the cases with alpha of (5, 2, 2) and (5, 5, 2); see where the distribution is concentrated on the simplex. The statistics of the Dirichlet distribution (expectation and covariance) are given in slide 4.
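A quick way to see the effect of alpha is to draw samples with NumPy (a small sketch, not taken from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)

# Small alpha: samples pile up near the corners of the simplex
# (one component near 1, the others near 0).
print(rng.dirichlet([0.1, 0.1, 0.1], size=5).round(2))

# Large alpha: samples concentrate near the centre (1/3, 1/3, 1/3).
print(rng.dirichlet([10.0, 10.0, 10.0], size=5).round(2))

# Every sample is non-negative and sums to 1, i.e. it lies on the simplex.
print(rng.dirichlet([5.0, 2.0, 2.0], size=5).sum(axis=1))
```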

One practical example is a massively multiplayer online role-playing game where each player distributes points over strength, stamina and speed. The spread of these choices over all players can be represented with a Dirichlet distribution.

One important property of the Dirichlet distribution, among others, is that it is a conjugate prior to the multinomial likelihood, as shown in the last slide.
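Concretely (this is the standard textbook form, not copied from the slide): if the prior is $\mathrm{Dir}(\theta \mid \alpha)$ and we observe counts $c_k$ of each category $k$ under a multinomial likelihood, then

$$
p(\theta \mid c, \alpha) \;\propto\; \prod_{k=1}^{K} \theta_k^{c_k} \prod_{k=1}^{K} \theta_k^{\alpha_k - 1}
\;=\; \prod_{k=1}^{K} \theta_k^{\alpha_k + c_k - 1},
$$

which is again a Dirichlet distribution, $\mathrm{Dir}(\theta \mid \alpha_1 + c_1, \ldots, \alpha_K + c_K)$.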




In summary, the Dirichlet distribution is a distribution over theta (a point on the simplex) whose shape is controlled by alpha. Its expectation and covariance can be computed as functions of these parameters, and it is the conjugate prior to the multinomial likelihood.
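For reference, the standard expressions (with $\alpha_0 = \sum_k \alpha_k$) are:

$$
\mathbb{E}[\theta_i] = \frac{\alpha_i}{\alpha_0}, \qquad
\mathrm{Var}[\theta_i] = \frac{\alpha_i(\alpha_0 - \alpha_i)}{\alpha_0^2(\alpha_0 + 1)}, \qquad
\mathrm{Cov}[\theta_i, \theta_j] = \frac{-\,\alpha_i\alpha_j}{\alpha_0^2(\alpha_0 + 1)} \quad (i \neq j).
$$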

3. Latent Dirichlet Allocation.

Let's now see what Latent Dirichlet Allocation (LDA) is. Recall that a document is a distribution over topics, and a topic is a distribution over words, as shown in slide 1 with some examples of cats and dogs and their probabilities. In slide 2, let's see how we can generate "Cat meowed on the dog" as an example. In this document the topic cats has probability 80% and the topic dogs has probability 20%. The topic cats assigns, for example, 40% to "cat" and 30% to "meow", while the topic dogs assigns 40% to "dog" and 30% to "woof". "Cat meowed on the dog" is generated simply by sampling from these distributions.
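Here is a small Python sketch of that generative story, using the made-up probabilities above (every number and word here is illustrative, not from a trained model):

```python
import numpy as np

rng = np.random.default_rng(0)

# Document-level topic proportions THETA (illustrative numbers).
topic_proportions = {"cats": 0.8, "dogs": 0.2}

# Per-topic word distributions PHI (illustrative numbers).
word_probs = {
    "cats": {"cat": 0.4, "meow": 0.3, "the": 0.2, "on": 0.1},
    "dogs": {"dog": 0.4, "woof": 0.3, "the": 0.2, "on": 0.1},
}

def generate_word():
    # 1) sample a topic z from the document's topic distribution THETA
    topics = list(topic_proportions)
    z = rng.choice(topics, p=[topic_proportions[t] for t in topics])
    # 2) sample a word from that topic's word distribution PHI_z
    words = list(word_probs[z])
    return rng.choice(words, p=[word_probs[z][w] for w in words])

print([generate_word() for _ in range(6)])
```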

Now the model can be formulated. Define W as the known data, while the unknowns are PHI (the parameters of the distribution over words for each topic), Z (the topic of each word) and THETA (the distribution over topics for each document). Z and THETA are latent variables. The joint probability can be defined as shown in the slides, and the topic proportions THETA are assumed to be Dirichlet distributed.
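Written out (the notation may differ from the slides; this is the usual factorization, for D documents where document d has N_d words):

$$
p(W, Z, \Theta \mid \Phi, \alpha) \;=\; \prod_{d=1}^{D} p(\theta_d \mid \alpha) \prod_{n=1}^{N_d} p(z_{dn} \mid \theta_d)\, p(w_{dn} \mid \phi_{z_{dn}}),
$$

with $\theta_d \sim \mathrm{Dir}(\alpha)$, $z_{dn} \sim \mathrm{Categorical}(\theta_d)$ and $w_{dn} \sim \mathrm{Categorical}(\phi_{z_{dn}})$.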



Now, how can we find these unknown parameters and therefore the probabilistic model? The document below discusses this in detail; we use variational methods to train this unsupervised machine learning problem. In summary, the parameter PHI is found by maximum likelihood estimation of p(W | PHI). The mean field algorithm is applied in the E-step, where the variational family factorizes as q(theta) and q(z), and analytical formulas exist for updating q(theta) and q(z). Alternating the E-step and the M-step then yields the MLE of PHI, and predictions can be drawn following the principles of maximum likelihood.
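In practice one rarely codes these updates by hand; scikit-learn's LatentDirichletAllocation, for example, implements this kind of variational EM scheme. A minimal sketch (the toy documents and the choice of 2 topics are my own, purely for illustration):

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "the cat meowed at the other cat",
    "the dog woofed at the cat",
    "holmes examined the clue at the vicarage",
    "poirot solved the murder on the orient express",
]

# W: bag-of-words counts for each document.
counts = CountVectorizer().fit_transform(docs)

# Variational inference: the E-step updates q(theta) and q(z),
# the M-step updates the topic-word parameters (PHI).
lda = LatentDirichletAllocation(n_components=2, random_state=0)
theta = lda.fit_transform(counts)   # per-document topic proportions (THETA)
phi = lda.components_               # unnormalized topic-word weights (PHI)

print(theta.round(2))
print(phi.shape)
```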
