Posts

Showing posts from January, 2018

Bayesian Neural Network

Bayesian Neural Network: This blog post is about Bayesian neural networks. 1. Introduction to Bayesian Neural Network. Neural networks are usually thought of as deterministic models. There is, however, a field of research that looks at neural networks from the Bayesian perspective. Let's see how we can apply Markov chain Monte Carlo to Bayesian neural networks. See slide 1, where one can find the usual neural network. Each connection has a weight, and these weights are trained while fitting the neural network to data. In Bayesian methods we see these weights as random variables with distributions. So we treat the weights w as latent variables, and make predictions by marginalizing w out. This way, instead of a hard-set value for w11 such as 3, we have a posterior distribution over w, which we use to obtain the predictions. So, inference at test time involves considering all possible values of the weights and averaging the predictions…
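As an illustrative sketch (not the post's own code), the prediction-averaging idea can be written in a few lines of NumPy. Real MCMC posterior samples of the weights are replaced here by stand-in Gaussian draws, and the "network" is a single logistic layer, just to keep the example small:

```python
import numpy as np

rng = np.random.default_rng(0)

def predict(x, w):
    """A minimal 'network': logistic regression with weight vector w."""
    return 1.0 / (1.0 + np.exp(-x @ w))

# Pretend these are samples from the posterior p(w | data), e.g. obtained
# with MCMC. Here they are just stand-in draws from a Gaussian.
posterior_samples = rng.normal(loc=[1.0, -2.0], scale=0.3, size=(1000, 2))

x = np.array([0.5, 0.1])

# Bayesian prediction: average over all sampled weight values instead of
# plugging in one hard-set value for the weights.
p = np.mean([predict(x, w) for w in posterior_samples])
print(round(float(p), 3))
```

The average over weight samples is the Monte Carlo approximation of the marginalization integral described above.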

Markov Chain Monte Carlo

Markov Chain Monte Carlo: Markov chain Monte Carlo (MCMC) is the silver bullet of probabilistic modeling. This post explores what MCMC is, introduces how to exploit the specifics of a problem to speed up MCMC, and examines its limitations. 1. Monte Carlo Estimation. What is Monte Carlo estimation? It is a method in which expected values are estimated by sampling: E_p[f(x)] ≈ (1/N) Σ f(x_i), where the x_i are samples drawn from the distribution p(x). The number of samples should be large enough for the estimate to be accurate. Why do we need to estimate expected values? One example is the M-step of the EM algorithm. Another is full Bayesian inference. 2. Sampling from a 1D distribution. The Monte Carlo estimate requires sampling. Let's look at the discrete case first. Take the discrete distribution in slide 1 as an example. As all the probabilities sum to 1, we map the values from 0 to 1 to the corresponding…
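The two ideas above, Monte Carlo estimation of an expectation and sampling a discrete 1-D distribution by splitting the interval [0, 1] into segments, can be sketched as follows (the distributions and numbers are illustrative, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(1)

# Monte Carlo estimate of E_p[f(x)] as (1/N) * sum f(x_i), x_i ~ p.
# Example: f(x) = x**2 with p = standard normal, where E[f] = 1 exactly.
samples = rng.standard_normal(100_000)
estimate = np.mean(samples ** 2)
print(round(float(estimate), 2))  # close to 1.0

# Sampling from a discrete 1-D distribution: since the probabilities sum
# to 1, split [0, 1] into segments of length p_i and see where a uniform
# draw lands (inverse-CDF sampling).
values = np.array([10, 20, 30])
probs = np.array([0.2, 0.5, 0.3])
cdf = np.cumsum(probs)                      # segment boundaries in [0, 1]
u = rng.uniform(size=100_000)
draws = values[np.searchsorted(cdf, u)]     # value whose segment u fell into
```

With enough draws, the empirical frequencies of `draws` match `probs`.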

Latent Dirichlet Allocation

Latent Dirichlet Allocation: From Wikipedia: "In natural language processing, latent Dirichlet allocation (LDA) is a generative statistical model that allows sets of observations to be explained by unobserved groups that explain why some parts of the data are similar. For example, if observations are words collected into documents, it posits that each document is a mixture of a small number of topics and that each word's creation is attributable to one of the document's topics." 1. Topic modelling. Let's say we would like to build a book recommender system. We want the algorithm to recommend whether to read Sherlock Holmes, Murder on the Orient Express, or The Murder at the Vicarage. Let's also say we can extract features, or topics, from these documents. For example, we can say Sherlock Holmes is 60% detective, 30% adventure and 10% horror. In this way a document is a distribution over topics. Similarly, a topic is a distribution over words (in this…
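LDA's generative story, documents as distributions over topics and topics as distributions over words, can be sketched in a few lines (the vocabulary, topic names, and probabilities below are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical vocabulary and topics (illustrative numbers, not from the post).
vocab = np.array(["murder", "clue", "journey", "ghost"])
topic_word = np.array([
    [0.60, 0.30, 0.05, 0.05],  # "detective" topic: a distribution over words
    [0.10, 0.10, 0.70, 0.10],  # "adventure" topic
    [0.10, 0.10, 0.10, 0.70],  # "horror" topic
])

# A document is a distribution over topics, e.g. 60/30/10 as in the post;
# in LDA it is drawn from a Dirichlet prior.
doc_topics = rng.dirichlet([6.0, 3.0, 1.0])

# Generative story: for each word slot, pick a topic, then a word from it.
doc = []
for _ in range(10):
    z = rng.choice(3, p=doc_topics)          # latent topic assignment
    w = rng.choice(vocab, p=topic_word[z])   # word drawn from that topic
    doc.append(w)
print(doc)
```

Fitting LDA runs this story in reverse: given only the documents, it infers the topic-word and document-topic distributions.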

Variational Inference

Variational Inference: 1. Why approximate inference? Analytical inference ("inference" here refers to computing the posterior distribution, not the forward pass at test time of the typical deep learning literature) is easy when conjugate priors exist, and hard otherwise. So, in practice, the posterior is approximated. This can be problematic at times, for example through underestimation of the uncertainty. Variational inference (a form of approximate inference) requires the following steps: select a family of distributions Q (called the variational family), then find the best approximation in Q by minimizing the KL divergence. In variational inference the selection of Q can lead to different results: a bigger Q gives more accuracy but is often difficult to compute (e.g. the set of all possible distributions); a smaller Q gives less accurate results but is often easy to compute. Ideally the true posterior lies in Q, but this is often difficult to know. Not…
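A minimal sketch of the KL-minimization step, assuming a toy Gaussian target and the variational family Q = {N(mu, 1)}, with KL(q || p) estimated by numerical integration on a grid (everything here is made up for illustration):

```python
import numpy as np

# Toy target density p = N(3, 2), kept Gaussian so the answer is checkable.
def log_p(x):
    return -0.5 * ((x - 3.0) / 2.0) ** 2 - np.log(2.0 * np.sqrt(2.0 * np.pi))

xs = np.linspace(-10.0, 20.0, 4001)
dx = xs[1] - xs[0]

# Variational family Q: Gaussians N(mu, 1). For each candidate member,
# estimate KL(q || p) = E_q[log q(x) - log p(x)] on the grid.
def kl_to_p(mu):
    log_q = -0.5 * (xs - mu) ** 2 - 0.5 * np.log(2.0 * np.pi)
    q = np.exp(log_q)
    return float(np.sum(q * (log_q - log_p(xs))) * dx)

mus = np.linspace(-5.0, 10.0, 151)
best_mu = float(mus[np.argmin([kl_to_p(m) for m in mus])])
print(round(best_mu, 1))  # the best N(mu, 1) centers itself on the target
```

Note that the best member of this Q still has standard deviation 1 against the target's 2, a simple instance of the underestimated uncertainty mentioned above.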

Applications and Examples of the EM Algorithm

Applications and Examples of the EM Algorithm: 1. General EM for GMM. Previously we introduced the EM algorithm intuitively for the GMM. Now that we have the general form of the EM algorithm, how do the two coincide? To revisit the GMM see slide 1; otherwise let's examine slide 2. Clearly the procedures are exactly the same! (One can derive analytically that the M-step is the same as updating the Gaussian parameters to fit the points assigned to them.) Follow these links for more detail on how the general form of the EM algorithm is applied to the GMM. Link 1: https://www.youtube.com/watch?v=Rkl30Fr2S38&list=PLD0F06AA0D2E8FFBA&index=119 Link 2: https://www.youtube.com/watch?v=WaKNSBeDLTw&index=120&list=PLD0F06AA0D2E8FFBA Link 3: https://www.youtube.com/watch?v=pOBXsUec0JA&index=121&list=PLD0F06AA0D2E8FFBA Link 4: https://www.youtube.com/watch?v=jv2tfR7tyi0&index=122&list=PLD0F06AA0D2E8FFBA Link 5: https://www.youtube.com/watch?v=x
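A compact 1-D sketch of the two EM steps for a GMM (the toy data and initial parameters are made up): the E-step computes responsibilities and the M-step refits each Gaussian to the points softly assigned to it.

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy 1-D data from two well-separated Gaussians (true means -4 and 4).
x = np.concatenate([rng.normal(-4.0, 1.0, 300), rng.normal(4.0, 1.0, 300)])

# Deliberately rough initial guesses for the two components.
mu = np.array([-1.0, 1.0])
var = np.array([1.0, 1.0])
pi = np.array([0.5, 0.5])

for _ in range(50):
    # E-step: responsibilities q(t_i = c) = p(t_i = c | x_i, theta).
    dens = pi * np.exp(-0.5 * (x[:, None] - mu) ** 2 / var) \
        / np.sqrt(2.0 * np.pi * var)
    resp = dens / dens.sum(axis=1, keepdims=True)

    # M-step: update the Gaussian parameters to fit the points
    # (softly) assigned to each component.
    nk = resp.sum(axis=0)
    mu = (resp * x[:, None]).sum(axis=0) / nk
    var = (resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk
    pi = nk / len(x)

print(np.round(np.sort(mu), 1))  # the fitted means recover the clusters
```

This is exactly the coincidence the post points out: the general E-step/M-step specializes to the familiar GMM updates.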

Expectation Maximization Algorithm

Expectation Maximization Algorithm: The general form of the EM algorithm is discussed in this post, so that any latent variable model can be trained. 1. Jensen's inequality and Kullback-Leibler divergence. Find the mathematical definition of a concave function in slide 1. This definition is related to Jensen's inequality, as shown in slide 2. Jensen's inequality relates the value of a convex function of an integral to the integral of the convex function, and it generalizes to probability theory. There, Jensen's inequality states that for a concave function f(x), f(E[t]) is greater than or equal to E[f(t)] (see slide 3). The Kullback-Leibler (KL) divergence is a measure of how close two probability density functions are. In fact, KL(q||p) is the expected value, under q(x), of the logarithm of their ratio: basically we are averaging the log of the ratio of q(x) and p(x) over q. Yet, mathematically it is not a proper…
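Both facts, Jensen's inequality for the concave logarithm and the definition (and asymmetry) of KL, can be checked numerically on small made-up discrete distributions:

```python
import numpy as np

# Jensen's inequality for the concave log: f(E[t]) >= E[f(t)].
t = np.array([1.0, 2.0, 4.0])
assert np.log(t.mean()) >= np.log(t).mean()

# KL(q || p) = E_q[log(q/p)]: average the log-ratio under q.
q = np.array([0.2, 0.5, 0.3])
p = np.array([0.4, 0.4, 0.2])
kl_qp = float(np.sum(q * np.log(q / p)))
kl_pq = float(np.sum(p * np.log(p / q)))

# Non-negative in both directions, but not symmetric, which is one
# reason KL is not a proper distance.
print(round(kl_qp, 4), round(kl_pq, 4))
```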

Latent Variable Models

Latent Variable Models: 1. Latent Variable Models. A latent ("hidden" in Latin) variable is a variable that is never observed directly but is instead inferred, through a mathematical model, from the observed variables. The observable variables are also called manifest variables. Latent variable models are statistical models that relate sets of manifest variables to sets of latent variables. Probabilistic models that are hard to evaluate can be reformulated as latent variable models with fewer edges (a simpler model), fewer parameters and a meaningful latent variable (e.g., intelligence can be a latent variable underlying IQ, GPA and performance in an interview). 2. Probabilistic Clustering. Let's introduce probabilistic clustering, which provides important examples of latent variable models. Probabilistic models for clustering are also called soft clustering, as opposed to hard clustering, as demonstrated in the slide below. The advantage of the probabilistic…
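A small sketch of the soft- versus hard-clustering distinction, assuming two made-up unit-variance Gaussian clusters: soft clustering returns assignment probabilities for each point, of which a hard label would just be the argmax.

```python
import numpy as np

# Two illustrative 1-D Gaussian clusters with unit variance.
mu = np.array([-2.0, 2.0])
pi = np.array([0.5, 0.5])

def soft_assign(x):
    """Posterior probability of each cluster given the point x."""
    dens = pi * np.exp(-0.5 * (x - mu) ** 2) / np.sqrt(2.0 * np.pi)
    return dens / dens.sum()

print(np.round(soft_assign(0.0), 2))   # ambiguous point: roughly 50/50
print(np.round(soft_assign(-2.0), 2))  # clearly in the left cluster
hard_label = int(np.argmax(soft_assign(-2.0)))  # hard clustering: argmax
```

The cluster assignment here is exactly a latent variable: never observed, but inferred from the manifest data point.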

Conjugate Priors

Conjugate Priors. 1. Analytical inference: See the Bayes formula. The problem is that the evidence is difficult to estimate for analytical inference. One way around this is the maximum a posteriori (MAP) estimate, which is very easy to compute but comes with many problems: 1. Lack of invariance to reparameterization. 2. The MAP estimate cannot be used as a prior. 3. The MAP estimate is an atypical point (the mode). 4. Credible regions cannot be computed. 2. Conjugate distributions. Another approach to avoid computing the evidence is called conjugate distributions. As the likelihood and the evidence are fixed by our model, we can only vary the prior so that it is easier to compute the posterior. If the posterior distributions p(θ|x) are in the same family as the prior distribution p(θ), the prior and posterior are called conjugate distributions, and the prior is called a conjugate prior for the likelihood. For example, the Gaussian family is conjugate to itself (or self-conjugate…
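A sketch of self-conjugacy in action (the numbers are made up): with a Gaussian likelihood of known variance and a Gaussian prior on its mean, the posterior is again Gaussian, so its two parameters follow from a closed-form precision-weighted update and no evidence integral is ever computed.

```python
import numpy as np

rng = np.random.default_rng(4)

# Gaussian prior on the unknown mean, Gaussian likelihood with known variance.
prior_mu, prior_var = 0.0, 10.0
noise_var = 1.0

x = rng.normal(5.0, np.sqrt(noise_var), size=100)  # data with true mean 5
n = len(x)

# Conjugate update: precisions add, and the posterior mean is a
# precision-weighted combination of the prior mean and the data.
post_var = 1.0 / (1.0 / prior_var + n / noise_var)
post_mu = post_var * (prior_mu / prior_var + x.sum() / noise_var)
print(round(float(post_mu), 1))  # pulled almost all the way to the sample mean
```

With 100 observations the prior contributes little, and the posterior variance is much smaller than the prior variance, reflecting what the data taught us.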

Introduction to Bayesian methods

Introduction to Bayesian methods: 1. Bayesian way of thinking: A man is running. Why? Some of the possible explanations are (1) this man is in a hurry, (2) this man is a runner, (3) this man just spotted a dragon, and (4) this man always runs. There are three principles in the Bayesian way of thinking which we can use to decide which of these explanations is the most reasonable: 1. Use prior knowledge about the problem: there are no dragons. 2. Choose the answer that explains the data best: the man is not dressed as a runner. 3. Avoid making extra assumptions: it is crude to assume that any man always runs. Following these principles, answer (1), "this man is in a hurry", should be chosen. 2. Review of probability: The checklist is: 1. What is a probability? 2. What are random variables? 3. What is a probability mass function (PMF)? 4. What is a probability density function (PDF)? 5. What does it mean for the random variables X and Y…