Bayesian Neural Network

This blog post is about Bayesian neural networks: treating the weights of a network as random variables and using Markov chain Monte Carlo, in particular Langevin Monte Carlo, to approximate the posterior over them.

1. Introduction to Bayesian Neural Networks.

Neural networks are usually treated as deterministic models, but there is a whole line of research that looks at them from the Bayesian perspective. Let's see how we can apply Markov chain Monte Carlo (MCMC) to Bayesian neural networks.

See slide 1, which shows a usual neural network. Each connection has a weight, and these weights are what we train when we fit the network to data. In Bayesian methods we treat these weights as random variables with distributions: we regard w, the vector of weights, as a latent variable, and we make predictions by marginalizing w out. This way, instead of one fixed value for w11, say 3, we have a posterior distribution over w, and it is this distribution that we use to obtain predictions. Inference at test time therefore means considering all possible values of the weights and averaging the predictions with respect to them: there are infinitely many values of w, and for each of them you pass your image through the corresponding neural network and write down the prediction. Then you average all these predictions with weights given by the posterior over w, which tells us how probable each particular w is given the training data set. So you have, in effect, an infinitely large ensemble of neural networks with all possible weights, where the importance of each network is proportional to its posterior probability. This is full Bayesian inference applied to neural networks, and it is how we can bring the benefits of the probabilistic approach into neural networks. The resulting integral is simply the expected value of the network's output with respect to the posterior distribution over w.
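To make the marginalization explicit, the predictive distribution for a new input x* is the following expectation over the posterior on the weights (X_train, Y_train denote the training data):

\[
p(y^\ast \mid x^\ast, X_{\text{train}}, Y_{\text{train}})
  = \int p(y^\ast \mid x^\ast, w)\, p(w \mid X_{\text{train}}, Y_{\text{train}})\, dw
  = \mathbb{E}_{\,p(w \mid X_{\text{train}}, Y_{\text{train}})}\!\left[ p(y^\ast \mid x^\ast, w) \right].
\]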

See slide 2 and think about whether we could use an MCMC procedure to obtain this posterior distribution. Could we approximate these expected values with Gibbs sampling, for example? To solve the problem, let's take your favorite Markov chain Monte Carlo procedure and approximate the expected value by sampling, for example with Gibbs sampling. If we acquire a few samples from the posterior over w, each sample is a full set of weights, i.e. one concrete neural network. If we have, say, 10 samples, then for a new image we can pass it through all 10 networks and average their predictions to approximate the full Bayesian inference integral. And how can we sample from the posterior? Well, we know it up to the normalization constant, as usual: the posterior over w is proportional to the likelihood, basically the predictions of the neural network with parameters w on the training data set, times the prior p(w), which you can define as you wish, for example a standard normal distribution. You would also have to divide by the normalization constant, which is unknown, but that's okay because Gibbs sampling doesn't care, right? So it's a valid approach, but the problem is that Gibbs sampling, or Metropolis-Hastings sampling for that matter, depends on the whole data set to make each of its steps.
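In symbols: the posterior is known only up to its normalization constant, and the predictive integral above is approximated by a finite average over posterior samples w_1, ..., w_S (S = 10 in the example above):

\[
p(w \mid X_{\text{train}}, Y_{\text{train}}) \;\propto\; p(Y_{\text{train}} \mid X_{\text{train}}, w)\, p(w),
\qquad
p(y^\ast \mid x^\ast, X_{\text{train}}, Y_{\text{train}}) \;\approx\; \frac{1}{S} \sum_{s=1}^{S} p(y^\ast \mid x^\ast, w_s).
\]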


2. Langevin Monte Carlo.

Gibbs and Metropolis-Hastings can't do mini-batches. We discussed at the end of the previous video that sometimes Gibbs sampling is okay with using mini-batches to make its moves, but sometimes it is not, and as far as I know, in Bayesian neural networks it is not a good idea to use Gibbs sampling with mini-batches. So we'll have to do something else: when we run our Bayesian neural network on a large data set, we don't want to spend time proportional to the size of the whole data set on each iteration of training. Let's see what else we can do.

Here comes the really nice idea called Langevin Monte Carlo. It goes as follows. Say we want to sample from the posterior distribution p(w | X_train, Y_train). We start from some initial value of the weights w, and then on each iteration we update w to be the previous w, plus epsilon, which plays the role of a learning rate, times the gradient of the logarithm of the posterior, plus some random Gaussian noise.

The first part of this expression is just the usual gradient ascent used to train the weights of a neural network. You can see it clearly if you look at the log posterior log p(w | data): it is, up to an additive constant, the logarithm of the prior plus the logarithm of the conditional distribution p(y | x, w), using the property that the logarithm of a product is the sum of logarithms. The additive constant comes from the normalization constant, but it does not depend on w, so we don't care about it for optimization. In practice the first term, the log prior, if you take the logarithm of a standard normal distribution, is just a constant times the squared Euclidean norm of the weights w, i.e. the usual weight decay that people often use in neural networks. And the second term is the usual cross-entropy, the usual objective people use to train neural networks. So this update is just gradient ascent with step size epsilon applied to find the best possible values of the parameters, except that on each iteration you add Gaussian noise with variance epsilon, i.e. proportional to your learning rate.

If you do that, and if you choose the learning rate to be infinitely small, one can prove that this procedure will eventually generate samples from the desired distribution p(w | data). Basically, if you omit the noise, you have the usual gradient ascent, and with an infinitely small learning rate you will simply converge to the local maximum around the current point. But if you add noise on each iteration, you can theoretically end up at any point in the parameter space, although with higher probability you will end up somewhere around a local maximum. If you do this, you are actually sampling from the posterior distribution: you will end up at points of high probability more often than at points of low probability.

In practice you will never use an infinitely small learning rate, of course. One thing you can do about that is to correct the scheme with Metropolis-Hastings: you say that theoretically I should use an infinitely small learning rate, but I use a finite one like 0.1, so I am sampling from the wrong distribution and have to correct for it.
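Written out (with ε the learning rate; the exact constants in front of the gradient and the noise vary between presentations of Langevin dynamics, so take this as a sketch of the update described above):

\[
w_{t+1} = w_t + \varepsilon \, \nabla_w \log p(w_t \mid X_{\text{train}}, Y_{\text{train}}) + \eta_t,
\qquad \eta_t \sim \mathcal{N}(0, \varepsilon I),
\]
\[
\log p(w \mid X_{\text{train}}, Y_{\text{train}})
  = \log p(Y_{\text{train}} \mid X_{\text{train}}, w) + \log p(w) + \text{const},
\]

where, for a standard normal prior, \( \log p(w) \) contributes the weight-decay term \( -\tfrac{1}{2}\lVert w \rVert^2 \) (up to a constant) and the log-likelihood term is the negative cross-entropy.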
I can then apply a Metropolis-Hastings correction to reject some of the moves, which guarantees that I sample from the correct distribution. But since we want to do large-scale optimization here and work with mini-batches, we will not use the Metropolis-Hastings correction, because it is not scalable; we will just use a small learning rate and hope for the best. This way we will not actually draw samples from the true posterior over w, but we will be close enough if the learning rate is small enough, i.e. close enough to infinitely small.

So the overall scheme is as follows. We initialize the weights of the neural network, then run a number of iterations or epochs of your favorite SGD, but on each iteration we add Gaussian noise with variance equal to the learning rate to the update. Notice also that you can't change the learning rate at any stage of this procedure, or you will break the properties of the Langevin Monte Carlo idea. After a number of iterations, say a hundred, you may say: okay, I believe the chain has converged by now, so let's collect the samples from the following iterations and use them as actual samples from the posterior distribution. That's the usual Markov chain Monte Carlo idea. Finally, for a new point you can just average the predictions of your hundred slightly different neural networks on this new object to get its prediction.

But this is really expensive, right? So there is a really nice and cool idea: use a separate neural network that approximates the behavior of this ensemble. While we are training the Bayesian neural network, we simultaneously use its behavior to train a student network, a usual deterministic one, that tries to mimic the behavior of the Bayesian neural network. There are quite a few details on how to do this efficiently, but it's really cool, so if you're interested in this kind of thing, check it out.
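To make the recipe concrete, here is a minimal, self-contained numpy sketch on a toy logistic-regression "network" (a single linear layer plus sigmoid). The data, prior, learning rate, number of iterations and burn-in length are all made up for illustration, and a real Bayesian neural network would use mini-batch gradients of a deep model instead of the full-batch gradient used here:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: two Gaussian blobs in 2D with binary labels.
X = np.vstack([rng.normal(-1.0, 1.0, size=(100, 2)),
               rng.normal(+1.0, 1.0, size=(100, 2))])
y = np.concatenate([np.zeros(100), np.ones(100)])

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def grad_log_posterior(w, X, y):
    """Gradient of log p(w | data) = log-likelihood + log standard normal prior."""
    p = sigmoid(X @ w)
    grad_loglik = X.T @ (y - p)   # Bernoulli log-likelihood gradient (cross-entropy)
    grad_logprior = -w            # standard normal prior, i.e. weight decay
    return grad_loglik + grad_logprior

eps = 1e-3        # fixed learning rate; it is not changed during sampling
w = np.zeros(2)
samples = []
for t in range(2000):
    # Langevin update: gradient ascent step plus Gaussian noise with variance eps
    # (constants follow the description in the post; textbook presentations differ slightly).
    noise = rng.normal(0.0, np.sqrt(eps), size=w.shape)
    w = w + eps * grad_log_posterior(w, X, y) + noise
    if t >= 1000:             # discard the first half as burn-in
        samples.append(w.copy())

# Bayesian prediction: average the predictions of the sampled "networks".
x_new = np.array([0.5, 0.5])
p_new = np.mean([sigmoid(x_new @ w_s) for w_s in samples])
print("posterior-averaged P(y = 1 | x_new) =", p_new)
```

Keeping a weight snapshot after every post-burn-in iteration plays the role of the "hundred slightly different neural networks" above; in practice you would keep only a thinned subset of snapshots so that prediction stays cheap.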



