Variational Autoencoders
1. Scaling Variational Inference and Unbiased Estimates.
See slide 1. Bayesian methods have traditionally been considered suited mostly to small data sets: they are computationally expensive, and their strength is extracting as much information as possible from limited data. This view changed when Bayesian methods met deep learning. The learning goal of this post is presented in slide 2.
The rest of the slides focus on the concept of an estimate being unbiased, since building unbiased estimates of the gradients of a neural network is essential here. An estimator is called unbiased if its expected value equals the true quantity it is meant to approximate. It is sometimes non-trivial to see whether an estimator is unbiased; in such cases one needs to reduce the problem to the expected value of some function, estimated by an average over its samples.
This idea of an unbiased Monte Carlo estimate is illustrated in the slides. The key idea is (a minimal numerical check follows the list):
- Say we want to estimate an expected value with the Monte Carlo method: we compute an average of samples drawn from a distribution p(x).
- The output of the Monte Carlo estimate is itself random, following some distribution (R or G in the slides) depending on how the samples are drawn. If the expected value of R equals the true expected value under p(x), then the Monte Carlo estimate of the expectation is unbiased.
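As a sanity check, here is a minimal numpy sketch (the distribution and function are made up for illustration): each small-sample Monte Carlo average is noisy, but the average over many repetitions matches the true expectation, which is exactly what unbiasedness means.

```python
import numpy as np

rng = np.random.default_rng(0)

# Quantity to estimate: E_p[f(x)] with p = N(2, 1) and f(x) = x**2.
# True value: Var + mean^2 = 1 + 4 = 5.
def f(x):
    return x ** 2

n_estimates, n_samples = 10_000, 10
estimates = np.array([
    f(rng.normal(loc=2.0, scale=1.0, size=n_samples)).mean()
    for _ in range(n_estimates)
])

# Each 10-sample average is noisy, but the mean of many such estimates
# converges to the true value 5 -> the estimator is unbiased.
print(estimates.mean())   # ~5.0
print(estimates.std())    # the noise of a single estimate
```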
2. Modelling the distribution of images.
Let's start by fitting a distribution p(x) to a data set. But first, why would we want to? The answer is:
- Generate new images.
- Detect anomalies and outliers.
- Handle missing data.
- Represent the data in a useful way.
Ok. We have enough motivation, but how do we model the data set? (A small sketch of the last approach follows the list.)
- Try a convolutional neural network that looks at the image and returns the probability of the image; the logarithm of the probability is taken for numerical stability. This approach requires computing a normalization constant that makes the distribution sum to 1 over all possible images in the world, which is simply too expensive.
- Try the chain rule. Any probability distribution can be decomposed into a product of conditional distributions. Applied to natural images: take a 3-by-3-pixel image, enumerate the pixels (row by row is one option), and treat the distribution of the whole image as the joint distribution of its pixels. By the chain rule, this joint decomposes into the marginal probability of the first pixel, times the probability of the second pixel given the first, and so on. A natural way to represent these conditional probabilities is a recurrent neural network that reads the image pixel by pixel and outputs a prediction for the next pixel. This makes the normalization one-dimensional, which simplifies the problem. The downside is that new images are generated one pixel at a time, which is clearly slow for megapixel images.
- One can simplify the previous approach by assuming the pixels are independent. Fitting such a distribution is easy, but the assumption is far too restrictive: if you saw one half of an image, you could probably restore the other half quite accurately, which means the pixels are not independent.
- Another option is the Gaussian mixture model, which is flexible in theory but inefficient in practice for complicated data like natural images: it would require thousands of Gaussians.
- One more thing to try is an infinite mixture of Gaussians, like the probabilistic PCA model covered in week two. The idea is that each object, each image x, has a corresponding latent variable t, the image x is caused by this t, and we marginalize t out. The conditional distribution of x given t is Gaussian, so we have a mixture of infinitely many Gaussians: one for each value of t, mixed with weights p(t). Note that even if the individual Gaussians are factorized, with independent components for each dimension, the mixture is not, so this model is a little more powerful than the Gaussian mixture model.
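To make the last approach concrete, here is a sketch of a toy continuous mixture: a one-dimensional latent t with a standard normal prior and a linear-Gaussian likelihood (all parameter values are made up). It estimates the marginal density p(x) = E_{p(t)}[p(x | t)] by naive Monte Carlo; for real images this is exactly the integral that becomes intractable.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)

# Toy "infinite mixture of Gaussians": t ~ N(0, 1), x | t ~ N(w*t + b, sigma0^2 I).
# Illustrative 2-pixel "image"; w, b, sigma0 are made-up parameters.
w, b, sigma0 = np.array([1.0, -0.5]), np.array([0.2, 0.1]), 0.3

def log_p_x(x, n_samples=100_000):
    # Naive Monte Carlo for p(x) = E_{t ~ p(t)} [ p(x | t) ].
    t = rng.standard_normal(n_samples)
    means = t[:, None] * w + b                          # (n_samples, 2)
    log_lik = norm.logpdf(x, loc=means, scale=sigma0).sum(axis=1)
    m = log_lik.max()                                   # log-mean-exp for stability
    return m + np.log(np.exp(log_lik - m).mean())

print(log_p_x(np.array([0.5, -0.2])))
```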
3. Using CNNs with a Mixture of Gaussians.
Take the continuous mixture of Gaussians (approach 5 in the previous section) with a standard normal prior (no specific reason, just convenience): this forces the latent variables t to be centered around zero with unit variance. The likelihood is then a Gaussian whose parameters depend on t.
So how do we define a parametric way to convert t into the parameters of this Gaussian? If we use a linear function for mu(t), with parameters w and b, and a constant sigma_0 for sigma(t) (sigma_0 can be a learned parameter or simply the identity matrix; it does not matter much), we get the usual probabilistic PCA model. PPCA is really nice, but it is not powerful enough for natural image data. If a linear function is not powerful enough, let's use a convolutional neural network, since CNNs work well on images. So let mu(t) be a CNN applied to the latent code t: it takes t as input and outputs the mean vector of an image. Likewise, sigma(t) is a CNN that takes the latent code as input and outputs the covariance matrix Sigma. This defines the model in parametric form. Let's emphasize the network weights w in all parts of the model definition, since these are what we will train: p(x | w) is a mixture of Gaussians whose parameters depend on the latent variable t through a convolutional neural network.

One problem: if your images are 100 by 100, each image has just 10,000 pixels, which is pretty low resolution, not high-end in any way, yet even then the covariance matrix is 10,000 by 10,000. That is a lot, and it is not reasonable to ask a neural network to output a 10,000-by-10,000 matrix. To get rid of this problem, let the covariance matrix be diagonal: instead of outputting the whole large matrix Sigma, the network produces just the 10,000 values on its diagonal, and these define the normal distribution conditioned on the latent variable t. Our conditional distributions are now factorized Gaussians, with zero off-diagonal elements in the covariance matrix, but that is fine: a mixture of factorized Gaussians is not itself a factorized distribution, so we lose little. The model is now fully defined (see the sketch below).
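Here is a minimal PyTorch sketch of such a decoder; the architecture is an illustrative assumption (a fully connected stand-in for the convolutional network described above), with a mean head and a diagonal log-standard-deviation head.

```python
import torch
import torch.nn as nn

class Decoder(nn.Module):
    """p(x | t, w): a Gaussian with diagonal covariance, whose mean and
    per-pixel standard deviation are produced by a network."""
    def __init__(self, latent_dim=50, n_pixels=100 * 100):
        super().__init__()
        self.backbone = nn.Sequential(            # stand-in for a deconv CNN
            nn.Linear(latent_dim, 400), nn.ReLU(),
        )
        self.mu_head = nn.Linear(400, n_pixels)         # mean image mu(t)
        self.log_sigma_head = nn.Linear(400, n_pixels)  # diagonal log-std

    def forward(self, t):
        h = self.backbone(t)
        mu = self.mu_head(h)
        sigma = torch.exp(self.log_sigma_head(h))  # positive by construction
        return mu, sigma

decoder = Decoder()
t = torch.randn(8, 50)       # a batch of latent codes
mu, sigma = decoder(t)       # each of shape (8, 10000)
```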
Now we have to train the model somehow. The natural way is maximum likelihood estimation: maximize the density of the data set with respect to the parameters of the convolutional neural network. This density can be rewritten as an integral where we marginalize out the latent variable t. Since we have a latent variable, a natural choice is the expectation maximization algorithm, invented specifically for this kind of model. As in week two, we build a lower bound on the logarithm of the marginal likelihood log p(x | w), a bound that depends on w and on new variational parameters q, and we maximize this lower bound with respect to both w and q, pushing it as high, and hence as close to the actual marginal log-likelihood, as possible.

The problem is the E-step of the plain expectation maximization algorithm: there we must find the posterior distribution over the latent variables, which is intractable here because the required integrals contain convolutional neural networks and are too hard to compute analytically. So plain EM is not the way to go. What else can we do? In the previous week we discussed Markov chain Monte Carlo, and we can use MCMC to approximate the expectation in the M-step: instead of the expected value with respect to q, the posterior over latent variables from the previous iteration, we use an average over samples and maximize that. It is an option, but it will be slow: on each EM iteration you must run hundreds of iterations of a Markov chain, wait until it has converged, and only then collect samples. You end up with a nested loop of EM iterations and MCMC iterations, which will probably not be fast (a toy sketch of this approximation follows).

Alternatively, we can try variational inference: maximize the same lower bound, but restrict the distribution q to be factorized. For example, if the latent variable of each data object is 50-dimensional, then q_i(t_i) is a product of 50 one-dimensional distributions. It is a nice approach; it approximates expectation maximization and usually works pretty fast. But it turns out that in this case even this is intractable: the factorized approximation is not enough to get an efficient training method for our latent variable model, and we have to accept an even less accurate approximation to train this kind of model efficiently.
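For illustration, here is a toy sketch of the MCMC-based approximation on a one-dimensional model (the model and all values are made up): a random-walk Metropolis sampler targets the posterior p(t | x), and the M-step expectation is replaced by an average over its samples.

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy latent-variable model: t ~ N(0, 1), x | t ~ N(w * t, 1), scalar x observed.
w_param, x_obs = 1.5, 2.0

def log_joint(t):
    # log p(x, t) = log p(x | t) + log p(t), up to additive constants
    return -0.5 * (x_obs - w_param * t) ** 2 - 0.5 * t ** 2

def metropolis(n_steps=20_000, step=0.5):
    # Random-walk Metropolis targeting p(t | x), which is proportional to p(x, t).
    t, samples = 0.0, []
    for _ in range(n_steps):
        prop = t + step * rng.standard_normal()
        if np.log(rng.random()) < log_joint(prop) - log_joint(t):
            t = prop
        samples.append(t)
    return np.array(samples[5_000:])     # drop burn-in

samples = metropolis()
# M-step objective E_{p(t|x)}[log p(x | t, w)] approximated by a sample average:
mstep_objective = (-0.5 * (x_obs - w_param * samples) ** 2).mean()
print(mstep_objective)
```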
4. Scaling Variational EM.
Let's see how to improve variational inference so that it applies to our latent variable model. Again, the idea of variational inference is to maximize a lower bound on the quantity we actually want to maximize, subject to the constraint that the variational distribution q for each object is factorized, i.e. a product of one-dimensional distributions. Note that each object has its own individual variational distribution q_i, and these distributions are not connected in any way.

One idea is the following: if requiring each q_i to be factorized is not enough, approximate further and require it to be a factorized Gaussian. This way everything should be easier, right? Every object has its own latent variable t_i, whose variational distribution q_i is a Gaussian with parameters m_i and s_i, and we maximize our lower bound with respect to these parameters. It is a nice idea, but we have just added a lot of parameters per training object: if the latent variable is 50-dimensional, that is 50 numbers for the vector m_i plus 50 for the vector s_i, so 100 parameters per training object. With a million training objects, adding 100 million parameters just because of an approximation is not a very good idea: the model will probably overfit and be really hard to train. It is also not obvious how to find m and s for new objects at inference time, for prediction or generation: for each new object you must solve an optimization problem again, which can be slow.

So: approximating the variational distribution with a factorized one is not enough; approximating its factors with Gaussians is nice, but gives too many parameters per object, because each object's Gaussian is separate. Let's connect the variational distributions q_i of individual objects. One way is to say they are all the same, all q_i equal to each other; we could do that, but it is too restrictive, and we would not train anything meaningful. A better approach: let all q_i share the same form, but depend on x_i through shared weights. Say each q_i is a normal distribution whose parameters somehow depend on x_i. Now every q_i is different, but they all share the same parameterization, and even for a new object we can easily find its variational approximation q by passing the object through the functions m and s to get the parameters of its Gaussian. We then maximize the lower bound with respect to the original parameters w and the new parameter phi, which defines the parametric mapping from x_i to the parameters of the distribution. And how can we define this function m of x_i with parameters phi?
Well, as we have already discussed, convolutional neural networks are a really powerful tool for images, so let's use them here too: a CNN with parameters phi looks at the original input image, for example of a cat, and transforms it into the parameters of the variational distribution (a minimal encoder sketch follows below). This defines how we approximate the variational distribution q.

Now let's look closer at the objective we are trying to maximize. Recall that the lower bound is, by definition, a sum over the objects in the data set of expected values of a logarithm with respect to the variational distribution q_i. In the plain expectation maximization algorithm it was really hard to approximate this expected value by sampling, because the q inside the expectation was the true posterior over the latent variable t_i, which is complicated and known only up to a normalization constant; sampling from it requires slow Markov chain Monte Carlo. But now we approximate q with a Gaussian whose parameters we know how to obtain: for any object, pass it through the CNN with parameters phi to get m and s, then sample from this Gaussian easily. So the intractable expected value can now be approximated cheaply by sampling, because sampling from Gaussians is cheap. And recall that the model p(x_i | t_i) is itself defined by another convolutional neural network.

The overall workflow is as follows. Start with a training image x; pass it through the first neural network with parameters phi; get the parameters m and s of the variational distribution q_i; sample one point from this distribution (something random, which differs depending on the random seed); then pass this sampled latent vector t_i into the second part of the network, the CNN with parameters w. This second CNN outputs a distribution over images, and we will train the whole structure to return images as close to the input images as possible. This looks really close to what neural networks people call an autoencoder: a network trained to output something as close as possible to its input. Our model is called a variational autoencoder because, in contrast to the usual autoencoder, it has sampling inside and variational approximations. The first part of the network is called the encoder, because it encodes images into a latent code, or rather a distribution over latent codes; the second part is called the decoder, because it decodes the latent code into an image.
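Here is a matching encoder sketch, under the same illustrative assumptions as the decoder above (a fully connected stand-in for the CNN): it maps an image to the mean and standard deviation of the factorized Gaussian q.

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """q(t | x, phi): a factorized Gaussian N(m(x), diag(s(x)^2))."""
    def __init__(self, n_pixels=100 * 100, latent_dim=50):
        super().__init__()
        self.backbone = nn.Sequential(            # stand-in for a conv CNN
            nn.Linear(n_pixels, 400), nn.ReLU(),
        )
        self.m_head = nn.Linear(400, latent_dim)       # mean m(x)
        self.log_s_head = nn.Linear(400, latent_dim)   # log of std s(x)

    def forward(self, x):
        h = self.backbone(x)
        return self.m_head(h), torch.exp(self.log_s_head(h))

encoder = Encoder()
x = torch.rand(8, 100 * 100)       # a batch of flattened images
m, s = encoder(x)
t = m + s * torch.randn_like(s)    # one sample from q(t | x) per object
```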
What happens if we forget about the variance in the variational distribution q? Say we set s(x) to always be zero. Then the variational distribution q_i is deterministic: it always outputs the mean value m(x_i), which is passed directly into the decoder, and we recover the usual autoencoder with no stochastic elements inside. So the variance of the variational distribution q is exactly what makes this model different from the usual autoencoder.

Now let's look a little closer at the objective we are maximizing. The variational lower bound decomposes into a sum of two terms, because the logarithm of a product is the sum of logarithms. The second term equals minus the Kullback-Leibler divergence between the variational distribution q_i and the prior distribution p(t_i), just by definition. KL divergence, discussed in weeks two and three, measures a kind of difference between distributions, so maximizing minus the KL means minimizing the KL: we push the variational distribution q_i as close to the prior as possible, and the prior is just the standard normal, as we decided (a closed-form expression for this Gaussian KL term is sketched below). The first term can be interpreted as follows: if for simplicity we set all the output variances to 1, the log-likelihood of x_i given t_i is just minus the squared Euclidean distance between x_i and the predicted mu(t_i). So this term is a reconstruction loss: it pushes x_i as close to its reconstruction as possible, where mu(t_i) is the mean output of the decoder. Viewed as a whole, the variational autoencoder takes an image x_i as input and outputs mu(t_i) plus some noise; if the noise is constant, training just makes x_i as close to mu(t_i) as possible, which is basically the objective of the usual autoencoder. Note also that the reconstruction loss sits inside an expectation with respect to q_i, which approximates the posterior over latent variables: for the latent variables t_i that are likely to have caused x_i according to our approximation q_i, we want the reconstruction to be accurate. This is not identical to the usual autoencoder, but really close.

The second part is what makes the difference. The Kullback-Leibler divergence pushes q_i to be non-deterministic, stochastic. Recall that setting the variance of q_i to zero recovers the usual autoencoder; why would the model not choose that while training, given that reducing the noise inside makes training easier? Because of this regularization: the KL divergence will not allow q_i to become deterministic, since with zero variance the KL term is infinite, and such parameter settings will never be chosen. This regularization forces the overall structure to keep some noise inside.
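The KL term against a standard normal prior has a well-known closed form for a factorized Gaussian; here is a small sketch (diagonal case, per-object reduction assumed):

```python
import torch

def kl_to_standard_normal(m, s):
    """KL( N(m, diag(s^2)) || N(0, I) ), computed per object.

    Closed form: 0.5 * sum_j (s_j^2 + m_j^2 - 1 - log s_j^2).
    It blows up as any s_j -> 0, which is what keeps q stochastic.
    """
    return 0.5 * (s ** 2 + m ** 2 - 1.0 - torch.log(s ** 2)).sum(dim=-1)

m = torch.tensor([[0.0, 0.0], [1.0, -1.0]])
s = torch.tensor([[1.0, 1.0], [0.1, 0.5]])
print(kl_to_standard_normal(m, s))   # first row: exactly 0 (q equals the prior)
```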
Also, because of this KL divergence, because we force q_i to be close to the standard Gaussian, we can detect outliers. For a usual image from the training data set, or something close to it, the encoder outputs a distribution q_i close to the standard Gaussian, because during training we forced all those distributions to lie close to the standard Gaussian. But for a new image the network never saw, say of some suspicious behavior, the encoder was never trained on anything like it, so it can output a distribution over t_i as far from the Gaussian as it wants; it was never trained to keep such distributions close. So by looking at the distance between the variational distribution q_i and the standard Gaussian, you can measure how anomalous a point is and detect outliers.

Note also that it is easy to generate new points, nearly to hallucinate new data, in this kind of model. Because the model is defined as an integral with respect to p(t), you can make a new image in two steps: first sample t from the prior, the standard normal, and then pass this sample through the decoder network to decode the latent code into an image (a two-line sketch follows). You will get new samples, fake images.
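A two-line generation sketch, reusing the illustrative decoder defined earlier:

```python
import torch

# Generation with a trained VAE: 1) sample t from the prior, 2) decode it.
with torch.no_grad():
    t_new = torch.randn(1, 50)           # t ~ N(0, I), the prior
    mu_new, sigma_new = decoder(t_new)   # parameters of p(x | t)
    x_new = mu_new                       # mean image (or sample around it)
```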
5. Gradient of Decoder
In the previous section, we completely defined our model. All that is left is to understand how to maximize the objective with respect to the weights of both neural networks, w and phi. Since the objective has an expected value inside, we will have to approximate it with Monte Carlo somehow. Let's look closer. The KL part is easy: it is just the KL divergence between a Gaussian with known parameters and the standard Gaussian, so although it contains an integral, we can compute it analytically (see the closed form above). This expression causes no trouble, either in evaluating it or in finding gradients with respect to parameters: we write down the analytical formula and let TensorFlow think about the gradients.

So let's look a little closer at the first term of the expression; call it f(w, phi). This function is a sum over objects of expected values of log-probabilities, where, as we decided, each q_i is the distribution over t_i given x_i and phi defined by the encoder network with parameters phi. Consider the gradient of this function with respect to w. Writing out the expected value by definition: the latent variable t_i is continuous, so the expected value is an integral of the density q times the function log p(x_i | t_i). We can move the gradient sign inside the summation, since summation and differentiation do not interfere with each other, and for smooth, well-behaved functions we can usually also swap the integration and differentiation signs. Finally, since q(t_i | x_i, phi) does not depend on w, it is just a constant with respect to this gradient, and we can push the gradient all the way to the logarithm; the gradient of the logarithm is simply multiplied by this constant value. What we obtain is exactly an expected value of a gradient: a sum over the objects in the data set of expected values of the gradient of the logarithm. And an expected value can be approximated by sampling: draw one point, for example, from the distribution q(t_i), plug it into log p(x_i | t_i), and compute the gradient with respect to w.

So in practice, we pass our image through the encoder to get the parameters of the variational distribution q(t_i); we sample one point t_i-hat from this distribution; we feed this point as input into the second network with parameters w; and we compute the usual gradient of this second neural network with respect to its parameters, given that its input is the sample t_i-hat. This is just an ordinary gradient, and TensorFlow can find it automatically. Finally, the sum depends on the whole data set, but we can easily approximate it with a mini-batch: a normalization constant times the sum over a random subset of objects chosen for this particular iteration.
And this is standard stochastic gradient training for a neural network, so you do not have to think too hard here: just let TensorFlow find the gradient of the second part of the network with respect to its parameters. The overall scheme is as follows. Take the objective; pass the input image through the first convolutional neural network with parameters phi; find the parameters m and s of the variational distribution q; sample one point from this Gaussian with parameters m and s; put this point t_i-hat into the second convolutional neural network with parameters w, treating it as an input, as a training object for that second network; compute the objective of this second CNN; and use TensorFlow to differentiate it with respect to the parameters (a sketch of this estimator follows). Note that throughout we used unbiased estimation of the expected values: we always substituted expected values with sample averages, never with more complicated expressions whose bias is unclear. So everything here is unbiased, on average this stochastic approximation of the gradient is correct, and with enough iterations you will converge to a good point in the parameter space.
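Here is a sketch of this estimator, reusing the illustrative encoder and decoder from earlier and assuming unit output variance, so that the negative log-likelihood reduces to a squared error; only the decoder weights w are updated here.

```python
import torch

# One stochastic-gradient step for the decoder weights w.
opt_w = torch.optim.Adam(decoder.parameters(), lr=1e-3)

x = torch.rand(8, 100 * 100)          # mini-batch of flattened images
with torch.no_grad():                  # in this sketch we only train w, not phi
    m, s = encoder(x)
t_hat = m + s * torch.randn_like(s)    # one sample t_i-hat ~ q(t_i | x_i)

mu, _ = decoder(t_hat)
# Unbiased Monte Carlo estimate of -E_q[log p(x | t, w)], up to constants:
loss = 0.5 * ((x - mu) ** 2).sum(dim=1).mean()
opt_w.zero_grad()
loss.backward()                        # gradient of log p(x | t_hat, w) wrt w
opt_w.step()
```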
6. Log Derivative Trick
Okay, so now let's discuss how to find the gradient with respect to the parameters phi. Here is our objective, and we want to differentiate it. Again, rewrite the expected value by definition as an integral of a density times the logarithm, and again move the gradient sign inside the summation, which changes nothing, and, for smooth and well-behaved functions, inside the integral. However, in contrast to the previous section, we cannot push the differentiation sign all the way to the logarithm: the gradient of log p(x_i | t_i) with respect to phi is zero, since it does not depend on phi, which would make the right-hand side zero, and that is obviously not what the left-hand side is. The reason we cannot do it is that q itself depends on phi, so we have to differentiate q. And if we do that, the problem is that we no longer have an expected value: the first equation on the slide is a sum of integrals of the gradient of q times log p, and this is not an expected value with respect to any distribution. A gradient of a distribution is not a distribution, and a logarithm of a distribution is also not a distribution, so we cannot approximate this thing with Monte Carlo by sampling from anything.

So how can we approximate this gradient? One thing we can do is artificially introduce a distribution: multiply and divide by q. Then we can treat this q as the density, and (gradient of q times log p, divided by q) as the function whose expected value we compute. Simplifying a little, the gradient of q divided by q is just the gradient of log q, by the derivative of the logarithm. So we can rewrite the formula as an integral of q times the gradient of log q times log p. This is an exact expression; we lost nothing to any kind of approximation. And the last expression is an expected value with respect to q of the gradient of log q times log p. This is sometimes called the log-derivative trick, and it works for any distribution: it lets you differentiate an expected value even when the gradient of the expected value is not an expected value itself. Now we have an expected value again, we can sample from q and approximate the gradient with Monte Carlo. It is a valid approach, and until recently people used it, and it kind of worked. But the problem is that although the expected value is exact, its Monte Carlo approximation is really loose: the variance is high, and you need lots and lots of samples to get a gradient approximation that is even a little bit accurate. The reason is the factor log p(x_i | t_i).
When we start training, this p(x) is very low: p(x) is a distribution over natural images and has to assign some probability to every possible image, so at the start, when the model knows nothing about the data, any image is really improbable according to the model. The log-probability may be around -1,000,000 or so; the model has not yet adapted to the training data and thinks the training images are really, really improbable. This means we are computing an expected value of something times -1,000,000. And because the other factor, the gradient of log q, can be positive or negative, a few Monte Carlo samples give values like -1,000,000, plus 900,000, minus 1,100,000, and so on: really high absolute values with different signs, even though on average they might be around, say, 100, which is the exact value of the gradient in this example. The estimate is correct in expectation, but the variance is so high that you need an enormous number of samples to approximate the gradient accurately. Note that we did not have this problem in the previous section, because instead of log p we had the gradient of log p, and even when log p is around -1,000,000, its gradient will probably not be that large. So this is the problem, and the next section presents one nice solution for this particular case: how can we estimate this gradient with a small-variance estimator? (A sketch of the high-variance estimator itself follows.)
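For comparison, here is a sketch of the log-derivative estimator itself, under the same illustrative encoder/decoder assumptions. The surrogate `q.log_prob(t_hat) * log_p` has the gradient E_q[grad log q times log p] derived above, and it is exactly this estimator that suffers from high variance in practice.

```python
import torch

# Log-derivative (REINFORCE-style) gradient estimate for phi.
x = torch.rand(8, 100 * 100)
m, s = encoder(x)
q = torch.distributions.Normal(m, s)

t_hat = q.sample()                     # no gradient flows through the sampling
with torch.no_grad():
    mu, _ = decoder(t_hat)
    log_p = -0.5 * ((x - mu) ** 2).sum(dim=1)   # log p(x | t), up to constants

# Surrogate loss whose gradient wrt phi is E_q[ grad log q(t_hat) * log p ]:
surrogate = -(q.log_prob(t_hat).sum(dim=1) * log_p).mean()
surrogate.backward()                   # fills the encoder's .grad with the estimate
```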
7. Reparameterization Trick
Let's return to our problem of estimating the gradient of the objective with respect to the parameters phi. In the previous section we discussed that the log-derivative trick gives a stochastic approximation of this gradient, but the variance of this approximation is really high, so it is really inefficient to use it for training the model. Here is a really nice, simple, and brilliant idea for making this approximation much better.

First recall that t_i is a sample from the distribution q(t_i | x_i, phi). Let's make a change of variables: instead of sampling t_i directly, we sample a new variable epsilon_i from the standard normal and then build t_i from it, multiplying epsilon_i elementwise by the standard deviation s_i and adding the mean m_i. The distribution of the expression epsilon_i * s_i + m_i is the same as q, the same as t_i. So instead of sampling t_i from q, we can sample epsilon_i and apply the deterministic function g(epsilon_i, x_i, phi) = epsilon_i * s(x_i; phi) + m(x_i; phi) to get a sample from the actual distribution of t_i. Now we can rewrite the objective: instead of computing the expected value with respect to the distribution q over t_i, we compute it with respect to the distribution of epsilon_i, using the function g(epsilon_i, x_i, phi) in place of t_i everywhere. This is an exact expression; we lost nothing, we just changed variables. Note that g depends on x_i and phi: to convert epsilon_i, it passes the image x_i through the convolutional network with parameters phi to get s_i and m_i, then combines them with epsilon_i. And now we can push the gradient sign inside the expected value, past the density of epsilon_i, because that density does not depend on the parameters phi we are differentiating with respect to. So we have an expected value of an expression, obtained naturally, without ever introducing artificial distributions as in the previous section. The expectation is now with respect to the distribution of epsilon_i, a standard normal without any parameters, so we can approximate it with a sample from the standard normal. Ultimately, we have rewritten the gradient of the objective with respect to phi as a sum over objects of an expected value, under the standard normal, of the gradient of a function, which is just the standard gradient of the whole neural network. The picture can now be redrawn as follows: take an input image x, pass it through the convolutional neural network with parameters phi, compute the variational parameters m and s, then sample one vector epsilon from the standard normal distribution.
Then use these three values, m, s, and epsilon, to deterministically compute t_i, and put this t_i into the second convolutional neural network. When you define the model like this, there is only one place with stochastic units: the epsilon_i from the standard normal distribution. This way you can differentiate the whole structure with respect to phi and w without trouble: just use TensorFlow, and it will find the gradients with respect to all the parameters, because the gradients no longer have to pass through the sampling; sampling is now an outside procedure feeding deterministic functions. This is the implementation of the theory we have just discussed, and it is called the reparameterization trick. We approximate the gradients by sampling just one point and taking the gradient of the log of the composite function, where log p(x_i | g(epsilon_i, x_i, phi), w) is just the full neural network with both encoder and decoder (a complete training-step sketch follows).

To summarize: we have built a model that can fit a probability distribution p(x) to complicated, structured data, for example images. It uses an infinite mixture of Gaussians, but the parameters of these Gaussians are defined by a neural network whose weights are trained with variational inference. For learning, we cannot use the usual expectation maximization, because the posterior is intractable, and we cannot use variational expectation maximization either, because it is also intractable here. So we derived a kind of stochastic version of variational inference. It is applicable, first of all, to large data sets, because we can use mini-batches, and second, to models where the usual variational inference fails, because the neural networks inside make every integral intractable. The resulting model is called the variational autoencoder: it is like the plain, usual autoencoder, but it has noise inside, and it uses regularization to make sure the noise stays, choosing the right amount of noise to use. It can be used, for example, to generate nice images, to handle missing data, or to find anomalies in the data, and so on.
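Putting it all together, here is a sketch of one full training step under the same illustrative assumptions as before (fully connected stand-ins for the CNNs, unit decoder variance): the only randomness is epsilon, so gradients flow through t into both networks.

```python
import torch

# One full VAE training step with the reparameterization trick.
params = list(encoder.parameters()) + list(decoder.parameters())
opt = torch.optim.Adam(params, lr=1e-3)

x = torch.rand(8, 100 * 100)             # mini-batch of flattened images
m, s = encoder(x)                         # q(t | x, phi)
eps = torch.randn_like(s)                 # the only stochastic unit
t = m + s * eps                           # t = g(eps, x, phi): differentiable in phi

mu, _ = decoder(t)
recon = 0.5 * ((x - mu) ** 2).sum(dim=1)                       # -log p(x|t) + const
kl = 0.5 * (s ** 2 + m ** 2 - 1.0 - torch.log(s ** 2)).sum(dim=1)
loss = (recon + kl).mean()                # negative ELBO, unbiased MC estimate

opt.zero_grad()
loss.backward()                           # gradients flow through t into phi and w
opt.step()
```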
1. Scaling Variational Inference and Unbiased Estimate.
See slide 1. Bayesian methods are thought to be mostly suited for small data sets as they are computationally expensive, and to be useful for extracting most information from the small data-set. This view has changed when Bayesian methods met deep learning. The learning goal of this post is also presented in the slide 2.
The rest of the slides are focused on the concept of estimation being unbiased, as building the unbiased estimates for gradients of neural network can be essential. An estimate is called unbiased if its expected values equal to the true mean of the distribution which we want to approximate. Sometimes it is non-trivial to understand if the estimator is unbiased or not - one needs to reduce the particular problem in this case to just expected value of some function which is estimated with average of its samples.
This idea of unbiased estimate for the MC estimation is illustrated in the slides. The key idea is:
- Lets say we want to estimate an expected value using the Monte Carlo method where an average of the samples taken from a distribution p(x) is computed.
- This output of the Monte Carlo estimate can be from the approximated distribution R or G depending on how accurately the samples are drawn. If the expected value of R is similar to the true expected value of p(x), then the Monte Carlo estimate of the expectation is unbiased.
2. Modelling the distribution of images.
Lets start fitting the p(x) into a data set. But firstly, why do we need it? The answer is:
- Generate new images.
- Detect anomalies and outliers.
- Work with the missing data.
- Represent the data in a nice way.
Ok. We have enough motivation, but how do we model the data set?
- Lets try a convolutional neural network which will look at the image, and then return the probability of the image. The logarithm of probability is taken to make things more stable. This approach requires computation of the normalization constant. This normalization constant cannot be computed as it is to make the distribution sum up to 1 with respect to sum according to all possible images in the world - simply too expensive.
- Lets try the chain rule. It is known that any proabilistic distribution can be decomposed into a product of some conditional distributions. Its application to natural images, for example, can be: take 3 by 3 pixel image and enumerate each pixel (row by row can be an option) and say that the distribution of the whole image is the same as the joint distribution of pixels. This joint distribution decomposes into the product of conditional distributions by the chain rule, and the distribution on the whole image therefore equals to the probability of the first pixel, marginal probability plus the probability of the second pixel given the first one, and so on. The natural idea on how to represent these conditional probabilities is the use of recurrent neural network in which the images are read pixel by pixel, and the RNN outputs the prediction for the next pixel. This makes the normalization constant one dimensional and thus the problem is simplified. A downside, though, is that we can only generate new images one pixel at a time and this makes the algorithm slow - clearly for some mega pixel images.
- One can advance the point 2 by assuming that the distribution over pixels is independent. In this case one can feed some distribution into the data but is too restrictive assumption. For example, if you saw one half of the image, you can probably restore the other half quite accurately which means that they're not independent.
- Another option is the Gaussian Mixture Model which provides the flexibility in theory. Yet, it is not efficient for practical case with complicated data like natural images. This will required thousands of Gaussian for images.
- One more thing we can try is an infinite mixture of Gaussians like the probabilistic B, C, E, methods we covered in week two. So, here the idea is that each object, each image X has a corresponding latent variable T, and the image X is caused by this T, so we can marginalize out T. And, the conditional distribution X given T is a Gaussian. So, we kind of have a mixture of infinitely many Gaussians, for each value of T, there's one Gaussian and we mix them with weights. Note here that, even if the Gaussians are factorized, so they have independent components for each dimension, the mixture is not. So, this is a little bit more powerful model than the Gaussian mixture model.
3. Using CNNs with Mixture of Gaussian.
Given the continuous mixture of Gaussian (approach 5 in previous section) and the prior of standard norm (no specific reason) lets force the latent variables t to be around zero and with some unique variants. The likelihood would be Gaussian with some dependence on t.
With parameters that depend on t somehow. So, how can we define these parameters, these pro-metric way to convert t to the parameters of the Gaussian? Well, if we use linear function for Mu of t with some parameters w and b and a constant for sigma of t. Which this Sigma zero can be a parameter or maybe like all these identity matrix, it doesn't matter that much. We'll get the usual PPCA model. And, this probabilistic PPCA model is really nice but it's not powerful enough for our kinds of natural images data. So, let's think what can we change to make this model more powerful. If a linear function is not powerful enough for our purposes, let's use convolutional neural network because it works nice for images data. Right? So, let's say that Mu of t is some convolutional neural network apply it to the latent called t. So it gets as input the latent t and outputs your image or a mean vector for an image. And then Sigma t is also a commercial neural network which takes living quarters input and output your covariance matrix Sigma. This will define our model in some kind of parametric form. So we have them all like this. And let's emphasize that we have some weights and then you'll input w. Let's put them in all parts far off our model definitions. Do not forget about them. We are going to train the model to have them all like this. So pre-meal to facts given the weights of neuron that are w is a mixture of Gaussians, where the parameters of the Gaussians depends on the leading variable t for a convolutional neural network. One problem here is that if for example your images are 100 by 100, then you have just 10000 pixels in each image and it's pretty low resolution. It's not high end in anyway, but even in this case, your covariance matrix will be 10,000 by 10,000. And that's a lot. So we want to avoid that and it's not so reasonable to ask our neural network to output your 10,000 by 10,000 image, or matrix. To get rid of this problem let's just say that our covariance matrix will be diagonal. Instead of outputting the whole large matrix Sigma, we'll ask our neural network to produce just the weights on the diagonal of this covariance matrix. So we will have 10,000 Sigmas here for example and we will put these numbers on the diagonal of covariance matrix to define the actual normal distribution, or condition on the latent variable t. Now our conditional distributions are vectorized. It's Gaussians with zero off diagonal elements in the covariance matrix, but it's okay. Mixture of vectors as Gaussian is not a factor as distribution. So we don't have much problems here. We have our model fully defined, now have to train it somehow. We have to train. The natural way to do it is to use maximum likelihood estimation so to maximize the density of our data set given the parameters; the parameters of the conventional unit neural network. This can be redefined by a sum integral where we marginalize out the latent variable t. Since we have a latent variable, let's use expectation maximization algorithm. It is specifically invented for these kind of models. And in the expectation maximization algorithm, if you recall from week two, we're building a lower bond on the logarithm of this marginal likelihood, P of x given w and we are lower modeling this value by something which depends on w and some new variational parameters Q. 
And then we'll maximize this lower balance with respect to both w and q to get this lower bound as high as possible as accurate so as close to the actual lower for the margin look like what is possible. And the problem here is that when you step off of the play an expectation maximisation algorithm we have to use we have to find the best years original latent variables. And this is intractable in this case because you have to compute some integrals and your integrals contains convolutional neural networks in them. And this is just too hard to do analytically. So E-M is actually not the way to go here. So what else can we do? Well in the previous week we discussed the Markov chain Monte Carlo and we can use we can use this MCMC to approximate M-step of the expectation maximisation. Right. Well. This way on the amstaff we instead of using the expected value with respect to the Q. Which is in the posterior distribution on the latent variables from the previous iteration in that we will approximate this expected value with samples, with an average and then we'll maximize this iteration instead of the expected value. It's an option we can do that. Well it's going to be kind of slow because this way on each iteration of expectation optimization you have to run like hundreds of situation of Markov chain. Wait until have converged and then start to collect samples. So this way you will have kind of a mess that loop. You will have all the reiterations of expectation maximisation and iterations of Markov chain Monte Carlo and this will probably not be very fast to do. So let's see what else can we do. Well we can try variational inference and the idea of variational inference is to maximize the same lower bound but to restrict the distribution you do be vectorized. So for example if the later they will charge for each data object is 50 dimensional then this Q I of T I will be just a product of 50 one dimensional distributions so it's a nice way to go, it's a nice approach. It will approximate your expectation maximisation but it usually works and pretty fast. But it turns out that in this case even this is intractable. So in this approximation is not enough to get an efficient method for training your latent variable model. And we have to approximate even further. So we have to drive even less accurate approximation to be able to build an efficient method for treating this kind of model.
4. Scaling Variational EM.
So let's see how can we improve the idea of variational inference, such that it will be applicable to our latent variable model. So again the idea of variational inference is to maximize lower bound on the thing we want to maximize actually, with respect to a constraint that says that the variational distribution Q for each object should be factorized. So product of one-dimensional distributions. And let's emphasize the fact that each object has its own individual variational distribution Q, and these distributions are not connected in any way. So, one idea we can use here is as follows. So if saying that variational distribution Q for each object factorized is not enough, let's approximate it even further. And let's say that it's a Gaussian. So not only factorized but a factorized Gaussian. This way everything should be easier. Right? So, every object has its own latent variable T_i. And this latent variable T_i will have variational distribution Q_i, which is a Gaussian with some parameters M_i and S_i, which are parameters of our model which we want to train. Then we will maximize our lower bound with respect to these parameters. So, it's a nice idea, but the problem here is that we just added a lot of parameters for each training objects. So, for example if your latent variable Q_i is 50-dimensional, so it's vector with 50 numbers, then you just added 50 numbers for the vector M_i for each object, and 50 numbers for the vector S_i for each object. So 100 numbers, 100 parameters for each training object. And if you have million of training objects, then it's not a very good idea to add like 100 million parameters to your model, just because of some approximation, right? It will probably overfeed, and it will probably be really hard to train because of this really high number of parameters. And also it's not obvious how to find these parameters, M and S, for new objects to do inference, to do some predictions or generation, because for new objects, you have to solve again some optimization problem to find these parameters, and it can be slow. Okay, so we said that approximating the variational distribution with a factorized one is not enough. Approximation of the factors of the variational distribution with Gaussian is nice, but we have too many parameters for each object, because each of these Gaussians are not connected to each other. They have separate parameters. So let's try to connect these variational distributions Q_i of individual objects. One way we can do that is to say that they are all the same. So Q_i's all equal to each other. We can do that, but it will be too restrictive, we'll not be able to train anything meaningful. Other approach here is to say that all Q_i's are the same distribution, but it depends on X_i's and weight. So let's say that each Q_i is a normal distribution, which has parameters that somehow depend on X_i. So it turns out that actually now each Q_i is different, but they all share the same parameterization. So they all share the same form. And now, even for new objects, we can easily find its variational approximation Q. We can pass this new object through the function M, and for the function S, and then find its parameters of its Gaussian. And this way, we now need to maximize our lower bound with respect to our original parameters W. And this parameter Phi, that defines the parametric way on how we convert X_i's to the parameters of the distribution. And how can we define this with this function M of X_i, and with parameters Phi. 
Well, as we have already discussed, convolutional neural networks are a really powerful tool to work with images, right? So let's use them here too. So now we will have a convolutional neural network with parameters Phi that looks at your original input image, for example of a cat, and then transforms it to parameters of your variational distribution. And this way, we defined how can we approximate the variational distribution Q in this form, right? Okay, so let's look closer into the object we are trying to maximize. Recall that the lower bound is, by definition, equal to the sum, with respect to the objects in the data set of expected values of sum logarithm with respect to the variation distribution Q_i, right? And recall that in the plane expectation maximization algorithm it was really hard to approximate this expected value by sampling, because the Q and this expected value used to be the true posterior distribution on the latent variable T_i. And this true posterior is complicated, and we know it up to normalization constant. So we have to use Markov chain Monte Carlo to sample from it, which is slow. But now we approximate Q with a Gaussian, with known parameters which we know how to obtain. So for any object, we can pass it through our convolutional neural network with parameters Phi, obtaining parameters M and S, and then we can easily sample from these Gaussian, from these Q, to approximate our expected value. So now again, is a low half of this intractable expected value. We can easily approximate it with sampling because sampling is now cheap, it's just sampling from Gaussians. And if we recall how the model defined, the P of X_i on T, it's actually defined by another convolutional neural network. So the overall workflow will be as follows. We started with training image X, we pass it through the first neural network with parameters Phi. We get the parameters M and S of the variational distribution Q_i. We sample from this distribution one data point, which is something random. It can be different depending on our random seat or something. And then we pass this just sampled vector of latent variable T_i into the second part of our neural network, so into the convolutional neural network with parameters W. And this CNN, this second part, outputs us the distribution on the images, and actually we will try to make this whole structure to return us the images that are as close to the input images as possible. So this thing is looks really close to something called auto encoders in neural networks, which is just a neural network which is trying to output something which is as close as possible to the input. And this model is called variational auto encoder, because in contrast to the usual auto encoders, it has some assembly inside and it has some variational approximations. And the first part of this network is called encoder because it encodes the images into latent code or into the distribution on latent code. And the second part is called decoder, because it decodes the latent code into an image. Let's look what will happen if we forget about the variance in the variational distribution q. So let's say that we set s to be always zero, okay? So for any M(X), S of X is 0. Then the variational distribution QI is actually a deterministic one. It always outputs you the main value, M of XI. And in this case, we are actually directly passing this M of X into the second part of the network, into the decoder. 
So this way were updating the usual autoencoder, no stochastic elements inside. So this variance in the variational distribution Q is actually something that makes this model different from the usual autoencoder. Okay, so let's look a little bit closer into the objective we're trying to maximize. So this lower band, variational lower band, it can be decomposed into a sum of two terms, because the logarithm of a product is the sum of logarithms, right? And the second term in this equation equals to minus Kullback-Leibler divergence between the variational distribution Q and the prime distribution P of Ti. Just by definition. So KL divergence is something we discussed in week two, and also week three and it's something which measures some kind of a difference between distributions. So when we maximize this minus KL we are actually trying to minimize KL so we are trying to push the variational distribution QI as close to the prior as possible. And the prior is just the standard normal, as we decided, okay? This is the second term and the first term can be interpreted as follows, if for simplicity we set all the output variances to be 1, then this log likelihood of XI given Ti is just minus euclidean distance between XI and the predicted mu of Ti. So this thing is actually a reconstruction loss. It tries to push XI as close to the reconstruction as possible. And mu of Ti is just the mean output of our neural network. So if we consider our whole variational autoencoder, it takes as input an image X, XI, and then it's our posterior mu of Ti plus some noise. And if noise is constant, then we're training this model, we're just trying to make XI as close to mu of Ti as possible which is basically the objective of the usual autoencoder. And note that we are also computing the expected failure of this reconstruction loss with respect to the QI and QI is trying to approximate the posterior distrobution of the latent variables. So we're trying to say that for the latent variables Ti that are likely to cause X, according to our approximation of QI, we want the reconstruction loss to be low. So we want for these particular sensible Ti's for this particular XI, we want the reconstruction to be accurate. And this is kind of the same, not the same but it's really close to the usual autoencoder. But the second part is what makes the difference. This Kullback-Leibler divergence, it's something that pushes the QI to be non-deterministic, to be stochastic. So if you recall the idea that if we set the QI variance to zero we get the usual autoencoder, right? But why, while training the model, will we not choose that? Because if you reduce the number of noise inside it will be easier to train. So why will it choose not to inject noise in itself? Well, because of this regularization. So this KL divergence, it will not allow QI to be very deterministic because if QI variance is zero then this KL term is just infinity and we will not choose this kind of point of parameters. This regularization forces the overall structure to have some noise inside. And also notice that because of this KL divergence, because we are forcing our QI to be close to the standard Gaussian, we may now detect outliers because if we have a usual image from the training data set or something close to the training data set, then if you pass this image through our encoder, then it will output as a distribution, QI, which is close to the standard Gaussian. Because they train it this way. 
Because during training we try to force all those distributions to lie close to the standard Gaussian. But for a new image which the network never saw, of some suspicious behavior or something else, the conditional neural network of the encoder never saw these kind of images, right? So it can output your distribution on Ti as far away from the Gaussian at it wants. Because it wasn't trained to make them close to Gaussian. And so by looking at the distance between the variational distribution QI and the standard Gaussian, you can understand how anomalistic this point is and you can detect outliers. And also note that it's kind of easy to generate new points, nearly to hallucinate new data in these kind of models. So, because your model is defined this way, as an integral with respect to P of T, you can make a new point, a new image in two steps. First of all, sample Ti from the prior, from the standard normal and then just pass this sample from the standard Gaussian through your decoder network to decode your latent code into an image, and you will get some new samples of a fake silly picture or a fake ad or something.
5. Gradient of Decoder
In the previous video, we completely defined our model. Now all that is left is to understand how to maximize the objective with respect to the weights of both neural networks, w and phi. Since the objective has an expected value inside, we will have to approximate it with Monte Carlo somehow. So let's look closer at the objective.

The second part is easy: it is the KL divergence between a Gaussian with known parameters and the standard Gaussian, so although it has an integral inside, we can compute it analytically. This expression will not cause us any trouble, either in evaluating it or in finding gradients with respect to the parameters. We can simply define the KL divergence by its analytical formula and let TensorFlow take care of the gradients.

Now let's look a little closer at the first term; call it f(w, phi). This function is a sum over objects of expected values of a log-likelihood, and recall that we decided that each qi for an individual object is a distribution q(ti | xi, phi), defined by a convolutional neural network with parameters phi.

Let's start with the gradient of f with respect to w. The latent variable ti is continuous, so writing the expected value by definition, it is the integral of the probability q(ti | xi, phi) times the logarithm of p(xi | ti, w). We can move the gradient sign inside the summation, because summation and differentiation do not interfere with each other. And for smooth, well-behaved functions, we can usually also swap the integration and the gradient signs. Finally, since q(ti | xi, phi) does not depend on w, we can push the gradient sign even further inside: this q is just a constant with respect to w, so it simply multiplies the gradient of the logarithm without affecting it.

What we obtain is just an expected value of a gradient: a sum over the objects in the data set of the expected value of the gradient of the log-likelihood. We can approximate this expected value by sampling: sample, for example, one point from the distribution q(ti | xi, phi), plug it into log p(xi | ti, w), and compute the gradient with respect to w.

So basically what we are doing here is: we pass our image through the encoder to get the parameters of the variational distribution q(ti | xi, phi); we sample one point from this distribution; we feed this point as input to the second neural network with parameters w; and then we compute the usual gradient of this second neural network with respect to its parameters, given that its input is the sample t̂i. This is just the usual gradient, and we can use TensorFlow to find it automatically. Finally, this expression depends on the whole data set, but we can easily approximate it with a mini-batch: we rewrite it as a normalizing constant times a sum over a mini-batch of random objects chosen for this particular iteration.
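In symbols, the chain of manipulations just described is the following (a sketch in this post's notation):

```latex
\nabla_w \, \mathbb{E}_{q(t_i \mid x_i, \phi)} \big[ \log p(x_i \mid t_i, w) \big]
  = \int q(t_i \mid x_i, \phi) \, \nabla_w \log p(x_i \mid t_i, w) \, dt_i
  = \mathbb{E}_{q} \big[ \nabla_w \log p(x_i \mid t_i, w) \big]
  \approx \nabla_w \log p(x_i \mid \hat{t}_i, w),
  \qquad \hat{t}_i \sim q(t_i \mid x_i, \phi)
```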
And this is a standard stochastic gradient for a neural network. You don't have to think too hard here: you just find the gradient of the second part of your neural network with respect to its parameters using TensorFlow. The overall scheme is as follows. We take our objective and pass the input image through the first convolutional neural network with parameters phi. We find the parameters m and s of the variational distribution q. We sample one point from this Gaussian with parameters m and s. We put this point t̂i into the second convolutional neural network with parameters w, treating t̂i as input data, as a training object, for this second network. Then we compute the objective of this second CNN and use TensorFlow to differentiate it with respect to the parameters.

Note that here we always used unbiased estimation of the expected values: we always substituted expected values with sample averages, and not with some more complicated expressions whose unbiasedness is not obvious. So here everything is unbiased, and on average this stochastic approximation of the gradient will be correct. If you do enough iterations, you will converge to a good point in your parameter space.
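A minimal sketch of that step in TensorFlow 2 style; the `encoder` and `decoder` networks, the shapes, and the unit-variance likelihood are all assumptions made for illustration:

```python
import tensorflow as tf

latent_dim, image_dim = 64, 100 * 100

# Hypothetical encoder: image -> (m, log s^2); hypothetical decoder: t -> mu(t).
encoder = tf.keras.Sequential([
    tf.keras.layers.Dense(2 * latent_dim, input_shape=(image_dim,))
])
decoder = tf.keras.Sequential([
    tf.keras.layers.Dense(image_dim, input_shape=(latent_dim,))
])

x = tf.random.uniform((8, image_dim))          # stand-in mini-batch

# Sample t_hat from q(t | x, phi); done OUTSIDE the tape, because in this
# section we only need the gradient with respect to the decoder weights w.
m, log_var = tf.split(encoder(x), 2, axis=-1)
s = tf.exp(0.5 * log_var)
t_hat = m + s * tf.random.normal(tf.shape(m))  # one sample per object

with tf.GradientTape() as tape:
    mu = decoder(t_hat)
    # With unit output variance, -log p(x|t,w) is the squared error
    # between x and mu(t), up to constants: the reconstruction loss.
    recon_loss = tf.reduce_mean(tf.reduce_sum((x - mu) ** 2, axis=-1))

grads_w = tape.gradient(recon_loss, decoder.trainable_variables)
```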
6. Log Derivative Trick
Okay, so now let's discuss how to find the gradient with respect to the parameters phi. Here is our objective, and we want to differentiate it. Again, let's rewrite the expected value by definition, as an integral of the probability times the logarithm. We can move the gradient sign inside the summation, which changes nothing, and also inside the integral: if the functions are smooth and nice, we can swap the integration and differentiation signs.

However, in contrast to the case in the previous video, we cannot push the differentiation sign all the way inside, next to the logarithm. First of all, the gradient of the logarithm of p(xi | ti, w) with respect to phi is zero, because it doesn't depend on phi, so the right-hand side of such an expression would just be zero, which is obviously not what the left-hand side is. The real reason we can't do this is that q itself depends on phi. So we have to find the gradient of q with respect to phi. And if we do that, the problem is that we no longer have an expected value. Look at the first equation on this slide: it is a sum of integrals of the gradient of q times the logarithm of p. This is not an expected value with respect to any distribution, so you can't approximate it with Monte Carlo; you can't sample from some distribution and use the samples to approximate it, because there is no distribution here. There is just a gradient of a distribution, which is not a distribution, and a logarithm of a distribution, which is also not a distribution.

So how can we approximate this gradient with something? One thing we can do is the following: we can artificially introduce a distribution inside. We multiply and divide by the distribution q, and then treat this q as the probabilities, and the gradient of q times log p, divided by q, as the function whose expected value we are computing. If we simplify this expression a little, the gradient of q divided by q is just the gradient of the logarithm of q, by the definition of the gradient of a logarithm. Then we can rewrite the formula as follows: it is an integral of q, times the gradient of the logarithm of q, times the logarithm of p. This is an exact formula; we didn't lose anything to any kind of approximation. And this last expression is an expected value with respect to q: the expected value of the gradient of the logarithm of q times the logarithm of p.

This is sometimes called the log-derivative trick, and it works for any distribution. It allows you to differentiate an expected value even when the gradient of the expected value is not itself an expected value. Now you have an expected value again, so you can sample from q and approximate the gradient with Monte Carlo. It is a valid approach, and until recently people used it, and it kind of worked. But here is the problem: although the expected value is exact, if you try to approximate it with Monte Carlo, you will get a really loose approximation. Its variance will be high, and you will have to sample lots and lots of points to get an approximation of the gradient that is at least a little bit accurate. And the reason is this logarithm of p.
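Before looking at the variance problem, here is the identity written out (a sketch in this post's notation, with f = log p playing the role of the function under the expectation):

```latex
\nabla_\phi \int q(t_i \mid x_i, \phi) \, \log p(x_i \mid t_i, w) \, dt_i
  = \int \nabla_\phi q(t_i \mid x_i, \phi) \, \log p(x_i \mid t_i, w) \, dt_i
  = \int q(t_i \mid x_i, \phi) \, \nabla_\phi \log q(t_i \mid x_i, \phi) \, \log p(x_i \mid t_i, w) \, dt_i
  = \mathbb{E}_{q} \big[ \nabla_\phi \log q(t_i \mid x_i, \phi) \, \log p(x_i \mid t_i, w) \big]
```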
When we start training, this p(x) is really low for every image: p(x) is a distribution over all natural images, and it has to assign some probability to every image. So at the start, when the model knows nothing about our data, any image is really improbable according to the model, and the logarithm of this probability may be something like minus one million. The model has not yet adapted to the training data, so it thinks the training images are extremely improbable. This means we are computing an expected value of something times minus one million. And because the gradient of the logarithm of q can be positive or negative, when we do Monte Carlo and average a few samples, we get something like minus one million, plus nine hundred thousand, minus one million one hundred thousand, and so on: terms that are huge in absolute value but of different signs. On average they are correct; the true value of the gradient may be around, say, one hundred. But the variance is so high that you would need a huge number of samples to approximate the gradient accurately.

Note that we didn't have this problem in the previous video, because instead of the logarithm of p we had the gradient of the logarithm of p, and even if log p is around minus one million, its gradient will probably not be that large. So this is a problem, and in the next video we will talk about one nice solution to it in this particular case: how can we estimate this gradient with a small-variance estimator?
7. Reparameterization Trick
So let's return to our problem of estimating the gradient of the objective with respect to the parameters phi. In the previous video, we discussed that if we use something called the log-derivative trick, we can build a stochastic approximation of this gradient, but the variance of this approximation will be really high, so it will be really inefficient for training the model. Let's look at a really nice, simple, and brilliant idea for making this approximation much better.

First of all, recall that ti is a sample from the distribution q(ti | xi, phi). Let's make a change of variables: instead of sampling ti directly, we sample a new variable epsilon i from the standard normal, and then build ti from it by multiplying it element-wise by the standard deviation si and adding the mean mi. This way, the distribution of the expression epsilon i times si plus mi is exactly q, the same as the distribution of ti. So instead of sampling ti from q, we can sample epsilon i and then apply this deterministic function g, multiplying by si and adding mi, to get a sample from the actual distribution of ti.

Now we can change our objective: instead of computing the expected value with respect to the distribution q, we compute the expected value with respect to the distribution of epsilon i, and use this function of epsilon i everywhere in place of ti. This is an exact expression; we didn't lose anything, we just changed the variables. Instead of a distribution on ti, we consider a distribution on epsilon i and convert the epsilon i samples into samples of ti. Note that this function g that converts epsilon i into ti depends on xi and on phi: to do the conversion, it passes the image xi through the convolutional neural network with parameters phi to obtain si and mi, and then computes si times epsilon i plus mi.

Now we can push the gradient sign inside the expected value, past the probability of epsilon i, because it does not depend on phi, the parameters we are differentiating with respect to. This means we now have an expected value of some expression, without ever introducing artificial distributions like in the previous video; we obtained the expected value naturally. And this expected value is with respect to the distribution of epsilon i, which is just the standard normal, without any parameters, so we can approximate it with a sample from the standard normal.

Ultimately, we have rewritten the gradient of our objective with respect to phi as a sum over objects of an expected value, with respect to the standard normal, of the gradient of some function, which is just the standard gradient of the whole neural network that defines the whole operation. Now you can redraw the picture as follows: you take an input image xi, pass it through a convolutional neural network with parameters phi, compute the variational parameters m and s, and then sample one vector epsilon from the standard normal distribution.
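In formulas, this change of variables looks as follows (a sketch in this post's notation, with element-wise product):

```latex
t_i = g(\varepsilon_i, x_i, \phi)
    = m_i(x_i, \phi) + s_i(x_i, \phi) \odot \varepsilon_i,
\qquad \varepsilon_i \sim \mathcal{N}(0, I)

\nabla_\phi \, \mathbb{E}_{q(t_i \mid x_i, \phi)} \big[ \log p(x_i \mid t_i, w) \big]
  = \mathbb{E}_{\varepsilon_i \sim \mathcal{N}(0, I)}
    \big[ \nabla_\phi \log p\big(x_i \mid g(\varepsilon_i, x_i, \phi), w\big) \big]
```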
You then use these three values, m, s, and epsilon, to deterministically compute ti, and put this ti into the second convolutional neural network. When you define your model like this, there is only one place with stochastic units: this epsilon i from the standard normal distribution. This way you can differentiate your whole structure with respect to phi and w without trouble: you can just use TensorFlow and it will find the gradients with respect to all the parameters, because you no longer have to differentiate through sampling. Sampling becomes an outside procedure that just supplies an input, and everything else is a deterministic function. This is basically the implementation of the idea we have just discussed, the reparameterization trick. We approximate the gradients by sampling just one point and then computing the gradient of the logarithm of this complex function, log p of xi given g(epsilon i, xi, phi) and w, which is just the full neural network with both encoder and decoder.

To summarize, we have obtained a model that allows you to fit a probability distribution p(x) to complicated structured data, for example images. It uses a model of an infinite mixture of Gaussians, but to define the parameters of these Gaussians, it uses a convolutional neural network whose weights are trained with variational inference. For learning, we can't use the usual expectation maximization, because the posterior on the latent variables is intractable, and we can't use variational expectation maximization either, because the expectations inside it are also intractable. So we derived a kind of stochastic version of variational inference that is applicable, first of all, to large data sets, because we can use mini-batches, and second, to this complicated model, where the usual variational inference couldn't have been used, because it has neural networks inside and every integral is intractable. The model we arrived at is called the variational autoencoder. It is like the plain usual autoencoder, but it has noise inside and uses regularization to make sure the noise stays, so the model chooses the right amount of noise to use. It can be used, for example, to generate nice images, to handle missing data, or to find outliers in the data.
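Putting the whole section together, here is a minimal end-to-end training step; a sketch only, in which the network architectures, shapes, and unit output variance are assumptions for illustration, not the post's exact model:

```python
import tensorflow as tf

latent_dim, image_dim = 64, 100 * 100

encoder = tf.keras.Sequential([          # x -> (m, log s^2)
    tf.keras.layers.Dense(256, activation="relu", input_shape=(image_dim,)),
    tf.keras.layers.Dense(2 * latent_dim),
])
decoder = tf.keras.Sequential([          # t -> mu(t)
    tf.keras.layers.Dense(256, activation="relu", input_shape=(latent_dim,)),
    tf.keras.layers.Dense(image_dim),
])
optimizer = tf.keras.optimizers.Adam(1e-3)

@tf.function
def train_step(x):
    with tf.GradientTape() as tape:
        m, log_var = tf.split(encoder(x), 2, axis=-1)
        s = tf.exp(0.5 * log_var)
        eps = tf.random.normal(tf.shape(m))   # the only stochastic unit
        t = m + s * eps                       # reparameterized sample
        mu = decoder(t)
        # Reconstruction term: squared error, i.e. -log p(x|t,w) with unit
        # output variance, up to constants.
        recon = tf.reduce_sum((x - mu) ** 2, axis=-1)
        # Closed-form KL( N(m, diag(s^2)) || N(0, I) ).
        kl = 0.5 * tf.reduce_sum(s**2 + m**2 - 1.0 - log_var, axis=-1)
        loss = tf.reduce_mean(recon + kl)     # negative lower bound
    variables = encoder.trainable_variables + decoder.trainable_variables
    optimizer.apply_gradients(zip(tape.gradient(loss, variables), variables))
    return loss

loss = train_step(tf.random.uniform((8, image_dim)))  # stand-in batch
```

Because the sampling of eps happens inside the tape but involves no trainable parameters, gradients flow through m and s into the encoder, which is exactly what the reparameterization trick buys us.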