Consider the following kind of problem: you have a very complex distribution whose density is too difficult to evaluate, or whose explicit expression is unknown entirely, and you can only obtain samples from it through a very expensive process. In this case, how can we simplify the problem and sample from the target distribution easily? Recall the change-of-variables formula from probability theory: suppose we have a simple *base distribution* that is easy to sample from and whose density has an analytic expression. Through one or more invertible maps, we can transform the base distribution into the target distribution; then we simply sample from the base distribution and push those samples through the maps to obtain target samples. This construction is called a *normalizing flow*.
Change of Variables in Probability Distributions
We can transform a probability distribution using an invertible mapping (i.e. a bijection). Let $z$ be a random variable with density $q(z)$ and $f$ an invertible smooth mapping. We can use $f$ to transform $z$. The resulting random variable $z' = f(z)$ has the following probability distribution:

$$q(z') = q(z) \left| \det \frac{\partial f^{-1}}{\partial z'} \right| = q(z) \left| \det \frac{\partial f}{\partial z} \right|^{-1}.$$
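As a sanity check, the formula can be verified numerically for a simple affine map (a hypothetical example with made-up numbers, not from the original text): if $z \sim \mathcal{N}(0,1)$ and $x = 2z + 1$, then $x \sim \mathcal{N}(1, 4)$, and the change-of-variables formula must reproduce that density.

```python
import math

def normal_pdf(x, mu=0.0, sigma=1.0):
    # Density of a univariate Gaussian N(mu, sigma^2)
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

# Base variable z ~ N(0, 1); transform x = f(z) = 2z + 1, so df/dz = 2.
z = 0.7
x = 2 * z + 1

# Change of variables: q(x) = q(z) * |det df/dz|^{-1}
q_x = normal_pdf(z) / abs(2)

# The transformed variable is exactly N(1, 2^2), so the two must agree.
assert math.isclose(q_x, normal_pdf(x, mu=1.0, sigma=2.0))
```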
We can apply a series of mappings and obtain a normalizing flow.
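Chaining $K$ such transformations $f_1, \dots, f_K$ (the subscript notation here is an assumption, chosen for consistency with the formula above) gives:

```latex
z_K = f_K \circ \cdots \circ f_1(z_0), \qquad
\ln q_K(z_K) = \ln q_0(z_0) - \sum_{k=1}^{K} \ln \left| \det \frac{\partial f_k}{\partial z_{k-1}} \right|
```

The log-determinant terms accumulate additively, which is why flows are trained in log-density space.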
This series of transformations can turn a simple probability distribution (e.g. a Gaussian) into a complicated multi-modal one. To be of practical use, however, we consider only transformations whose Jacobian determinants are easy to compute. Two simple families of such transformations are planar flows and coupling layers.
Simple Flows
In this section, I will introduce two common flows, planar flows and coupling layers, which are widely used in practice.
Planar Flow
Planar flows use functions of the form

$$f(z) = z + u \, h(w^\top z + b)$$

with $u, w \in \mathbb{R}^D$ and $b \in \mathbb{R}$, and $h$ an element-wise non-linearity. Let $\psi(z) = h'(w^\top z + b)\, w$. The determinant can be easily computed as

$$\left| \det \frac{\partial f}{\partial z} \right| = \left| 1 + u^\top \psi(z) \right|.$$
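A minimal sketch of a planar flow with $h = \tanh$ in PyTorch (the function name and parameter names are assumptions, and no constraint on $u$ ensuring invertibility is enforced here):

```python
import torch

def planar_flow(z, u, w, b):
    """Apply f(z) = z + u * tanh(w^T z + b) and return (f(z), log|det df/dz|).

    z: (batch, D) input samples; u, w: (D,) parameters; b: scalar.
    """
    a = z @ w + b                                       # (batch,) pre-activation w^T z + b
    f_z = z + u * torch.tanh(a).unsqueeze(-1)           # z + u * h(a)
    psi = (1 - torch.tanh(a) ** 2).unsqueeze(-1) * w    # h'(a) * w, shape (batch, D)
    log_det = torch.log(torch.abs(1 + psi @ u))         # log|1 + u^T psi(z)|
    return f_z, log_det

# With u = 0 the map is the identity and the log-determinant is zero.
z = torch.randn(4, 3)
f_z, log_det = planar_flow(z, torch.zeros(3), torch.randn(3), torch.tensor(0.5))
assert torch.allclose(f_z, z) and torch.allclose(log_det, torch.zeros(4))
```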
Coupling Layers

A way to implement a normalizing flow that results in a triangular Jacobian is to use *coupling layers*, introduced by Dinh et al. (2015). A coupling layer leaves the first $d$ elements of the input unchanged, while the remaining elements are linearly transformed with a (possibly nonlinear) function of the first $d$ elements. More formally, if $x \in \mathbb{R}^D$ is the input vector and by $x_{1:d}$ we denote a slice of the first $d$ elements, a coupling layer implements the following equations to produce the output vector $y$:

$$y_{1:d} = x_{1:d}, \qquad y_{d+1:D} = x_{d+1:D} \odot \exp\!\big(s(x_{1:d})\big) + t(x_{1:d}),$$
where $\odot$ denotes element-wise multiplication, and $s$ and $t$ are scaling and translation functions that can be implemented with neural networks.
We may now ask two questions:
1) What is the inverse of this transformation? We can easily invert the two equations above to obtain

$$x_{1:d} = y_{1:d}, \qquad x_{d+1:D} = \big(y_{d+1:D} - t(y_{1:d})\big) \odot \exp\!\big(-s(y_{1:d})\big).$$
2) What is the determinant of the Jacobian? Since the first $d$ elements of the output are the same as the input, there will be a block in the Jacobian containing an identity matrix. The remaining elements only depend on the first $d$ elements and themselves, so the Jacobian is lower-triangular with the following structure:

$$\frac{\partial y}{\partial x} = \begin{pmatrix} I_d & 0 \\ A & \operatorname{diag}\big(\exp(s(x_{1:d}))\big) \end{pmatrix}$$
In this case, $A = \partial y_{d+1:D} / \partial x_{1:d}$ is a block of derivatives we don't care about, and the determinant reduces to the product of the diagonal of the lower-right block, which simplifies further when computing the logarithm:

$$\log\left|\det \frac{\partial y}{\partial x}\right| = \log \prod_{j} \exp\big(s(x_{1:d})_j\big) = \sum_{j} s(x_{1:d})_j.$$
Lastly, for the inverse of the transformation we have

$$\log\left|\det \frac{\partial x}{\partial y}\right| = -\sum_{j} s(y_{1:d})_j.$$
We now have all the ingredients to implement a coupling layer. To select only some parts of the input we will use a binary tensor to mask values, and for the scaling and translation functions we will use a small MLP that outputs both $s$ and $t$, sharing parameters between the two functions. The code is as follows:
```python
import torch
from torch import nn

class Coupling(nn.Module):
    def __init__(self, dim, num_hidden, mask):
        super().__init__()
        # The MLP for the scaling and translation functions:
        # it takes the unchanged half and outputs both s and t
        self.nn = nn.Sequential(
            nn.Linear(dim // 2, num_hidden), nn.ReLU(),
            nn.Linear(num_hidden, num_hidden), nn.ReLU(),
            nn.Linear(num_hidden, num_hidden), nn.ReLU(),
            nn.Linear(num_hidden, dim),
        )
        # Initialize the coupling to implement the identity transformation
        self.nn[-1].weight.data.zero_()
        self.nn[-1].bias.data.zero_()
        # Register the mask as a buffer: it is saved and moved with the
        # module, but is not a trainable parameter
        self.register_buffer('mask', mask)

    def forward(self, z, log_det, inverse=False):
        mask = self.mask
        neg_mask = ~mask
        # Compute scale and translation from the unchanged half of the input
        s_t = self.nn(z[:, mask])
        s, t = torch.chunk(s_t, chunks=2, dim=1)
        out = z.clone()
        if not inverse:
            # y = x * exp(s) + t on the masked-out half
            out[:, neg_mask] = z[:, neg_mask] * torch.exp(s) + t
            log_det = log_det + s.sum(dim=1)
        else:
            # x = (y - t) * exp(-s) on the masked-out half
            out[:, neg_mask] = (z[:, neg_mask] - t) * torch.exp(-s)
            log_det = log_det - s.sum(dim=1)
        return out, log_det
```
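As a quick numerical check of the coupling equations above, here is a standalone sketch (independent of the class, with an arbitrary choice of $s$ and $t$) verifying that the forward and inverse passes round-trip:

```python
import numpy as np

rng = np.random.default_rng(0)
d, D = 2, 4
x = rng.normal(size=D)

# Arbitrary smooth scaling and translation functions of the first d elements
s = lambda h: np.tanh(h)
t = lambda h: h ** 2

# Forward pass: y_{1:d} = x_{1:d}, y_{d+1:D} = x_{d+1:D} * exp(s) + t
y = x.copy()
y[d:] = x[d:] * np.exp(s(x[:d])) + t(x[:d])

# Inverse pass: recover x from y using only y_{1:d}
x_rec = y.copy()
x_rec[d:] = (y[d:] - t(y[:d])) * np.exp(-s(y[:d]))

assert np.allclose(x_rec, x)       # the coupling layer is exactly invertible
log_det = s(x[:d]).sum()           # log|det dy/dx| = sum_j s(x_{1:d})_j
```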
Similarly to fitting any probabilistic model, fitting a flow-based model to a target distribution can be done by minimizing some divergence or discrepancy between them.
This minimization is performed with respect to the model's parameters $\theta = \{\phi, \psi\}$, where $\phi$ are the parameters of the transformation $f$ and $\psi$ are the parameters of the base distribution $p_u(u; \psi)$. In the following sections, I will discuss a number of divergences for fitting flow-based models, with a particular focus on the Kullback–Leibler divergence, as it is one of the most popular choices.
Forward KL divergence
The forward KL divergence between the target distribution $p^*(x)$ and the flow-based model $p_x(x;\theta)$ can be written as follows:

$$\mathcal{L}(\theta) = D_{\mathrm{KL}}\big[\,p^*(x) \,\big\|\, p_x(x;\theta)\,\big] = -\mathbb{E}_{p^*(x)}\Big[\log p_u\big(f^{-1}(x;\phi);\psi\big) + \log\big|\det J_{f^{-1}}(x;\phi)\big|\Big] + \text{const.}$$
The forward KL divergence is well-suited for situations in which we have samples from the target distribution (or the ability to generate them), but we cannot necessarily evaluate the target density $p^*(x)$. Assuming we have a set of samples $\{x_n\}_{n=1}^N$ from $p^*(x)$, we can estimate the expectation over $p^*(x)$ by Monte Carlo as follows:

$$\mathcal{L}(\theta) \approx -\frac{1}{N}\sum_{n=1}^{N}\Big[\log p_u\big(f^{-1}(x_n;\phi);\psi\big) + \log\big|\det J_{f^{-1}}(x_n;\phi)\big|\Big] + \text{const.}$$
Minimizing the above Monte Carlo approximation of the KL divergence is equivalent to fitting the flow-based model to the samples $\{x_n\}$ by maximum likelihood estimation. In practice, we typically optimize the parameters iteratively with stochastic gradient-based methods. We can obtain an unbiased estimate of the gradient of the KL divergence with respect to the parameters $\phi$ as follows:

$$\nabla_\phi \mathcal{L}(\theta) \approx -\frac{1}{N}\sum_{n=1}^{N} \nabla_\phi\Big[\log p_u\big(f^{-1}(x_n;\phi);\psi\big) + \log\big|\det J_{f^{-1}}(x_n;\phi)\big|\Big].$$
The update with respect to $\psi$ may also be done in closed form if $p_u(u;\psi)$ admits closed-form maximum likelihood estimates, as is the case for example with Gaussian distributions.
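A minimal sketch of maximum-likelihood (forward-KL) training in PyTorch, assuming a 1-D affine flow $x = a u + b$ with a standard-normal base distribution (all names, data, and hyperparameters here are illustrative assumptions, not from the original text):

```python
import torch

torch.manual_seed(0)
data = 2.0 * torch.randn(2000) + 5.0       # samples from the "target" N(5, 4)

# Affine flow x = f(u) = a*u + b; inverse u = (x - b)/a, log|det J_{f^-1}| = -log|a|
log_a = torch.zeros(1, requires_grad=True) # parameterize a = exp(log_a) > 0
b = torch.zeros(1, requires_grad=True)
opt = torch.optim.Adam([log_a, b], lr=0.05)

base = torch.distributions.Normal(0.0, 1.0)
for _ in range(1000):
    u = (data - b) * torch.exp(-log_a)     # f^{-1}(x) for every sample
    # Negative log-likelihood: -E[log p_u(f^{-1}(x)) + log|det J_{f^-1}|]
    loss = -(base.log_prob(u) - log_a).mean()
    opt.zero_grad(); loss.backward(); opt.step()

# a should approach the data std (about 2) and b the data mean (about 5)
print(torch.exp(log_a).item(), b.item())
```

For this affine flow the MLE is available in closed form (the data mean and standard deviation), which makes the sketch easy to verify.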
Reverse KL divergence
Alternatively, we may fit the flow-based model by minimizing the reverse KL divergence, which can be written as follows:

$$\mathcal{L}(\theta) = D_{\mathrm{KL}}\big[\,p_x(x;\theta)\,\big\|\,p^*(x)\,\big] = \mathbb{E}_{p_u(u;\psi)}\Big[\log p_u(u;\psi) - \log\big|\det J_f(u;\phi)\big| - \log p^*\big(f(u;\phi)\big)\Big].$$
We made use of a change of variable in order to express the expectation with respect to $u$. The reverse KL divergence is suitable when we have the ability to evaluate the target density $p^*(x)$, but not necessarily to sample from it. In fact, we can minimize $\mathcal{L}(\theta)$ even if we can only evaluate $p^*(x)$ up to a multiplicative normalizing constant $C$, since in that case $\log C$ will be an additive constant in the above expression for $\mathcal{L}(\theta)$. We may therefore assume that $p^*(x) = \tilde{p}(x)/C$, where $\tilde{p}(x)$ is tractable but $C$ is not, and rewrite the reverse KL divergence as:

$$\mathcal{L}(\theta) = \mathbb{E}_{p_u(u;\psi)}\Big[\log p_u(u;\psi) - \log\big|\det J_f(u;\phi)\big| - \log \tilde{p}\big(f(u;\phi)\big)\Big] + \text{const.}$$
We can minimize $\mathcal{L}(\theta)$ iteratively with stochastic gradient-based methods in practice. Since we reparameterized the expectation to be with respect to the base distribution $p_u(u;\psi)$, we can easily obtain an unbiased estimate of the gradient of $\mathcal{L}(\theta)$ with respect to $\phi$ by Monte Carlo. In particular, let $\{u_n\}_{n=1}^N$ be a set of samples from $p_u(u;\psi)$; the gradient of $\mathcal{L}(\theta)$ with respect to $\phi$ can be estimated as follows:

$$\nabla_\phi \mathcal{L}(\theta) \approx -\frac{1}{N}\sum_{n=1}^{N} \nabla_\phi\Big[\log\big|\det J_f(u_n;\phi)\big| + \log \tilde{p}\big(f(u_n;\phi)\big)\Big].$$
In order to minimize the reverse KL divergence, we need to be able to sample from the base distribution $p_u(u;\psi)$, as well as compute and differentiate through the transformation $f$ and its Jacobian determinant. That means that we can fit a flow-based model by minimizing the reverse KL divergence even if we cannot evaluate the base density or compute the inverse $f^{-1}$. However, we will need these operations if we want to evaluate the density of the trained model.
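A minimal sketch of reverse-KL training with an unnormalized 1-D target $\tilde{p}(x) = \exp(-(x-2)^2/2)$ (i.e. $\mathcal{N}(2,1)$ up to a constant) and the same affine flow as before; the target, names, and hyperparameters are illustrative assumptions:

```python
import torch

torch.manual_seed(0)

# Unnormalized target: log p~(x) = -(x - 2)^2 / 2
log_p_tilde = lambda x: -0.5 * (x - 2.0) ** 2

# Affine flow x = f(u) = a*u + b with standard-normal base; log|det J_f| = log|a|
log_a = torch.zeros(1, requires_grad=True)
b = torch.zeros(1, requires_grad=True)
opt = torch.optim.Adam([log_a, b], lr=0.05)

base = torch.distributions.Normal(0.0, 1.0)
for _ in range(1000):
    u = base.sample((512,))                # reparameterized samples from p_u
    x = torch.exp(log_a) * u + b           # push through the flow
    # Reverse KL up to a constant: E[log p_u(u) - log|a| - log p~(f(u))]
    loss = (base.log_prob(u) - log_a - log_p_tilde(x)).mean()
    opt.zero_grad(); loss.backward(); opt.step()

# a should approach 1 and b should approach 2 (the target's std and mean)
print(torch.exp(log_a).item(), b.item())
```

Note that the target density is only ever evaluated, never sampled, and the flow is never inverted, matching the requirements discussed above.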