In this blog, I would like to record the basic principle behind the VAE (variational autoencoder). The VAE has an encoder-decoder framework, in which we can use different network structures, such as DNNs, CNNs, etc. The framework is illustrated in the following figure.
The goal of VAE
Suppose we have a data distribution $p(x)$ and many samples $\{x_i\}_{i=1}^{N}$ drawn from it. What we want to do is construct a probabilistic surrogate $p_\theta(x)$. In general this task is quite difficult: we have no additional information about any latent variable, nor do we know the explicit form of $p(x)$. Thus, what we can do is simply maximize the log-likelihood

$$\max_{\theta} \sum_{i=1}^{N} \log p_\theta(x_i).$$
This optimization can be quite difficult since we don't know anything about the form of $p_\theta(x)$. However, we can assume that $x$ is generated from some latent variable $z$, where $z \in \mathbb{R}^{d}$. By assuming the distribution of $z$, i.e., the prior distribution $p(z)$, we can express $p_\theta(x)$ through the marginal distribution as

$$p_\theta(x) = \int p_\theta(x \mid z)\, p(z)\, dz = \mathbb{E}_{z \sim p(z)}\!\left[ p_\theta(x \mid z) \right].$$
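To make this marginalization concrete, here is a tiny toy sketch (entirely my own example, not from the post) that estimates $p(x)$ by plain Monte Carlo over prior samples, using a one-dimensional latent variable and a hand-picked Gaussian likelihood:

```python
import numpy as np

rng = np.random.default_rng(0)

def likelihood(x, z):
    # toy decoder: p(x|z) = N(x; 2*z, 0.5^2), an arbitrary choice for illustration
    return np.exp(-0.5 * ((x - 2.0 * z) / 0.5) ** 2) / (0.5 * np.sqrt(2 * np.pi))

def marginal_mc(x, n_samples=100_000):
    # p(x) = E_{z ~ p(z)}[p(x|z)] with a standard normal prior p(z)
    z = rng.standard_normal(n_samples)
    return likelihood(x, z).mean()

print(marginal_mc(1.0))  # crude Monte Carlo estimate of p(x = 1.0)
```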
However, this integral is typically high dimensional and very difficult to compute, because we need enough samples from the prior $p(z)$ to approximate it; when the latent dimension is high, the computation becomes intractable since the prior space is too large. If we knew the posterior distribution $p(z \mid x)$, which concentrates on the latent region relevant to a given $x$ and thus shrinks the search space, the integral would be much easier to estimate. By Bayes' rule, the posterior distribution is

$$p(z \mid x) = \frac{p_\theta(x \mid z)\, p(z)}{p_\theta(x)} = \frac{p_\theta(x \mid z)\, p(z)}{\int p_\theta(x \mid z')\, p(z')\, dz'}.$$
Computing the denominator again requires a lot of samples from the prior. The solution is to approximate the posterior with a parameterized distribution $q_\phi(z \mid x)$ (the encoder) and to sample directly from it when evaluating the likelihood. The log-likelihood can be decomposed as

$$\log p_\theta(x) = \underbrace{\mathbb{E}_{q_\phi(z \mid x)}\!\left[\log \frac{p_\theta(x, z)}{q_\phi(z \mid x)}\right]}_{\text{ELBO}} + \mathrm{KL}\!\left(q_\phi(z \mid x)\,\|\, p(z \mid x)\right).$$
Since the KL term is non-negative, the ELBO is a lower bound on $\log p_\theta(x)$, so we can simply maximize the ELBO to obtain the optimal parameters. The ELBO can be further simplified as

$$\mathrm{ELBO}(\theta, \phi; x) = -\,\mathrm{KL}\!\left(q_\phi(z \mid x)\,\|\, p(z)\right) + \mathbb{E}_{q_\phi(z \mid x)}\!\left[\log p_\theta(x \mid z)\right].$$
The next question is how to compute the loss and its gradient with respect to the parameters. Note that the KL divergence between two Gaussian distributions $\mathcal{N}(\mu_1, \Sigma_1)$ and $\mathcal{N}(\mu_2, \Sigma_2)$ in $\mathbb{R}^{d}$ can be calculated analytically as

$$\mathrm{KL}\!\left(\mathcal{N}(\mu_1, \Sigma_1)\,\|\,\mathcal{N}(\mu_2, \Sigma_2)\right) = \frac{1}{2}\left[\operatorname{tr}\!\left(\Sigma_2^{-1}\Sigma_1\right) + (\mu_2 - \mu_1)^{\top}\Sigma_2^{-1}(\mu_2 - \mu_1) - d + \log\frac{\det \Sigma_2}{\det \Sigma_1}\right].$$
Then, by taking a standard normal prior $p(z) = \mathcal{N}(0, I)$ and assuming the encoder is also a Gaussian distribution $q_\phi(z \mid x) = \mathcal{N}\!\left(\mu, \operatorname{diag}(\sigma^2)\right)$ with mean $\mu$ and diagonal covariance $\operatorname{diag}(\sigma^2)$, the first term of the ELBO can be calculated as

$$\mathrm{KL}\!\left(q_\phi(z \mid x)\,\|\, p(z)\right) = \frac{1}{2}\sum_{j=1}^{d}\left(\mu_j^{2} + \sigma_j^{2} - \log \sigma_j^{2} - 1\right),$$
where $d$ is the dimension of the latent variable (a quick numerical check of this closed form is sketched below). To calculate the second term, we need to sample from $q_\phi(z \mid x)$; this operation relies on the trick described in the next section.
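The closed-form KL is easy to verify numerically. A small sketch (assuming PyTorch, with made-up `mu` and `logvar` tensors) comparing it against `torch.distributions`:

```python
import torch
from torch.distributions import Normal, kl_divergence

mu = torch.tensor([0.3, -1.2, 0.7])        # hypothetical encoder mean
logvar = torch.tensor([0.1, -0.5, 0.4])    # hypothetical encoder log sigma^2

# analytic KL( N(mu, diag(sigma^2)) || N(0, I) ), summed over latent dimensions
kl_closed_form = 0.5 * torch.sum(mu**2 + logvar.exp() - logvar - 1.0)

# same quantity via torch.distributions (diagonal Gaussian = product of independent Normals)
q = Normal(mu, (0.5 * logvar).exp())       # std = exp(logvar / 2)
p = Normal(torch.zeros(3), torch.ones(3))
kl_reference = kl_divergence(q, p).sum()

print(kl_closed_form.item(), kl_reference.item())  # the two values should match
```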
Reparameterization trick
To approximate the expectation and still compute the gradient with respect to the network parameters, we need the reparameterization trick to generate samples from $q_\phi(z \mid x)$, that is,

$$z = \mu + \sigma \odot \epsilon, \qquad \epsilon \sim \mathcal{N}(0, I).$$
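In code, the trick is essentially a one-liner. A minimal sketch, assuming `mu` and `logvar` come from the encoder (the names are my own):

```python
import torch

def reparameterize(mu, logvar):
    """Sample z ~ N(mu, diag(exp(logvar))) in a differentiable way."""
    std = torch.exp(0.5 * logvar)      # sigma = exp(logvar / 2)
    eps = torch.randn_like(std)        # eps ~ N(0, I); no gradient flows through it
    return mu + eps * std              # z = mu + sigma * eps
```

Because all the randomness lives in $\epsilon$, the sample $z$ is a deterministic, differentiable function of $\mu$ and $\sigma$, so gradients can flow back into the encoder.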
By drawing $L$ samples $\epsilon^{(l)} \sim \mathcal{N}(0, I)$, the second term of the ELBO can be calculated using a Monte Carlo approximation as

$$\mathbb{E}_{q_\phi(z \mid x)}\!\left[\log p_\theta(x \mid z)\right] \approx \frac{1}{L}\sum_{l=1}^{L}\log p_\theta\!\left(x \mid z^{(l)}\right), \qquad z^{(l)} = \mu + \sigma \odot \epsilon^{(l)}.$$
If we also assume that the decoder $p_\theta(x \mid z) = \mathcal{N}\!\left(\mu_\theta(z), \sigma_x^{2} I\right)$ is Gaussian, then this part can be calculated as

$$\log p_\theta(x \mid z) = -\frac{1}{2\sigma_x^{2}}\left\|x - \mu_\theta(z)\right\|^{2} - \frac{D}{2}\log\!\left(2\pi\sigma_x^{2}\right),$$

where $D$ is the data dimension.
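With a Gaussian decoder, the reconstruction term is just a scaled squared error plus a constant. A sketch of the Monte Carlo estimate with $L$ samples, reusing the hypothetical `reparameterize` helper above (the decoder here is any network returning the mean of $p(x \mid z)$):

```python
import math
import torch

def reconstruction_term(x, mu_z, logvar_z, decoder, sigma2=1.0, L=1):
    """Monte Carlo estimate of E_{q(z|x)}[log p(x|z)] for a Gaussian decoder."""
    D = x.shape[-1]
    total = 0.0
    for _ in range(L):
        z = reparameterize(mu_z, logvar_z)            # one sample z^(l)
        x_hat = decoder(z)                            # mean of p(x|z)
        log_px = (-0.5 / sigma2) * ((x - x_hat) ** 2).sum(dim=-1) \
                 - 0.5 * D * math.log(2 * math.pi * sigma2)
        total = total + log_px
    return total / L
```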
Loss function
By combining the derivations in the last section, we can write the loss function (the negative ELBO, to be minimized) as

$$\mathcal{L}(\theta, \phi; x) = -\frac{1}{L}\sum_{l=1}^{L}\log p_\theta\!\left(x \mid z^{(l)}\right) + \frac{1}{2}\sum_{j=1}^{d}\left(\mu_j^{2} + \sigma_j^{2} - \log \sigma_j^{2} - 1\right).$$
Typically, $L = 1$ is enough to approximate the expectation, so we have the simplified version

$$\mathcal{L}(\theta, \phi; x) = -\log p_\theta(x \mid z) + \frac{1}{2}\sum_{j=1}^{d}\left(\mu_j^{2} + \sigma_j^{2} - \log \sigma_j^{2} - 1\right), \qquad z = \mu + \sigma \odot \epsilon.$$
To simplify the loss function further, if we assume a unit decoder variance $\sigma_x^{2} = 1$ and drop the constant term, then the loss function can be written as

$$\mathcal{L}(\theta, \phi; x) = \frac{1}{2}\left\|x - \mu_\theta(z)\right\|^{2} + \frac{1}{2}\sum_{j=1}^{d}\left(\mu_j^{2} + \sigma_j^{2} - \log \sigma_j^{2} - 1\right),$$

i.e., a squared-error reconstruction term plus a KL regularization term.
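Putting everything together, here is a minimal sketch of the final loss ($L = 1$, unit decoder variance) in PyTorch. The `encoder` and `decoder` are assumed to be networks returning `(mu, logvar)` and the reconstruction mean respectively; all names are illustrative, not a fixed recipe.

```python
import torch

def vae_loss(x, encoder, decoder):
    """Negative ELBO with L = 1 and p(x|z) = N(x; decoder(z), I): MSE + KL."""
    mu, logvar = encoder(x)                               # parameters of q(z|x)
    std = torch.exp(0.5 * logvar)
    z = mu + std * torch.randn_like(std)                  # reparameterized sample
    x_hat = decoder(z)                                    # mean of p(x|z)

    recon = 0.5 * ((x - x_hat) ** 2).sum(dim=-1)          # reconstruction term
    kl = 0.5 * (mu**2 + logvar.exp() - logvar - 1.0).sum(dim=-1)  # KL term
    return (recon + kl).mean()                            # average over the batch
```

In practice, many implementations replace the Gaussian reconstruction term with a binary cross-entropy when the data is modeled as Bernoulli (e.g., binarized images), while the KL term stays exactly the same.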