Since ability to generate “realistic-looking” data may be a step towards understanding its structure and exploiting it, generative models are an important component of unsupervised learning, which has been a frequent theme on this blog. Today’s post is about Generative Adversarial Networks (GANs), introduced in 2014 by Goodfellow et al., which have quickly become very popular way to train generative models for complicated real-life data. It involves a game-theoretic tussle between a generator player and a discriminator player, which is very attractive and may be useful in other settings.
This post describes GANs and raises some open questions about them. The next post will describe our recent paper addressing these questions.
A generative model $G$ can be seen as taking a random seed $h$ (say, a sample from a multivariate Normal distribution) and converting it into an output string $G(h)$ that “looks” like a real datapoint. Such models are popular in classical statistics but the simpler ones like Gaussian Mixtures or Dirichlet Processes seem insufficient for modeling complicated distributions on natural images or natural language. Generative models are also popular in statistical physics, e.g., Ising models and their cousins. These physics models migrated into machine learning and neuroscience in the 1980s and 1990s, which led to a new generative view of neural nets (e.g., Hinton’s Restricted Boltzmann Machines) which in turn led to multilayer generative models such as stacked denoising autoencoders and variational autoencoders. At their heart, these are nothing but multilayer neural nets that transform the random seed into an output that looks like a realistic image. The primary differences in the model concern details of training. Here is the obligatory set of generated images (source: OpenAI blog)
GANs: The basic framework
GANs also train a deep net $G$ to produce realistic images, but the new and beautiful twist lies in a novel training procedure.
To understand the new twist let’s first discuss what it could mean for the output to “look” realistic. A classic evaluation for generative models is perplexity: a measure of the amount of probability it gives to actual images. This requires that the generative model must be accompanied by an algorithm that computes the probability density function for the generated distribution (i.e., given any image, it must output an estimate of the probability that the model outputs this image.) I might do a future blog post discussing pros and cons of the perplexity measure, but today let’s instead dive straight to GANs, which sidestep the need for perplexity computations.
Idea 1: Since deep nets are good at recognizing images —e.g., distinguishing pictures of people from pictures of cats—why not let a deep net be the judge of the outputs of a generative model?
More concretely, let $P_{real}$ be the distribution over real images, and $P_{synth}$ the one output by the model (i.e., the distribution of $G(h)$ when $h$ is a random seed). We could try to train a discriminator deep net $D$ that maps images to numbers in $[0,1]$ and tries to discriminate between these distributions in the following sense. Its expected output $E_{x}[D(x)]$ as high as possible when $x$ is drawn from $P_{real}$ and as low as possible when $x$ is drawn from $P_{synth}$. This training can be done with the usual backpropagation. If the two distributions are identical then of course no such deep net can exist, and so the training will end in failure. If on the other hand we are able to train a good discriminator deep net —one whose average output is noticeably different between real and synthetic samples— then this is proof positive that the two distributions are different. (There is an in-between case, whereby the distributions are different but the discriminator net doesn’t detect a difference. This is going to be important in the story in the next post.) A natural next question is whether the ability to train such a discriminator deep net can help us improve the generative model.
Idea 2: If a good discriminator net has been trained, use it to provide “gradient feedback” that improves the generative model.
Let $G$ denote the Generator net, which means that samples in $P_{synth}$ are obtained by sampling a uniform gaussian seed $h$ and computing $G(h)$. The natural goal for the generator is to make $E_{h}[D(G(h))]$ as high as possible, because that means it is doing better at fooling the discriminator $D$. So if we fix $D$ the natural way to improve $G$ is to pick a few random seeds $h$, and slightly adjust the trainable parameters of $G$ to increase this objective. Note that this gradient computation involves backpropagation through the composed net $D(G(\cdot))$).
Of course, if we let the generator improve itself, it also makes sense to then let the discriminator improve itself too, Which leads to:
Idea 3: Turn the training of the generative model into a game of many moves or alternations.
Each move for the discriminator consists of taking a few samples from $P_{real}$ and $P_{synth}$ and improving its ability to discriminate between them. Each move for the generator consists of producing a few samples from $P_{synth}$ and updating its parameters so that $E_{u}[D(G(h))]$ goes up a bit.
Notice, the discriminator always uses the generator as a black box —i.e., never examines its internal parameters —whereas the generator needs the discriminator’s parameters to compute its gradient direction. Also, the generator does not ever use real images from $P_{real}$ for its computation. (Though of course it does rely on the real images indirectly since the discriminator is trained using them.)
GANS: More details
One can fill in the above framework in multiple ways. The most obvious is that the generator could try to maximize $E_{u}[f(D(G(h)))]$ where $f$ is some increasing function. (We call this the measuring function.) This has the effect of giving different importance to different samples. Goodfellow et al. originally used $f(x)=\log (x)$, which, since the derivative of $\log x$ is $1/x$, implicitly gives much more importance to synthetic data $G(u)$ where the discriminator outputs very low values $D(G(h))$. In other words, using $f(x) =\log x$ makes the training more sensitive to instances which the discriminator finds terrible than to instances which the discriminator finds so-so. By contrast, the above sketch implicitly used $f(x) =x$, which gives the same importance to all samples and appears in the recent Wasserstein GAN.
The discussion thus leads to the following mathematical formulation, where $D, G$ are deep nets with specified architecture and whose number of parameters is fixed in advance by the algorithm designer.
\[\min_{G} \max_{D}~~E_{x\sim P_{real}}[f(D(x))] + E_{h}[f(1-D(G(h)))]. \qquad (1)\]There is now a big industry of improving this basic framework using various architectures and training variations, e.g. (a random sample; possibly missing some important ones): DC-GAN, S-GAN, SR-GAN, INFO-GAN, etc.
Usually, the training is continued until the generator wins, meaning the discriminator’s expected output on samples from $P_{real}$ and $P_{synth}$ becomes the same. But a serious practical difficulty is that training in practice is oscillatory, and the above objective is observed to go up and down. This is unlike usual deep net training, where training (at least in cases where it works) steadily improves the objective.
GANS: Some open questions
(a) Does an equilibrium exist?
Since GAN is a 2-person game, the oscillatory behavior mentioned above is not unexpected. Just as a necessary condition for gradient descent to come to a stop is that the current point is a stationary point (ie gradient is zero), the corresponding situation in a 2-person game is an equilibrium: each player’s move happens to be its optimal response to the other’s move. In other words, switching the order of $\min$ and $\max$ in expression (1) doesn’t change the objective. The GAN formulation above needs a so-called pure equilibrium, which may not exist in general. A simple example is the classic rock/paper/scissors game. Regardless of whether one player plays rock, paper or scissor as a move, the other can counter with a move that beats it. Thus no pure equilibrium exists.
(b) Does an equilibrium exist where the generator wins, i.e. discriminator ends up unable to distinguish the two distributions on finite samples?
(c) Suppose the generator wins. What does this say about whether or not $P_{real}$ is close to $P_{synth}$ ?
Question (c) has dogged GANs research from the start. Has the generative model actually learned something meaningful about real life images, or is it somehow memorizing existing images and presenting trivial modifications? (Recall that $G$ is never exposed directly to real images, so any “memorizing” has to be happen via the gradient propagated through the discriminator.)
If generator’s win does indeed say that $P_{real}$ and $P_{synth}$ are close then we think of the GANs training as generalizing. (This by analogy to the usual notion of generalization for supervised learning.)
In fact, the next post will show that this issue is indeed more subtle than hitherto recognized. But to complete the backstory I will summarize how this issue has been studied so far.
Past efforts at understanding generalization
The original paper of Goodfellow et al. introduced an analysis of generalization —adopted since by other researchers— that works when deep nets are trained “sufficiently high capacity, samples and training time” (to use their phrasing).
For the original objective function with $f(x) =\log x$ if the optimal discriminator is allowed to be any function all (i.e., not just one computable by a finite capacity neural net) it can be checked that the optimal choice is $D(x) = P_{real}(x)/(P_{real}(x)+P_{synth}(x))$. Substituting this in the GANs objective, up to linear transformation the maximum value achieved by discriminator turns out to be equivalent to the Jensen-Shannon (JS) divergence between $P_{real}$ and $P_{synth}$. Hence if a generator wins the game against this ideal discriminator on a very large number of samples, then $P_{real}$ and $P_{synth}$ are close in JS divergence, and thus the model has learnt the true distribution.
A similar analysis for Wasserstein GANs shows that if the generator wins using the Wasserstein objective (i.e., $f(x) =x$) then the two distributions are close in Wasserstein or earth-mover distance.
But we will see in the next post that these analyses can be misleading because in practice, deep nets have (very) finite capacity and sample size. Thus even if training produces the optimal discriminator, the above analyses can be very off.
Further resources
OpenAI has a brief survey of recent approaches to generative models. The inFERENCe blog has many articles on GANs.
Goodfellow’s survey is the most authoritative account of this burgeoning field, and gives tons of insight. The text around Figure 22 discusses oscillation and lack of equilibria. He also discusses how GANs trained on a broad spectrum of images seem to get confused and output images that are realistic at the micro level but nonsensical overall; e.g., an animal with a leg coming out of its head. Clearly this field, despite its promise, has many open questions!