Off the convex path: Algorithms off the convex path.
http://offconvex.github.io/
<h1 id="limitations-of-encoder-decoder-gan-architectures">Limitations of Encoder-Decoder GAN architectures</h1>
<p>This is yet another post about <a href="http://www.offconvex.org/2017/03/15/GANs/">Generative Adversarial Nets (GANs)</a>, and is based upon our new <a href="https://openreview.net/forum?id=BJehNfW0-">ICLR’18 paper</a> with Yi Zhang. A quick recap of the story so far: GANs are an unsupervised method in deep learning for learning interesting distributions (e.g., images of human faces), and they also have a plethora of uses for image-to-image mappings in computer vision. Standard GAN training is motivated by this task of distribution learning, and is designed with the idea that given large enough deep nets, enough training examples, and accurate optimization, GANs will learn the full distribution.</p>
<p><a href="http://www.offconvex.org/2017/03/30/GANs2/">Sanjeev’s previous post</a> concerned <a href="https://arxiv.org/abs/1703.00573">his co-authored ICML’17 paper</a> which called this intuition into question when the deep nets have finite capacity. It shows that the training objective has near-equilibria where the discriminator is fooled —i.e., training objective is good—but the generator’s distributions has very small support, i.e. shows <em>mode collapse.</em> This is a failure of the model, and raises the question whether such bad equilibria are found in real-life training. A <a href="http://www.offconvex.org/2017/07/07/GANs3/">second post</a> showed empirical evidence that they do, using the birthday-paradox test.</p>
<p>The current post concerns our <a href="https://arxiv.org/abs/1711.02651">new result</a> (part of our upcoming <a href="https://openreview.net/forum?id=BJehNfW0-">ICLR paper</a>) which shows that bad equilibria exist also in more recent GAN architectures based on simultaneously learning an <em>encoder</em> and <em>decoder</em>. This should be surprising because many researchers believe that encoder-decoder architectures fix many issues with GANs, including mode collapse.</p>
<p>As we will see, encoder-decoder GANs seem very powerful. In particular, the proof of the previously mentioned <a href="http://www.offconvex.org/2017/03/30/GANs2/">negative result</a> utterly breaks down for this architecture. But, we then discovered a cute argument that shows encoder-decoder GANs can have poor solutions, featuring not only mode collapse but also encoders that map images to nonsense (more precisely Gaussian noise). This is the worst possible failure of the model one could imagine.</p>
<h2 id="encoder-decoder-architectures">Encoder-decoder architectures</h2>
<p>Encoders and decoders have long been around in machine learning in various forms – especially in deep learning. Speaking loosely, underlying all of them are two basic assumptions: <br />
(1) Some form of the so-called <a href="https://mitpress.mit.edu/sites/default/files/titles/content/9780262033589_sch_0001.pdf"><em>manifold assumption</em></a>, which asserts that high-dimensional data such as real-life images lie (roughly) on a low-dimensional manifold. (“Manifold” should be interpreted rather informally – sometimes this intuition applies only very approximately, sometimes it’s meant in a “distributional” sense, etc.) <br />
(2) The low-dimensional structure is “meaningful”: if we think of an image $x$ as a high-dimensional vector and its “code” $z$ as its coordinates on the low-dimensional manifold, the code $z$ is thought of as a “high-level” descriptor of the image.</p>
<p>With the above two points in mind, an <em>encoder</em> maps the image to its code, and a <em>decoder</em> computes the reverse map. (We also discussed encoders and decoders in <a href="http://www.offconvex.org/2017/06/27/unsupervised1/">our earlier post on representation learning</a> in a more general setup.)</p>
<p style="text-align:center;">
<img src="/assets/BIGAN_manifold2.jpg" width="80%" alt="Manifold structure" />
</p>
<h2 id="encoder-decoder-gans">Encoder-Decoder GANs</h2>
<p>These were introduced by <a href="https://arxiv.org/abs/1606.00704">Dumoulin et al. (ALI)</a> and <a href="https://arxiv.org/abs/1605.09782">Donahue et al. (BiGAN)</a>. They involve two competitors: Player 1 trains a discriminator net $D$ that is given an input of the form (image, code) and outputs a number in the interval $[0,1]$, which denotes its “satisfaction level” with this input. Player 2 trains a decoder net $G$ (also called the <em>generator</em> in the GAN setting) and an encoder net $E$.</p>
<p>Recall that in the standard GAN, the discriminator tries to distinguish real images from images generated by the generator $G$. Here the discriminator’s input is an image together with its code. Specifically, Player 1 is trying to train its net to distinguish between the following two settings, and Player 2 is trying to make sure the two settings look indistinguishable to Player 1’s net.</p>
<p><script type="math/tex">\mbox{Setting 1: presented with}~(x, E(x))~\mbox{where $x$ is random real image}.</script>
<script type="math/tex">\mbox{Setting 2: presented with}~(G(z), z)~\mbox{where $z$ is random code}.</script></p>
<p>(Here it is assumed that a random code is a vector with i.i.d. Gaussian coordinates, though one could consider other distributions.)</p>
<p style="text-align:center;">
<img src="/assets/BIGAN_2settings_v2.jpg" width="80%" alt="Two settings which discriminator net has to distinguish between" />
</p>
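<p>To make the two settings concrete, here is a minimal numpy sketch of the quantity the two players compete over. The linear maps and the logistic “discriminator” below are toy stand-ins chosen for brevity, not the convolutional nets used in ALI/BiGAN:</p>

```python
import numpy as np

rng = np.random.default_rng(0)
d_img, d_code = 64, 8                      # toy image / code dimensions

# Toy stand-ins for the three nets: a linear "generator" G, a linear
# "encoder" E, and a logistic "discriminator" D on (image, code) pairs.
A = rng.normal(size=(d_img, d_code))       # G(z) = A z
B = rng.normal(size=(d_code, d_img))       # E(x) = B x
w = rng.normal(size=d_img + d_code)        # D(x, z) = sigmoid(w . [x, z])

def G(z): return A @ z
def E(x): return B @ x
def D(x, z): return 1.0 / (1.0 + np.exp(-w @ np.concatenate([x, z])))

x = rng.normal(size=d_img)                 # a "real" image (random stand-in)
z = rng.normal(size=d_code)                # a random Gaussian code
print("Setting 1, D(x, E(x)):", D(x, E(x)))   # Player 1 wants this high...
print("Setting 2, D(G(z), z):", D(G(z), z))   # ...and this low; Player 2 wants them to match
```

<p>Training alternately adjusts $D$ to separate the two settings, and $(G, E)$ to make them look indistinguishable to $D$.</p>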
<p>The hoped-for equilibrium obviously is one where generator and encoder are inverses of each other: $E(G(z)) \approx z$ and $G(E(x)) \approx x$, and the joint distributions $(z,G(z))$ and $(E(x), x)$ roughly match.
The underlying intuition is that if this happens, Player 2 must have produced a “meaningful” representation $E(x)$ for the images – and this should improve the quality of the generator as well.
Indeed, <a href="https://arxiv.org/abs/1606.00704">Dumoulin et al.(ALI)</a> provide some small-scale empirical examples on mixtures of Gaussians for which encoder-decoder architectures seem to ameliorate the problem of mode collapse.</p>
<p>The above papers prove that when the encoder/decoder/discriminator have infinite capacity, the desired solution is indeed an equilibrium. However, we’ll see that things are very different when capacities are finite.</p>
<h2 id="finite-capacity-discriminators-are-weak">Finite-capacity discriminators are weak</h2>
<p>Say a generator/encoder pair $(G,E)$ $\epsilon$-<em>fools</em> a discriminator $D$ if</p>
<script type="math/tex; mode=display">|E_{x} D(x, E(x)) - E_{z} D(G(z), z)| \leq \epsilon</script>
<p>In other words, $D$ has roughly similar output in Settings 1 and 2.</p>
<p>Our theorem applies when the distribution consists of realistic images, as explained later. We show the following:</p>
<blockquote>
<p>(Informal theorem) If the discriminator $D$ has capacity (i.e. number of parameters) at most $p$, then there is an encoder $E$ of capacity $\ll p$ and generator $G$ of slightly larger capacity than $p$ such that $(G, E)$ can $\epsilon$-fool every such $D$. Furthermore, the generator exhibits mode collapse: its distribution is essentially supported on a bit more than $p$ images, and the encoder $E$ just outputs white noise (i.e. does not extract any “meaningful” representation) given an image.</p>
</blockquote>
<p>(Note that such a $(G, E)$ represents an $\epsilon$-approximate equilibrium, in the sense that Player 1 cannot gain more than $\epsilon$ in the distinguishing probability by switching its discriminator.)</p>
<p>It is important that the encoder’s capacity is much less than $p$: the theorem thus allows a discriminator that is able to simulate $E$ if needed, and in particular to verify for a random seed $z$ that $E(G(z)) \approx z$. The theorem says that even the ability to conduct such a verification cannot give it the power to force the encoder to produce meaningful codes. This is a counterintuitive aspect of the result. The main difficulty in the proof (which stumped us for a bit) was how to exhibit such an equilibrium where $E$ is a small net.</p>
<p>This is ensured by a simple assumption. We assume the image distribution is mildly “noised”: say, every 100th pixel is replaced by Gaussian noise. To a human, such an image would of course be indistinguishable from a real image. (NB: Our proof could be carried out via some other assumptions to the effect that images have an innate stochastic/noise component that is efficiently extractable by a small neural network. But let’s keep things clean.) When noise $\eta$ is thus added to an image $x$, we denote the resulting image as $x \odot \eta$.</p>
<p>Now the encoder will be rather trivial: given the noised image $x \odot \eta$, output $\eta$. Clearly, such an encoder does not in any sense capture “meaning” in the image. It is also implementable by a tiny single-layer net, as required by the theorem.</p>
<h3 id="construction-of-generator">Construction of generator</h3>
<p>As usual in the GAN literature, we will assume the discriminator is $L$-<a href="https://www.encyclopediaofmath.org/index.php/Lipschitz_constant">Lipschitz</a>. This can be a loose upper bound, since only $\log L$ enters quantitatively in the proof.</p>
<p>The generator $G(z)$ in the theorem statement memorizes a hash function that partitions the set of all seeds/codes $z$ into $m$ equal-sized blocks; it also memorizes a “pool” of $m := p \log^2(pL)/ \epsilon^2$ unnoised images $\tilde{x}_1, \tilde{x}_2, \dots, \tilde{x}_m$. When presented with a random seed $z$, the generator computes the block of the partition that $z$ lies in, and then produces the image $\tilde{x}_i \odot z$, where $i$ is the block $z$ belongs to. (See the Figure below.)</p>
<p style="text-align:center;">
<img src="/assets/BIGAN_construction_2.jpg" width="50%" alt="The bad generator construction" />
</p>
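<p>Schematically, the construction looks as follows. This is a toy numpy rendering: the pool size, the hash used as the partition, and the “every 100th pixel” noising pattern are stand-ins for the precise choices in the paper:</p>

```python
import numpy as np

rng = np.random.default_rng(0)
d = 1000                           # image dimension (toy)
noise_idx = np.arange(0, d, 100)   # "every 100th pixel" carries the noise
m = 50                             # pool size; in the proof m ~ p log^2(pL) / eps^2

pool = rng.uniform(size=(m, d))    # memorized pool of unnoised images x~_1 ... x~_m

def noise_image(x, eta):
    """x ⊙ η: plant the noise vector η in every 100th pixel of x."""
    out = x.copy()
    out[noise_idx] = eta
    return out

def encoder(noised_x):
    """The trivial encoder: just read the planted noise back out."""
    return noised_x[noise_idx]

def generator(z):
    """Map the code z to a block i of the partition, output pool image i noised with z."""
    i = hash(z.tobytes()) % m      # stand-in for the memorized partition of codes
    return noise_image(pool[i], z)

z = rng.normal(size=len(noise_idx))
print(np.allclose(encoder(generator(z)), z))   # True: E(G(z)) = z by construction
```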
<p>Now we have to prove that such a memorizing generator exists that $\epsilon$-fools all discriminators of capacity $p$. This is shown by the <a href="https://en.wikipedia.org/wiki/Probabilistic_method">probabilistic method</a>: we describe a distribution over generators $G$ that works “in expectation”, and subsequently use concentration bounds to prove there exists at least one generator that does the job.</p>
<p>The distribution on $G$’s is straightforward: we select the pool of (unnoised) images
$\tilde{x}_1, \tilde{x}_2, \dots, \tilde{x}_m$ at random. Why is this distribution for $G$ sensible? Notice the following simple fact:</p>
<script type="math/tex; mode=display">E_{G} E_{z} D(G(z), z) = E_{\tilde{x}, z} D(\tilde{x} \odot z, z) = E_{x} D(x, E(x)) \hspace{2cm} (3)</script>
<p>In other words, the “expected” generator correctly matches the expectation of $D(x, E(x))$, so that the discriminator is fooled “in expectation”.
This of course is not enough: we need some kind of concentration argument to show a particular $G$ works against <em>all possible discriminators</em>, which will ultimately use the fact that the discriminator $D$ has a small capacity and small Lipschitz constant. (Think covering number arguments in learning theory.)</p>
<p>Towards that, another useful observation: if $q$ is the distribution over sets $T= \{z_1, z_2,\dots, z_m\}$ in which each $z_i$ is independently sampled from the conditional distribution inside the $i$-th block of the partition of the noise space, then by the law of total expectation one can see that
<script type="math/tex">E_{z} D(G(z), z) = E_{T \sim q} \frac{1}{m} \sum_{i=1}^m D(G(z_i), z_i)</script>
The right-hand side is an average of terms, each of which is a bounded function of mutually independent random variables (the pooled images $\tilde{x}_i$ and the seeds $z_i$) – so, by e.g. McDiarmid’s inequality, it concentrates around its expectation, which by (3) is exactly $E_{x} D(x, E(x))$.</p>
<p>To finish the argument off, we use the fact that due to Lipschitzness and the bound on the number of parameters, the “effective” number of distinct discriminators is small, so we can union bound over them. (Formally, this translates to an epsilon-net + union bound argument. This also gives rise to the value of $m$ used in the construction.)</p>
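<p>Very roughly, and eliding the precise constants and exponents in the paper, the counting goes as follows: McDiarmid gives a deviation probability of about $e^{-2\epsilon^2 m}$ for any one discriminator, while an $\epsilon$-net over all capacity-$p$, $L$-Lipschitz discriminators has size roughly $(pL/\epsilon)^{O(p)}$, so the union bound succeeds once</p>

<script type="math/tex; mode=display">e^{-2\epsilon^2 m}\cdot \left(\frac{pL}{\epsilon}\right)^{O(p)} < 1, \qquad \mbox{i.e.}\qquad m \gtrsim \frac{p\,\log(pL/\epsilon)}{\epsilon^2},</script>

<p>which matches the pool size $m = p \log^2(pL)/\epsilon^2$ used in the construction up to logarithmic factors.</p>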
<h2 id="takeaway">Takeaway</h2>
<p>The result should be interpreted as saying that the theoretical foundations of GANs possibly need more work. The current way of thinking about them as distribution learners may not be the right way to formalize them. Furthermore, one has to take care when transferring notions invented for distribution learning, such as encoders and decoders, into the GAN setting. Finally, there is an empirical question of whether any of the <a href="https://deephunt.in/the-gan-zoo-79597dc8c347">myriad GAN variations</a> can avoid mode collapse.</p>
Mon, 12 Mar 2018 03:00:00 -0700
http://offconvex.github.io/2018/03/12/bigan/
<h1 id="can-increasing-depth-serve-to-accelerate-optimization">Can increasing depth serve to accelerate optimization?</h1>
<p>“How does depth help?” is a fundamental question in the theory of deep learning. Conventional wisdom, backed by theoretical studies (e.g. <a href="http://proceedings.mlr.press/v49/eldan16.pdf">Eldan & Shamir 2016</a>; <a href="http://proceedings.mlr.press/v70/raghu17a/raghu17a.pdf">Raghu et al. 2017</a>; <a href="http://proceedings.mlr.press/v65/lee17a/lee17a.pdf">Lee et al. 2017</a>; <a href="http://proceedings.mlr.press/v49/cohen16.pdf">Cohen et al. 2016</a>; <a href="http://proceedings.mlr.press/v65/daniely17a/daniely17a.pdf">Daniely 2017</a>; <a href="https://openreview.net/pdf?id=B1J_rgWRW">Arora et al. 2018</a>), holds that adding layers increases expressive power. But often this expressive gain comes at a price – optimization is harder for deeper networks (viz., <a href="https://en.wikipedia.org/wiki/Vanishing_gradient_problem">vanishing/exploding gradients</a>). Recent works on “landscape characterization” implicitly adopt this worldview (e.g. <a href="https://papers.nips.cc/paper/6112-deep-learning-without-poor-local-minima.pdf">Kawaguchi 2016</a>; <a href="https://openreview.net/pdf?id=ryxB0Rtxx">Hardt & Ma 2017</a>; <a href="http://proceedings.mlr.press/v38/choromanska15.pdf">Choromanska et al. 2015</a>; <a href="http://openaccess.thecvf.com/content_cvpr_2017/papers/Haeffele_Global_Optimality_in_CVPR_2017_paper.pdf">Haeffele & Vidal 2017</a>; <a href="https://arxiv.org/pdf/1605.08361.pdf">Soudry & Carmon 2016</a>; <a href="https://arxiv.org/pdf/1712.08968.pdf">Safran & Shamir 2017</a>). They prove theorems about local minima and/or saddle points in the objective of a deep network, while implicitly assuming that the ideal landscape would be convex (single global minimum, no other critical point). My <a href="https://arxiv.org/pdf/1802.06509.pdf">new paper</a> with Sanjeev Arora and Elad Hazan makes the counterintuitive suggestion that sometimes, increasing depth can <em>accelerate</em> optimization.</p>
<p>Our work can also be seen as one more piece of evidence for a nascent belief that <em>overparameterization</em> of deep nets may be a good thing. By contrast, classical statistics discourages training a model with more parameters than necessary <a href="https://www.rasch.org/rmt/rmt222b.htm">as this can lead to overfitting</a>.</p>
<h2 id="ell_p-regression">$\ell_p$ Regression</h2>
<p>Let’s begin by considering a very simple learning problem - scalar linear regression with $\ell_p$ loss (our theory and experiments will apply to $p>2$):</p>
<script type="math/tex; mode=display">\min_{\mathbf{w}}~L(\mathbf{w}):=\frac{1}{p}\sum_{(\mathbf{x},y)\in{S}}(\mathbf{x}^\top\mathbf{w}-y)^p</script>
<p>$S$ here stands for a training set, consisting of pairs $(\mathbf{x},y)$ where $\mathbf{x}$ is a vector representing an instance and $y$ is a (numeric) scalar standing for its label; $\mathbf{w}$ is the parameter vector we wish to learn. Let’s convert the linear model to an extremely simple “depth-2 network”, by replacing the vector $\mathbf{w}$ with a vector $\mathbf{w_1}$ times a scalar $\omega_2$. Clearly, this is an overparameterization that does not change expressiveness, but yields the (non-convex) objective:</p>
<script type="math/tex; mode=display">\min_{\mathbf{w_1},\omega_2}~L(\mathbf{w_1},\omega_2):=\frac{1}{p}\sum_{(\mathbf{x},y)\in{S}}(\mathbf{x}^\top\mathbf{w_1}\omega_2-y)^p</script>
<p>We show in the paper that if one applies gradient descent over $\mathbf{w_1}$ and $\omega_2$, with small learning rate and near-zero initialization (as customary in deep learning), the induced dynamics on the overall (<em>end-to-end</em>) model $\mathbf{w}=\mathbf{w_1}\omega_2$ can be written as follows:</p>
<script type="math/tex; mode=display">\mathbf{w}^{(t+1)}\leftarrow\mathbf{w}^{(t)}-\rho^{(t)}\nabla{L}(\mathbf{w}^{(t)})-\sum_{\tau=1}^{t-1}\mu^{(t,\tau)}\nabla{L}(\mathbf{w}^{(\tau)})</script>
<p>where $\rho^{(t)}$ and $\mu^{(t,\tau)}$ are appropriately defined (time-dependent) coefficients.
Thus the seemingly benign addition of a single multiplicative scalar turned plain gradient descent into a scheme that somehow has a memory of past gradients —the key feature of <a href="https://distill.pub/2017/momentum/">momentum</a> methods— as well as a time-varying learning rate. While theoretical analysis of the precise benefit of momentum methods is never easy, a simple experiment with $p=4$, on <a href="https://archive.ics.uci.edu/ml/index.php">UCI Machine Learning Repository</a>’s <a href="https://archive.ics.uci.edu/ml/datasets/gas+sensor+array+drift+dataset">“Gas Sensor Array Drift at Different Concentrations” dataset</a>, shows the following effect:</p>
<p style="text-align:center;">
<img src="/assets/acc_oprm_L4_exp.png" width="40%" alt="L4 regression experiment" />
</p>
<p>Not only did the overparameterization accelerate gradient descent, it did so more than two well-known, explicitly designed acceleration methods – <a href="http://www.jmlr.org/papers/volume12/duchi11a/duchi11a.pdf">AdaGrad</a> and <a href="https://arxiv.org/pdf/1212.5701.pdf">AdaDelta</a> (the former did not really provide a speedup in this experiment). We observed similar speedups in other settings as well.</p>
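<p>For readers who want to poke at this effect, here is a toy synthetic stand-in for the experiment (not the UCI setup above): plain gradient descent on $\mathbf{w}$ versus gradient descent on the overparameterized pair $(\mathbf{w_1},\omega_2)$, both with small step size and near-zero initialization. Whether, and by how much, the depth-2 version converges faster depends on the data and hyperparameters:</p>

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, p = 200, 10, 4
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d)                    # noiseless linear labels

def loss(w):  return ((X @ w - y) ** p).sum() / p
def grad(w):  return X.T @ ((X @ w - y) ** (p - 1))

eta = 1e-4                                    # small enough to keep plain GD stable here
w = np.full(d, 1e-3)                          # plain (depth-1) gradient descent on w
w1, om2 = np.full(d, 1e-3), 1e-3              # "depth-2" overparameterization: w = w1 * om2

for t in range(20000):
    w -= eta * grad(w)
    g = grad(w1 * om2)                        # gradient w.r.t. the end-to-end vector
    # chain rule: dL/dw1 = g * om2,  dL/dom2 = g . w1
    w1, om2 = w1 - eta * g * om2, om2 - eta * (g @ w1)

print("depth-1 loss:", loss(w), "  depth-2 loss:", loss(w1 * om2))
```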
<p>What is happening here? Can non-convex objectives corresponding to deep networks be easier to optimize than convex ones?
Is this phenomenon common or is it limited to toy problems as above?
We take a first crack at addressing these questions…</p>
<h2 id="overparameterization-decoupling-optimization-from-expressiveness">Overparameterization: Decoupling Optimization from Expressiveness</h2>
<p>A general study of the effect of depth on optimization entails an inherent difficulty - deeper networks may seem to converge faster due to their superior expressiveness.
In other words, if optimization of a deep network progresses more rapidly than that of a shallow one, it may not be obvious whether this is a result of a true acceleration phenomenon, or simply a byproduct of the fact that the shallow model cannot reach the same loss as the deep one.
We resolve this conundrum by focusing on models whose representational capacity is oblivious to depth - <em>linear neural networks</em>, the subject of many recent studies.
With linear networks, adding layers does not alter expressiveness; it manifests itself only in the replacement of a matrix parameter by a product of matrices - an overparameterization.
Accordingly, if this leads to accelerated convergence, one can be certain that it is not an outcome of any phenomenon other than favorable properties of depth for optimization.</p>
<h2 id="implicit-dynamics-of-depth">Implicit Dynamics of Depth</h2>
<p>Suppose we are interested in learning a linear model parameterized by a matrix $W$, through minimization of some training loss $L(W)$.
Instead of working directly with $W$, we replace it by a depth $N$ linear neural network, i.e. we overparameterize it as $W=W_{N}W_{N-1}\cdots{W_1}$, with $W_j$ being weight matrices of individual layers.
In the paper we show that if one applies gradient descent over $W_{1}\ldots{W}_N$, with small learning rate $\eta$, and with the condition:</p>
<script type="math/tex; mode=display">W_{j+1}^\top W_{j+1} = W_j W_j^\top</script>
<p>satisfied at optimization commencement (note that this approximately holds with standard near-zero initialization), the dynamics induced on the overall end-to-end mapping $W$ can be written as follows:</p>
<script type="math/tex; mode=display">W^{(t+1)}\leftarrow{W}^{(t)}-\eta\sum_{j=1}^{N}\left[W^{(t)}(W^{(t)})^\top\right]^\frac{j-1}{N}\nabla{L}(W^{(t)})\left[(W^{(t)})^\top{W}^{(t)}\right]^\frac{N-j}{N}</script>
<p>We validate empirically that this analytically derived update rule (over the classic linear model) indeed matches deep network optimization, and we take a series of steps to interpret it theoretically.
We find that the transformation applied to the gradient $\nabla{L}(W)$ (multiplication from the left by $[WW^\top]^\frac{j-1}{N}$, and from the right by $[W^\top{W}]^\frac{N-j}{N}$, followed by summation over $j$) is a particular preconditioning scheme that promotes movement along directions already taken by the optimization.
More concretely, the preconditioning can be seen as a combination of two elements:</p>
<ul>
<li>an adaptive learning rate that increases step sizes away from initialization; and</li>
<li>a “momentum-like” operation that stretches the gradient along the azimuth taken so far.</li>
</ul>
<p>An important point to make is that the update rule above, referred to hereafter as the <em>end-to-end update rule</em>, does not depend on widths of hidden layers in the linear neural network, only on its depth ($N$).
This implies that from an optimization perspective, overparameterizing using wide or narrow networks has the same effect - it is only the number of layers that matters.
Therefore, acceleration by depth need not be computationally demanding - a fact we clearly observe in our experiments (the previous figure, for example, shows acceleration by orders of magnitude at the price of a single extra scalar parameter).</p>
<p style="text-align:center;">
<img src="/assets/acc_oprm_update_rule.png" width="60%" alt="End-to-end update rule" />
</p>
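<p>The end-to-end update rule is easy to implement directly. Here is a numpy transcription (the function names are ours; the fractional matrix powers are computed by eigendecomposition of the symmetric PSD matrices $WW^\top$ and $W^\top W$):</p>

```python
import numpy as np

def matpow(S, a):
    """Fractional power of a symmetric PSD matrix via eigendecomposition."""
    vals, vecs = np.linalg.eigh(S)
    vals = np.clip(vals, 0.0, None)           # guard against tiny negative eigenvalues
    return (vecs * vals ** a) @ vecs.T

def end_to_end_step(W, grad_W, eta, N):
    """One step of the end-to-end update rule induced by a depth-N linear network."""
    left, right = W @ W.T, W.T @ W
    step = np.zeros_like(W)
    for j in range(1, N + 1):
        step += matpow(left, (j - 1) / N) @ grad_W @ matpow(right, (N - j) / N)
    return W - eta * step

# toy usage: one step for a 3x5 end-to-end matrix and an arbitrary gradient
rng = np.random.default_rng(0)
W = 0.01 * rng.normal(size=(3, 5))
W = end_to_end_step(W, rng.normal(size=(3, 5)), eta=0.01, N=4)
```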
<h2 id="beyond-regularization">Beyond Regularization</h2>
<p>The end-to-end update rule defines an optimization scheme whose steps are a function of the gradient $\nabla{L}(W)$ and the parameter $W$.
As opposed to many acceleration methods (e.g. <a href="https://distill.pub/2017/momentum/">momentum</a> or <a href="https://arxiv.org/pdf/1412.6980.pdf">Adam</a>) that explicitly maintain auxiliary variables, this scheme is memoryless, and by definition born from gradient descent over something (an overparameterized objective).
It is therefore natural to ask if we can represent the end-to-end update rule as gradient descent over some regularization of the loss $L(W)$, i.e. over some function of $W$.
We prove, somewhat surprisingly, that the answer is almost always negative - as long as the loss $L(W)$ does not have a critical point at $W=0$, the end-to-end update rule, i.e. the effect of overparameterization, cannot be attained via <em>any</em> regularizer.</p>
<h2 id="acceleration">Acceleration</h2>
<p>So far, we analyzed the effect of depth (in the form of overparameterization) on optimization by presenting an equivalent preconditioning scheme and discussing some of its properties.
We have not, however, provided any theoretical evidence in support of acceleration (faster convergence) resulting from this scheme.
Full characterization of the scenarios in which there is a speedup goes beyond the scope of our paper.
Nonetheless, we do analyze a simple $\ell_p$ regression problem, and find that whether or not increasing depth accelerates depends on the choice of $p$:
for $p=2$ (square loss) adding layers does not lead to a speedup (in accordance with previous findings by <a href="https://arxiv.org/pdf/1312.6120.pdf">Saxe et al. 2014</a>);
for $p>2$ it can, and this may be attributed to the preconditioning scheme’s ability to handle large plateaus in the objective landscape.
A number of experiments, with $p$ equal to 2 and 4, and depths ranging between 1 (classic linear model) and 8, support this conclusion.</p>
<h2 id="non-linear-experiment">Non-Linear Experiment</h2>
<p>As a final test, we evaluated the effect of overparameterization on optimization in a non-idealized (yet simple) deep learning setting - the <a href="https://github.com/tensorflow/models/tree/master/tutorials/image/mnist">convolutional network tutorial for MNIST built into TensorFlow</a>.
We introduced overparameterization by simply placing two matrices in succession instead of the matrix in each dense layer.
With an addition of roughly 15% in number of parameters, optimization accelerated by orders of magnitude:</p>
<p style="text-align:center;">
<img src="/assets/acc_oprm_cnn_exp.png" width="40%" alt="TensorFlow MNIST CNN experiment" />
</p>
<p>We note that similar experiments on other convolutional networks also gave rise to a speedup, but not nearly as prominent as the above.
Empirical characterization of conditions under which overparameterization accelerates optimization in non-linear settings is potentially an interesting direction for future research.</p>
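<p>The change itself is a one-liner in any framework. Below is a PyTorch-style sketch (not the TensorFlow tutorial code itself; the layer sizes are illustrative) of replacing a dense layer by two matrices in succession, with no bias or nonlinearity in between so that expressiveness is unchanged:</p>

```python
import torch.nn as nn

def dense(d_in, d_out, overparameterize=False, d_hidden=None):
    """A dense layer, optionally replaced by two matrices in succession.
    The extra layer has no bias and no nonlinearity, so expressiveness is
    unchanged; only the parameterization (and its dynamics) differs."""
    if not overparameterize:
        return nn.Linear(d_in, d_out)
    d_hidden = d_hidden or d_out
    return nn.Sequential(nn.Linear(d_in, d_hidden, bias=False),
                         nn.Linear(d_hidden, d_out))

# e.g. the dense head of a small MNIST convnet (sizes here are hypothetical):
head = nn.Sequential(nn.Flatten(),
                     dense(64 * 7 * 7, 512, overparameterize=True),
                     nn.ReLU(),
                     dense(512, 10, overparameterize=True))
```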
<h2 id="conclusion">Conclusion</h2>
<p>Our work provides insight into benefits of depth in the form of overparameterization, from the perspective of optimization.
Many open questions and problems remain.
For example, is it possible to rigorously analyze the acceleration effect of the end-to-end update rule (analogously to, say, <a href="http://www.cis.pku.edu.cn/faculty/vision/zlin/1983-A%20Method%20of%20Solving%20a%20Convex%20Programming%20Problem%20with%20Convergence%20Rate%20O(k%5E(-2))_Nesterov.pdf">Nesterov 1983</a> or <a href="http://www.jmlr.org/papers/volume12/duchi11a/duchi11a.pdf">Duchi et al. 2011</a>)?
Treatment of non-linear deep networks is of course also of interest, as well as more extensive empirical evaluation.</p>
<p><a href="http://www.cohennadav.com/">Nadav Cohen</a></p>
Fri, 02 Mar 2018 05:00:00 -0800
http://offconvex.github.io/2018/03/02/acceleration-overparameterization/
<h1 id="proving-generalization-of-deep-nets-via-compression">Proving generalization of deep nets via compression</h1>
<p>This post is about <a href="https://arxiv.org/abs/1802.05296">my new paper with Rong Ge, Behnam Neyshabur, and Yi Zhang</a> which offers some new perspective into the generalization mystery for deep nets discussed in
<a href="http://www.offconvex.org/2017/12/08/generalization1/">my earlier post</a>. The new paper introduces an elementary compression-based framework for proving generalization bounds. It shows that deep nets are highly noise stable, and consequently, compressible. The framework also gives easy proofs (sketched below) of some papers that appeared in the past year.</p>
<p>Recall that the <strong>basic theorem</strong> of generalization theory says something like this: if the training set has $m$ samples then the <em>generalization error</em> —defined as the difference between error on training data and test data (aka held out data)— is of the order of $\sqrt{N/m}$. Here $N$ is the number of <em>effective parameters</em> (or <em>complexity measure</em>) of the net; it is at most the actual number of trainable parameters but could be much less. (For ease of exposition this post will ignore nuisance factors like $\log N$ etc. which also appear in these calculations.) The mystery is that networks with millions of parameters have low generalization error even when $m =50K$ (as in the CIFAR10 dataset), which suggests that the number of true parameters is actually much less than $50K$. The papers <a href="https://arxiv.org/abs/1706.08498">Bartlett et al. NIPS’17</a> and <a href="https://openreview.net/forum?id=Skz_WfbCZ">Neyshabur et al. ICLR’18</a>
try to quantify the complexity measure using very interesting ideas like PAC-Bayes and margin (which influenced our paper). But ultimately the quantitative estimates are fairly vacuous —orders of magnitude <em>more</em> than the number of <em>actual parameters.</em> By contrast, our new estimates are several orders of magnitude better, and on the verge of being interesting. See the following bar graph on a log scale. (All bounds are listed ignoring “nuisance factors.” Number of trainable parameters is included only to indicate scale.)</p>
<p style="text-align:center;">
<img src="/assets/saddle_eff/acompare.png" width="75%" alt="comparison of bounds from various recent papers" />
</p>
<h2 id="the-compression-approach">The Compression Approach</h2>
<p>The compression approach takes a deep net $C$ with $N$ trainable parameters and tries to compress it to another one $\hat{C}$ that (a) has far fewer parameters $\hat{N}$ than $C$, and (b) has roughly the same training error as $C$.</p>
<p>Then the above basic theorem guarantees that so long as the number of training samples exceeds $\hat{N}$, $\hat{C}$ <em>does</em> generalize well (even if $C$ doesn’t). An extension of this approach says that the same conclusion holds if we let the compression algorithm depend upon an arbitrarily long <em>random string</em>, provided this string is fixed in advance of seeing the training data. We call this <em>compression with respect to a fixed string</em> and rely upon it.</p>
<p>Note that the above approach proves good generalization of the compressed $\hat{C}$, not the original $C$. (I suspect the ideas may extend to proving good generalization of the original $C$; the hurdles seem technical rather than inherent.) Something similar was true of earlier approaches using PAC-Bayes bounds, which also prove the generalization of some net related to $C$, not of $C$ itself. (Hence the tongue-in-cheek title of the classic reference <a href="http://www.cs.cmu.edu/~jcl/papers/nn_bound/not_bound.pdf">Langford-Caruana2002</a>.)</p>
<p>Of course, in practice deep nets are well-known to be compressible using a slew of ideas—by factors of 10x to 100x; see <a href="https://arxiv.org/abs/1710.09282">the recent survey</a>. However, usually such compression involves <em>retraining</em> the compressed net. Our paper doesn’t consider retraining the net (since it involves reasoning about the loss landscape) but followup work should look at this.</p>
<h2 id="flat-minima-and-noise-stability">Flat minima and Noise Stability</h2>
<p>Modern generalization results can be seen as proceeding via some formalization of a <em>flat minimum</em> of the loss landscape. This was suggested in the 1990s as the source of good generalization by <a href="http://www.bioinf.jku.at/publications/older/3304.pdf">Hochreiter and Schmidhuber 1995</a>. Recent empirical work of <a href="https://arxiv.org/abs/1609.04836">Keskar et al. 2016</a> on modern deep architectures finds that flatness does correlate with better generalization, though the issue is complicated, as discussed in an upcoming post by Behnam Neyshabur.</p>
<p style="text-align:center;">
<img src="/assets/saddle_eff/aflatminima.png" width="65%" alt="Flat vs sharp minima" />
</p>
<p>Here’s the intuition for why a flat minimum should generalize better, as originally articulated by <a href="http://www.cs.toronto.edu/~fritz/absps/colt93.pdf">Hinton and van Camp 1993</a>. Crudely speaking, suppose a flat minimum is one that occupies “volume” $\tau$ in the landscape. (The flatter the minimum, the higher $\tau$ is.) Then the number of <em>distinct</em> flat minima in the landscape is at most $S =\text{total volume}/\tau$. Thus one can number the flat minima from $1$ to $S$, implying that a flat minimum can be represented using $\log S$ bits. The above-mentioned <em>basic theorem</em> implies that flat minima generalize if the number of training samples $m$ exceeds $\log S$.</p>
<p>PAC-Bayes approaches try to formalize the above intuition by defining a flat minimum as follows: it is a net $C$ such that adding appropriately-scaled Gaussian noise to all its trainable parameters does not greatly affect the training error. This allows quantifying the “volume” above in terms of probability/measure (see
<a href="http://www.cs.princeton.edu/courses/archive/fall17/cos597A/lecnotes/generalize.pdf">my lecture notes</a> or <a href="https://arxiv.org/abs/1703.11008">Dziugaite-Roy</a>) and yields some explicit estimates on sample complexity. However, obtaining good quantitative estimates from this calculation has proved difficult, as seen in the bar graph earlier.</p>
<p>We formalize “flat minimum” using noise stability of a slightly different form. Roughly speaking, it says that if we inject appropriately scaled Gaussian noise at the output of some layer, then this noise gets attenuated as it propagates up to higher layers. (Here the “top” direction refers to the output of the net.) This is obviously related to notions like dropout, though it arises also in nets that are not trained with dropout. The following figure illustrates how noise injected at a certain layer of VGG19 (trained on CIFAR10) affects higher layers. The y-axis denotes the magnitude of the noise ($\ell_2$ norm) as a multiple of the vector being computed at the layer, and shows how a single noise vector quickly attenuates as it propagates up the layers.</p>
<p style="text-align:center;">
<img src="/assets/saddle_eff/attenuate.jpg" width="65%" alt="How noises attenuates as it travels up the layers of VGG." />
</p>
<p>Clearly, computation of the trained net is highly resistant to noise. (This has obvious implications for biological neural nets…)
Note that the training involved no explicit injection of noise (e.g., dropout). Of course, stochastic gradient descent <em>implicitly</em> adds noise to the gradient, and it would be nice to investigate more rigorously whether the noise stability arises from this or from some other source.</p>
<h2 id="noise-stability-and-compressibility-of-single-layer">Noise stability and compressibility of single layer</h2>
<p>To understand why noise-stable nets are compressible, let’s first understand noise stability for a single layer in the net, where we ignore the nonlinearity. Then this layer is just a linear transformation, i.e., matrix $M$.</p>
<p style="text-align:center;">
<img src="/assets/saddle_eff/alinear.png" width="40%" alt="matrix M describing a single layer" />
</p>
<p>What does it mean that this matrix’s output is stable to noise? Suppose the vector at the previous layer is a unit vector $x$. This is the output of the lower layers on an actual sample, so $x$ can be thought of as the “signal” for the current layer. The matrix converts $x$ into $Mx$. If we inject a noise vector $\eta$ of unit norm at $x$ then the output must become $M(x +\eta)$. We say $M$ is noise stable for input $x$ if such noising affects the output very little, which implies the norm of $Mx$ is much higher than that of $M \eta$.
The former is at most $\sigma_{max}(M)$, the largest singular value of $M$. The latter is approximately
$(\sum_i \sigma_i(M)^2)^{1/2}/\sqrt{h}$ where $\sigma_i(M)$ is the $i$th singular value of $M$ and $h$ is dimension of $Mx$. The reason is that gaussian noise divides itself evenly across all directions, with variance in each direction $1/h$.
We conclude that:
<script type="math/tex">(\sigma_{max}(M))^2 \gg \frac{1}{h} \sum_i (\sigma_i(M)^2),</script></p>
<p>which implies that the matrix has an uneven distribution of singular values. The ratio $\sum_i \sigma_i(M)^2/\sigma_{max}(M)^2$ is called the <a href="https://nickhar.wordpress.com/2012/02/29/lecture-15-low-rank-approximation-of-matrices/"><em>stable rank</em></a>; it is at most the linear algebraic rank, and the inequality above says it is much smaller than the dimension $h$. Furthermore, the above analysis suggests that the “signal” $x$ is <em>correlated</em> with the singular directions corresponding to the higher singular values, which is at the root of the noise stability.</p>
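<p>This single-layer picture is easy to check numerically. The following numpy sketch builds a matrix with a few large singular values and a “signal” aligned with the top singular direction (synthetic stand-ins for a trained layer and the activation feeding into it), and compares how strongly the matrix passes the signal versus unit-norm Gaussian noise:</p>

```python
import numpy as np

rng = np.random.default_rng(0)
h = d = 100

# A matrix with a few large singular values, and a unit "signal" x aligned
# with the top singular direction.
U, _ = np.linalg.qr(rng.normal(size=(h, h)))
V, _ = np.linalg.qr(rng.normal(size=(d, d)))
s = np.concatenate([np.full(5, 10.0), np.full(d - 5, 0.3)])
M = (U * s) @ V.T                       # M = U diag(s) V^T
x = V[:, 0]                             # unit signal along the top singular direction
eta = rng.normal(size=d) / np.sqrt(d)   # roughly unit-norm Gaussian noise

print("stable rank:", (s ** 2).sum() / s.max() ** 2, " vs dimension:", d)
print("|M x|   =", np.linalg.norm(M @ x))
print("|M eta| =", np.linalg.norm(M @ eta))
```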
<p>Our experiments on VGG and GoogleNet reveal that the higher layers of deep nets—where most of the net’s parameters reside—do indeed exhibit a highly uneven distribution of singular values, and that the signal aligns more with the higher singular directions. The figure below describes layer 10 in VGG19 trained on CIFAR10.</p>
<p style="text-align:center;">
<img src="/assets/saddle_eff/aspectrumlayer10.png" width="45%" alt="distribution of singular values of matrix at layer 10 of VGG19" />
</p>
<h2 id="compressing-multilayer-net">Compressing multilayer net</h2>
<p>The above analysis of noise stability in terms of singular values cannot hold across multiple layers of a deep net, because the mapping becomes nonlinear, thus lacking a notion of singular values. Noise stability is therefore formalized using the <a href="https://en.wikipedia.org/wiki/Jacobian_matrix_and_determinant">Jacobian</a> of this mapping, which is the matrix describing how the output reacts to tiny perturbations of the input. Noise stability says that this nonlinear mapping passes signal (i.e., the vector from previous layers) much more strongly than it does a noise vector.</p>
<p>Our compression algorithm applies a randomized transformation to the matrix of each layer (aside: note the use of randomness, which fits in our “compressing with fixed string” framework) that relies on the low stable rank condition at each layer. This compression introduces error in the layer’s output, but the vector describing this error is “gaussian-like” due to the use of randomness in the compression. Thus this error gets attenuated by higher layers.</p>
<p>Details can be found in the paper. All noise stability properties formalized there are later checked in the experiments section.</p>
<h2 id="simpler-proofs-of-existing-generalization-bounds">Simpler proofs of existing generalization bounds</h2>
<p>In the paper we also use our compression framework to give elementary (say, 1-page) proofs of the previous generalization bounds from the past year. For example, the paper of <a href="https://openreview.net/forum?id=Skz_WfbCZ">Neyshabur et al.</a> shows the following is an upper bound on the generalization error where $A_i$ is the matrix describing the $i$th layer.</p>
<p style="text-align:center;">
<img src="/assets/saddle_eff/aexpression1.png" width="50%" alt="Expression for effective number of parameters in Neyshabur et al" />
</p>
<p>Comparing to the <em>basic theorem</em>, we realize the numerator corresponds to the number of effective parameters. The second part of the expression is the sum of stable ranks of the layer matrices, and is a natural measure of complexity. The first part is the product of spectral norms (= top singular values) of the layer matrices, which happens to be an upper bound on the Lipschitz constant of the entire network. (The Lipschitz constant of a mapping $f$ in this context is a constant $L$ such that $\|f(x)\| \leq L \cdot \|x\|$.)
The reason this upper bounds the Lipschitz constant is that if an input $x$ is presented at the bottom of the net, then each successive layer can multiply its norm by at most its top singular value, and the ReLU nonlinearity can only decrease the norm since its only action is to zero out some entries.</p>
<p>Having decoded the above expression, it is clear how to interpret it as an analysis of a (deterministic) compression of the net. Compress each layer by zeroing out (in the <a href="https://en.wikipedia.org/wiki/Singular-value_decomposition">SVD</a>) singular values less than some threshold $t\|A\|$ (a $t$ fraction of the top singular value), which we hope turns it into a low rank matrix. (Recall that a matrix with rank $r$ can be expressed using $2nr$ parameters.) A simple computation shows that the number of remaining singular values is at most the stable rank divided by $t^2$. How do we set $t$? The truncation introduces error in the layer’s computation, which gets propagated through the higher layers and magnified at most by the Lipschitz constant. We want to make this propagated error small, which can be done by making $t$ inversely proportional to the Lipschitz constant. This leads to the above bound on the number of effective parameters.</p>
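<p>A minimal sketch of this truncation step (with a synthetic matrix standing in for a layer, and an illustrative choice of $t$) shows the resulting parameter count against the stable-rank bound:</p>

```python
import numpy as np

def truncate_layer(A, t):
    """Zero out singular values below t * (top singular value); return the
    compressed matrix and the number of singular values kept."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    keep = s >= t * s[0]
    return (U[:, keep] * s[keep]) @ Vt[keep], int(keep.sum())

rng = np.random.default_rng(0)
A = rng.normal(size=(256, 256)) * (0.9 ** np.arange(256))   # decaying spectrum
s = np.linalg.svd(A, compute_uv=False)
stable_rank = (s ** 2).sum() / s[0] ** 2
A_hat, kept = truncate_layer(A, t=0.1)
print("kept", kept, "singular values;  stable rank / t^2 =", stable_rank / 0.1 ** 2)
```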
<p>This proof sketch also clarifies how our work improves upon the older works: they are also (implicitly) compressing the deep net, but their analysis of how much compression is possible is much more pessimistic because they assume the network transmits noise at peak efficiency given by the Lipschitz constant.</p>
<h2 id="extending-the-ideas-to-convolutional-nets">Extending the ideas to convolutional nets</h2>
<p>Convolutional nets could not be dealt with cleanly in the earlier papers. I must admit that handling convolution stumped us too for a while. A layer in a convolutional net applies the same filter to all patches in that layer. This <em>weight sharing</em> means that the full layer matrix already has a fairly compact representation, and it seems challenging to compress this further. However, in nets like VGG and GoogleNet, the higher layers use rather large filter matrices (i.e., they use a large number of channels), and one could hope to compress these individual filter matrices.</p>
<p>Let’s discuss the two naive ideas. The first is to compress the filter independently in different patches. This unfortunately is not a compression at all, since each copy of the filter then comes with its own parameters. The second idea is to do a single compression of the filter and use the compressed copy in each patch. This messes up the error analysis because the errors introduced due to compression in the different copies are now correlated, whereas the analysis requires them to be more like gaussian.</p>
<p>The idea we end up using is to compress the filters using $k$-wise independence (an idea from <a href="https://en.wikipedia.org/wiki/K-independent_hashing">theory of hashing schemes</a>), where $k$ is roughly logarithmic in the number of training samples.</p>
<h2 id="concluding-thoughts">Concluding thoughts</h2>
<p>While generalization theory can seem merely academic at times —since in practice held-out data establishes generalization— I hope you see from the above account that understanding generalization can give some interesting insights into what is going on in deep net training. Insights about noise stability of trained deep nets have obvious interest for the study of biological neural nets. (See also the classic <a href="http://fab.cba.mit.edu/classes/862.16/notes/computation/vonNeumann-1956.pdf">von Neumann ideas</a> on noise resilient computation.)</p>
<p>At the same time, I suspect that compressibility is only one part of the generalization mystery, and that we are still missing some big idea. I don’t see how to use the above ideas to demonstrate that the effective number of parameters in VGG19 is as low as $50k$, as seems to be the case. I suspect doing so will force us to understand the structure of the data (in this case, real-life images) which the above analysis mostly ignores. The only property of data used is that the deep net aligns itself better with data than with noise.</p>
Sat, 17 Feb 2018 08:00:00 -0800
http://offconvex.github.io/2018/02/17/generalization2/
<h1 id="generalization-theory-and-deep-nets-an-introduction">Generalization Theory and Deep Nets, An introduction</h1>
<p>Deep learning holds many mysteries for theory, as we have discussed on this blog. Lately many ML theorists have become interested in the generalization mystery: why do trained deep nets perform well on previously unseen data, even though they have way more free parameters than the number of datapoints (the classic “overfitting” regime)? Zhang et al.’s paper <a href="https://arxiv.org/abs/1611.03530">Understanding Deep Learning requires Rethinking Generalization</a> played some role in bringing attention to this challenge. Their main experimental finding is that if you take a classic convnet architecture, say <a href="https://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf">Alexnet</a>, and train it on images with random labels, then you can still achieve very high accuracy on the training data. (Furthermore, usual regularization strategies, which are believed to promote better generalization, do not help much.) Needless to say, the trained net is subsequently unable to predict the (random) labels of still-unseen images, which means it doesn’t generalize. The paper notes that the ability to fit a classifier to data with random labels is also a traditional measure in machine learning called Rademacher complexity (which we will discuss shortly) and thus Rademacher complexity gives no meaningful bounds on sample complexity. I found this paper entertainingly written and recommend reading it, despite having given away the punchline. Congratulations to the authors for winning best paper at ICLR 2017.</p>
<p>But I would be remiss if I didn’t report that at the <a href="https://simons.berkeley.edu/programs/machinelearning2017">Simons Institute Semester on theoretical ML in spring 2017</a> generalization theory experts expressed unhappiness about this paper, and especially its title. They felt that similar issues had been extensively studied in context of simpler models such as kernel SVMs (which, to be fair, is clearly mentioned in the paper). It is trivial to design SVM architectures with high Rademacher complexity which nevertheless train and generalize well on real-life data. Furthermore, theory was developed to explain this generalization behavior (and also for related models like boosting). On a related note, several earlier papers of Behnam Neyshabur and coauthors (see <a href="https://arxiv.org/abs/1605.07154">this paper</a> and for a full account, <a href="https://arxiv.org/abs/1703.11008">Behnam’s thesis</a>)
had made points fairly similar to Zhang et al. pertaining to deep nets.</p>
<p>But regardless of such complaints, we should be happy about the attention brought by Zhang et al.’s paper to a core theory challenge. Indeed, the passionate discussants at the Simons semester themselves banded together in subgroups to address this challenge: these efforts resulted in papers by <a href="https://arxiv.org/abs/1703.11008">Dziugaite and Roy</a>, then <a href="https://arxiv.org/abs/1706.08498">Bartlett, Foster, and Telgarsky</a>, and finally <a href="https://arxiv.org/abs/1707.09564">Neyshabur, Bhojanapalli, McAllester, and Srebro</a>. (The latter two were presented at NIPS’17 this week.)</p>
<p>Before surveying these results let me start by suggesting that some of the controversy over the title of Zhang et al.’s paper stems from some basic confusion about whether current generalization theory is prescriptive or merely descriptive. This confusion arises from the standard treatment of generalization theory in courses and textbooks, as I discovered while teaching the recent developments in <a href="http://www.cs.princeton.edu/courses/archive/fall17/cos597A/">my graduate seminar</a>.</p>
<h3 id="prescriptive-versus-descriptive-theory">Prescriptive versus descriptive theory</h3>
<p>To illustrate the difference, consider a patient who says to his doctor: “Doctor, I wake up often at night and am tired all day.”</p>
<blockquote>
<p>Doctor 1 (without any physical examination): “Oh, you have sleep disorder.”</p>
</blockquote>
<p>I call such a diagnosis <em>descriptive</em>, since it only attaches a label to the patient’s problem, without giving any insight into how to solve the problem. Contrast with:</p>
<blockquote>
<p>Doctor 2 (after careful physical examination): “A growth in your sinus is causing sleep apnea. Removing it will resolve your problems.”</p>
</blockquote>
<p>Such a diagnosis is <em>prescriptive.</em></p>
<h2 id="generalization-theory-descriptive-or-prescriptive">Generalization theory: descriptive or prescriptive?</h2>
<p>Generalization theory notions such as VC dimension, Rademacher complexity, and PAC-Bayes bound, consist of attaching a <em>descriptive label</em> to the basic phenomenon of lack of generalization. They are hard to compute for today’s complicated ML models, let alone to use as a guide in designing learning systems.</p>
<p>Recall what it means for a hypothesis/classifier $h$ to not generalize. Assume the training data consists of a sample $S = \{(x_1, y_1), (x_2, y_2),\ldots, (x_m, y_m)\}$ of $m$ examples from some distribution ${\mathcal D}$. A <em>loss function</em> $\ell$ describes how well hypothesis $h$ classifies a datapoint: the loss $\ell(h, (x, y))$ is high if the hypothesis didn’t come close to producing the label $y$ on $x$ and low if it came close. (To give an example, the <em>regression</em> loss is $(h(x) -y)^2$.) Now let us denote by $\Delta_S(h)$ the average loss on sample points in $S$, and by $\Delta_{\mathcal D}(h)$ the expected loss on samples from distribution ${\mathcal D}$.
Training <em>generalizes</em> if the hypothesis $h$ that minimizes $\Delta_S(h)$ for a random sample $S$ also achieves very similarly low loss $\Delta_{\mathcal D}(h)$ on the full distribution. When this fails to happen, we have:</p>
<blockquote>
<p><strong>Lack of generalization:</strong> $\Delta_S(h) \ll \Delta_{\mathcal D}(h) \qquad (1). $</p>
</blockquote>
<p>In practice, lack of generalization is detected by taking a second sample
(“held out set”) $S_2$ of size $m$ from ${\mathcal D}$. By concentration bounds, the average loss of $h$ on this second sample closely approximates $\Delta_{\mathcal D}(h)$, allowing us to conclude</p>
<script type="math/tex; mode=display">\Delta_S(h) - \Delta_{S_2}(h) \ll 0 \qquad (2).</script>
<h3 id="generalization-theory-descriptive-parts">Generalization Theory: Descriptive Parts</h3>
<p>Let’s discuss <strong>Rademacher complexity,</strong> which I will simplify a bit for this discussion. (See also <a href="http://www.cs.princeton.edu/courses/archive/fall17/cos597A/lecnotes/generalize.pdf">scribe notes of my lecture</a>.) For convenience assume in this discussion that labels and loss are $0,1$, and
assume that the badly generalizing $h$ predicts perfectly on the training sample $S$ and is completely wrong on the heldout set $S_2$, meaning</p>
<p>$\Delta_S(h) - \Delta_{S_2}(h) \approx - 1 \qquad (3)$</p>
<p>Rademacher complexity concerns the following thought experiment. Take a single sample of size $2m$ from $\mathcal{D}$, split it into two and call the first half $S$ and the second $S_2$. <em>Flip</em> the labels of points in $S_2$. Now try to find a classifier $h$ that best describes this new sample, meaning one that minimizes $\Delta_S(h) + 1- \Delta_{S_2}(h)$. This expression follows since flipping the label of a point turns good classification into bad and vice versa, and thus the loss function for $S_2$ is $1$ minus the old loss. We say the class of classifiers has high Rademacher complexity if, with high probability, this minimum is small, say close to $0$.</p>
<p>But a glance at (3) shows that it implies high Rademacher complexity: $S, S_2$ were random samples of size $m$ from $\mathcal{D}$, so their combined size is $2m$, and when generalization failed we succeeded in finding a hypothesis $h$ for which $\Delta_S(h) + 1- \Delta_{S_2}(h)$ is very small.</p>
<p>In other words, returning to our medical analogy, the doctor only had to hear “Generalization didn’t happen” to pipe up with: “Rademacher complexity is high.” This is why I call this result descriptive.</p>
<p>The <strong>VC dimension</strong> bound is similarly descriptive. VC dimension is defined to be at least $k$ if there exists a set of size $k$ such that the following is true. If we look at all possible classifiers in the class, and the sequence of labels each gives to the $k$ datapoints in the sample, then we can find all possible $2^{k}$ sequences of $0$’s and $1$’s.</p>
<p>If generalization does not happen as in (2) or (3) then this turns out to imply that VC dimension is at least around $\epsilon m$ for some $\epsilon >0$. The reason is that the $2m$ data points were split randomly into $S, S_2$, and there are $2^{2m}$ such splittings. When the generalization error is $\Omega(1)$ this can be shown to imply that we can achieve $2^{\Omega(m)}$ labelings of the $2m$ datapoints using all possible classifiers. Now the classic Sauer’s lemma (see any lecture notes on this topic, such as <a href="https://www.cs.princeton.edu/courses/archive/spring14/cos511/scribe_notes/0220.pdf">Schapire’s</a>) can be used to show that
VC dimension is at least $\epsilon m/\log m$ for some constant $\epsilon>0$.</p>
<p>Thus again, the doctor only has to hear “Generalization didn’t happen with sample size $m$” to pipe up with: “VC dimension is higher than $\Omega(m/\log m)$.”</p>
<p>One can similarly show that PAC-Bayes bounds are also descriptive, as you can see in <a href="http://www.cs.princeton.edu/courses/archive/fall17/cos597A/lecnotes/generalize.pdf">scribe notes from my lecture</a>.</p>
<blockquote>
<p>Why do students get confused and think that such tools of generalization theory gives some powerful technique to guide design of machine learning algorithms?</p>
</blockquote>
<p>Answer: Probably because standard presentation in lecture notes and textbooks seems to pretend that we are computationally-omnipotent beings who can <em>compute</em> VC dimension and Rademacher complexity and thus arrive at meaningful bounds on sample sizes needed for training to generalize. While this may have been possible in the old days with simple classifiers, today we have
complicated classifiers with millions of variables, which furthermore are products of nonconvex optimization techniques like backpropagation.
The only way to actually lower bound the Rademacher complexity of such complicated learning architectures is to try training a classifier, and detect lack of generalization via a held-out set. Every practitioner in the world already does this (without realizing it), and kudos to Zhang et al. for highlighting that theory currently offers nothing better.</p>
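<p>Concretely, that check looks something like the following sketch (scikit-learn on synthetic data, as a stand-in for your architecture and dataset): fit random labels, then compare training accuracy with held-out accuracy:</p>

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
m, d = 500, 20
X_train, X_held = rng.normal(size=(m, d)), rng.normal(size=(m, d))
y_train = rng.integers(0, 2, size=m)      # random labels: nothing to "learn"
y_held = rng.integers(0, 2, size=m)

clf = MLPClassifier(hidden_layer_sizes=(512, 512), max_iter=2000).fit(X_train, y_train)
print("train accuracy:   ", clf.score(X_train, y_train))   # often near 1.0
print("held-out accuracy:", clf.score(X_held, y_held))      # near 0.5: no generalization
```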
<h2 id="toward-a-prescriptive-generalization-theory-the-new-papers">Toward a prescriptive generalization theory: the new papers</h2>
<p>In our medical analogy we saw that the doctor needs to at least do a physical examination to have a prescriptive diagnosis. The authors of the new papers intuitively grasp this point, and try to identify properties of real-life deep nets that may lead to better generalization. Such an analysis (related to “margin”) was done for simple 2-layer networks a couple of decades ago, and the challenge is to find analogs for multilayer networks. Both Bartlett et al. and Neyshabur et al. home in on the <a href="https://nickhar.wordpress.com/2012/02/29/lecture-15-low-rank-approximation-of-matrices/"><em>stable rank</em></a> of the weight matrices of the layers of the deep net. This can be seen as an instance of a “flat minimum,” which has been discussed in the <a href="http://www.bioinf.jku.at/publications/older/3304.pdf">neural nets literature</a> for many years. I will present my take on these results as well as some improvements in a future post. Note that these methods do not as yet give any nontrivial bounds on the number of datapoints needed for training the nets in question.</p>
<p><a href="https://arxiv.org/abs/1703.11008">Dziugaite and Roy</a> take a slightly different tack. They start with McAllester’s 1999 PAC-Bayes bound, which says that if the algorithm’s prior distribution on the hypotheses is $P$ then for every posterior distributions $Q$ (which could depend on the data) on the hypotheses the generalization error of the average classifier picked according to $Q$ is upper bounded as follows where $D()$ denotes <a href="https://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence">KL divergence</a>:</p>
<div style="text-align:center;">
<img style="width:600px;" src="http://www.cs.princeton.edu/courses/archive/fall17/cos597A/lecnotes/pacbayes.png" />
</div>
<p>This allows upper bounds on generalization error (specifically, upper bounds on the number of samples that guarantee such an upper bound) by proceeding as in <a href="http://www.cs.cmu.edu/~jcl/papers/nn_bound/not_bound.pdf">Langford and Caruana’s old paper</a>, where $P$ is a uniform Gaussian, and $Q$ is a noised version of the trained deep net (whose generalization we are trying to explain). Specifically, if $w_{ij}$ is the weight of edge $(i, j)$ in the trained net, then $Q$ consists of adding a Gaussian noise $\eta_{ij}$ to weight $w_{ij}$. Thus a random classifier according to $Q$ is nothing but a noised version of the trained net. Now we arrive at the crucial idea: use nonconvex optimization to find a choice for the variance of $\eta_{ij}$ that balances two competing criteria: (a) the average classifier drawn from $Q$ has training error not much more than the original trained net (again, this is a quantification of the “flatness” of the minimum found by the optimization), and (b) the right hand side of the above expression is as small as possible. Assuming (a) and (b) can be suitably bounded, it follows that the average classifier from $Q$ works reasonably well on unseen data. (Note that this method only proves generalization of a noised version of the trained classifier.)</p>
<p>Applying this method to simple fully-connected neural nets trained on the MNIST dataset, they can prove that the (noised) net has error at most $17$ percent on MNIST (whereas the <em>actual</em> error is much lower, at 2-3 percent). Hence the title of their paper, which promises <em>nonvacuous generalization bounds.</em> What I find most interesting about this result is that it uses the power of nonconvex optimization (harnessed above to find a suitable noised distribution $Q$) to cast light on one of the metaquestions about nonconvex optimization, namely, why does deep learning not overfit!</p>
Fri, 08 Dec 2017 10:00:00 -0800
http://offconvex.github.io/2017/12/08/generalization1/
How to Escape Saddle Points Efficiently<p>A core, emerging problem in nonconvex optimization involves the escape of saddle points. While recent research has shown that gradient descent (GD) generically escapes saddle points asymptotically (see <a href="http://www.offconvex.org/2016/03/22/saddlepoints/">Rong Ge’s</a> and <a href="http://www.offconvex.org/2016/03/24/saddles-again/">Ben Recht’s</a> blog posts), the critical open problem is one of <strong>efficiency</strong> — is GD able to move past saddle points quickly, or can it be slowed down significantly? How does the rate of escape scale with the ambient dimensionality? In this post, we describe <a href="https://arxiv.org/abs/1703.00887">our recent work with Rong Ge, Praneeth Netrapalli and Sham Kakade</a>, that provides the first provable <em>positive</em> answer to the efficiency question, showing that, rather surprisingly, GD augmented with suitable perturbations escapes saddle points efficiently; indeed, in terms of rate and dimension dependence it is almost as if the saddle points aren’t there!</p>
<h2 id="perturbing-gradient-descent">Perturbing Gradient Descent</h2>
<p>We are in the realm of classical gradient descent (GD) — given a function $f:\mathbb{R}^d \to \mathbb{R}$ we aim to minimize the function by moving in the direction of the negative gradient:</p>
<script type="math/tex; mode=display">x_{t+1} = x_t - \eta \nabla f(x_t),</script>
<p>where $x_t$ are the iterates and $\eta$ is the step size. GD is well understood theoretically in the case of convex optimization, but the general case of nonconvex optimization has been far less studied. We know that GD converges quickly to the neighborhood of stationary points (points where $\nabla f(x) = 0$) in the nonconvex setting, but these stationary points may be local minima or, unhelpfully, local maxima or saddle points.</p>
<p>Clearly GD will never move away from a stationary point if started there (even a local maximum); thus, to provide general guarantees, it is necessary to modify GD slightly to incorporate some degree of randomness. Two simple methods have been studied in the literature:</p>
<ol>
<li>
<p><strong>Intermittent Perturbations</strong>: <a href="http://arxiv.org/abs/1503.02101">Ge, Huang, Jin and Yuan 2015</a> considered adding occasional random perturbations to GD, and were able to provide the first <em>polynomial time</em> guarantee for GD to escape saddle points. (See also <a href="http://www.offconvex.org/2016/03/22/saddlepoints/">Rong Ge’s post</a> )</p>
</li>
<li>
<p><strong>Random Initialization</strong>: <a href="http://arxiv.org/abs/1602.04915">Lee et al. 2016</a> showed that with only random initialization, GD provably avoids saddle points asymptotically (i.e., as the number of steps goes to infinity). (see also <a href="http://www.offconvex.org/2016/03/24/saddles-again/">Ben Recht’s post</a>)</p>
</li>
</ol>
<p>Asymptotic — and even polynomial-time — results are important for the general theory, but they stop short of explaining the success of gradient-based algorithms in practical nonconvex problems. And they fail to provide reassurance that runs of GD can be trusted — that we won’t find ourselves in a situation in which the learning curve flattens out for an indefinite amount of time, with the user having no way of knowing that the asymptotics have not yet kicked in. Lastly, they fail to provide reassurance that GD has the kind of favorable properties in high dimensions that it is known to have for convex problems.</p>
<p>One reasonable approach to this issue is to consider second-order (Hessian-based) algorithms. Although these algorithms are generally (far) more expensive per iteration than GD, and can be more complicated to implement, they do provide the kind of geometric information around saddle points that allows for efficient escape. Accordingly, a reasonable understanding of Hessian-based algorithms has emerged in the literature, and positive efficiency results have been obtained.</p>
<p><strong><em>Is GD also efficient? Or is the Hessian necessary for fast escape of saddle points?</em></strong></p>
<p>For the random initialization strategy discussed above, the answer to the first question turns out to be negative. Indeed, this approach is provably <em>inefficient</em> in general, taking exponential time to escape saddle points in the worst case (see the “On the Necessity of Adding Perturbations” section below).</p>
<p>Somewhat surprisingly, it turns out that we obtain a rather different — and <em>positive</em> — result if we consider the perturbation strategy. To be able to state this result, let us be clear on the algorithm that we analyze:</p>
<blockquote>
<p><strong>Perturbed gradient descent (PGD)</strong></p>
<ol>
<li><strong>for</strong> $~t = 1, 2, \ldots ~$ <strong>do</strong></li>
<li>$\quad\quad x_{t} \leftarrow x_{t-1} - \eta \nabla f (x_{t-1})$</li>
<li>$\quad\quad$ <strong>if</strong> $~$<em>perturbation condition holds</em>$~$ <strong>then</strong></li>
<li>$\quad\quad\quad\quad x_t \leftarrow x_t + \xi_t$</li>
</ol>
</blockquote>
<p>Here the perturbation $\xi_t$ is sampled uniformly from a ball centered at zero with a suitably small radius, and is added to the iterate when the gradient is suitably small. These particular choices are made for analytic convenience; we do not believe that uniform noise is necessary, nor do we believe it essential that noise be added only when the gradient is small.</p>
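<p>To make the pseudocode concrete, here is a minimal numpy sketch of PGD on a toy function. The perturbation condition (perturb when the gradient norm is small and no perturbation was added in the last few steps), the test function, and all constants are illustrative choices, not the exact conditions or constants from our analysis.</p>
<pre><code class="language-python">
import numpy as np

def perturbed_gradient_descent(grad_f, x0, eta=0.1, r=1e-2, g_thresh=1e-3,
                               t_thresh=10, n_iters=1000, seed=0):
    """Minimal sketch of PGD: gradient descent plus occasional uniform-ball noise.

    grad_f: function returning the gradient of f; r: perturbation radius;
    g_thresh: gradient-norm threshold; t_thresh: minimum number of steps
    between perturbations. (Illustrative choices, not the analyzed constants.)
    """
    rng = np.random.default_rng(seed)
    x = np.array(x0, dtype=float)
    d = x.shape[0]
    last_perturb = -t_thresh
    for t in range(n_iters):
        x = x - eta * grad_f(x)                            # plain GD step
        if np.linalg.norm(grad_f(x)) < g_thresh and t - last_perturb >= t_thresh:
            xi = rng.normal(size=d)                        # uniform point in a ball:
            xi *= r * rng.uniform() ** (1.0 / d) / np.linalg.norm(xi)
            x = x + xi
            last_perturb = t
    return x

# f(x) = 0.25*x1^4 - 0.5*x1^2 + 0.5*x2^2 has a strict saddle at the origin and
# minima at (±1, 0); PGD started at the saddle ends up near one of the minima.
grad = lambda x: np.array([x[0] ** 3 - x[0], x[1]])
print(perturbed_gradient_descent(grad, x0=[0.0, 0.0]))
</code></pre>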
<h2 id="strict-saddle-and-second-order-stationary-points">Strict-Saddle and Second-order Stationary Points</h2>
<p>We define <em>saddle points</em> in this post to include both classical saddle points and local maxima. These are stationary points at which the function is locally maximized along <em>at least one direction</em>. Saddle points and local minima can be categorized according to the minimum eigenvalue of the Hessian:</p>
<script type="math/tex; mode=display">% <![CDATA[
\lambda_{\min}(\nabla^2 f(x)) \begin{cases}
> 0 \quad\quad \text{local minimum} \\
= 0 \quad\quad \text{local minimum or saddle point} \\
< 0 \quad\quad \text{saddle point}
\end{cases} %]]></script>
<p>We further call the saddle points in the last category, where $\lambda_{\min}(\nabla^2 f(x)) < 0$, <strong>strict saddle points</strong>.</p>
<p style="text-align:center;">
<img src="/assets/saddle_eff/strictsaddle.png" width="85%" alt="Strict and Non-strict Saddle Point" />
</p>
<p>While a non-strict saddle point can be surrounded by a flat valley, a strict saddle point requires that there is <em>at least one direction</em> along which the curvature is strictly negative. The presence of such a direction gives a gradient-based algorithm the possibility of escaping the saddle point. In general, distinguishing local minima from non-strict saddle points is <em>NP-hard</em>; therefore, we — and previous authors — focus on escaping <em>strict</em> saddle points.</p>
<p>Formally, we make the following two standard assumptions regarding smoothness.</p>
<blockquote>
<p><strong>Assumption 1</strong>: $f$ is $\ell$-gradient-Lipschitz, i.e. <br />
$\quad\quad\quad\quad \forall x_1, x_2, \|\nabla f(x_1) - \nabla f(x_2)\| \le \ell \|x_1 - x_2\|$. <br />
$~$<br />
<strong>Assumption 2</strong>: $f$ is $\rho$-Hessian-Lipschitz, i.e. <br />
$\quad\quad\quad\quad \forall x_1, x_2$, $\|\nabla^2 f(x_1) - \nabla^2 f(x_2)\| \le \rho \|x_1 - x_2\|$.</p>
</blockquote>
<p>Classical theory studies convergence to a first-order stationary point ($\nabla f(x) = 0$) by bounding the number of iterations needed to find an <strong>$\epsilon$-first-order stationary point</strong>, i.e. a point with $\|\nabla f(x)\| \le \epsilon$. Analogously, we formulate the speed of escape from strict saddle points, and the ensuing convergence to a second-order stationary point ($\nabla f(x) = 0$, $\lambda_{\min}(\nabla^2 f(x)) \ge 0$), via an $\epsilon$-version of the definition:</p>
<blockquote>
<p><strong>Definition</strong>: A point $x$ is an <strong>$\epsilon$-second-order stationary point</strong> if:<br />
$\quad\quad\quad\quad \|\nabla f(x)\|\le \epsilon$, and $\lambda_{\min}(\nabla^2 f(x)) \ge -\sqrt{\rho \epsilon}$.</p>
</blockquote>
<p>In this definition, $\rho$ is the Hessian Lipschitz constant introduced above. This scaling follows the convention of <a href="http://rd.springer.com/article/10.1007%2Fs10107-006-0706-8">Nesterov and Polyak 2006</a>.</p>
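<p>As a concrete reading of this definition, here is a small numpy sketch that checks the two conditions for a user-supplied gradient, Hessian, and Hessian-Lipschitz constant $\rho$ (the example function is just an illustration):</p>
<pre><code class="language-python">
import numpy as np

def is_eps_second_order_stationary(grad, hess, x, eps, rho):
    """Check both conditions: small gradient norm, and smallest Hessian
    eigenvalue at least -sqrt(rho * eps)."""
    lam_min = np.linalg.eigvalsh(hess(x)).min()
    return np.linalg.norm(grad(x)) <= eps and lam_min >= -np.sqrt(rho * eps)

# Example: f(x) = 0.5 * x^T diag(-1, 1) x. The origin is first-order stationary
# but, for small eps, not second-order stationary (it is a strict saddle).
grad = lambda x: np.diag([-1.0, 1.0]) @ x
hess = lambda x: np.diag([-1.0, 1.0])
print(is_eps_second_order_stationary(grad, hess, np.zeros(2), eps=1e-3, rho=1.0))  # False
</code></pre>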
<h3 id="applications">Applications</h3>
<p>In a wide range of practical nonconvex problems it has been proved that <strong>all saddle points are strict</strong> — such problems include, but are not limited to, principal components analysis, canonical correlation analysis,
<a href="http://arxiv.org/abs/1503.02101">orthogonal tensor decomposition</a>,
<a href="http://arxiv.org/abs/1602.06664">phase retrieval</a>,
<a href="http://arxiv.org/abs/1504.06785">dictionary learning</a>,
<a href="http://arxiv.org/abs/1605.07221">matrix sensing</a>,
<a href="http://arxiv.org/abs/1605.07272">matrix completion</a>,
and <a href="http://arxiv.org/abs/1704.00708">other nonconvex low-rank problems</a>.</p>
<p>Furthermore, in all of these nonconvex problems, it also turns out that <strong>all local minima are global minima</strong>. Thus, in these cases, any general efficient algorithm for finding $\epsilon$-second-order stationary points immediately becomes an efficient algorithm for solving those nonconvex problems with global guarantees.</p>
<h2 id="escaping-saddle-point-with-negligible-overhead">Escaping Saddle Point with Negligible Overhead</h2>
<p>In the classical case of first-order stationary points, GD is known to have very favorable theoretical properties:</p>
<blockquote>
<p><strong>Theorem (<a href="http://rd.springer.com/book/10.1007%2F978-1-4419-8853-9">Nesterov 1998</a>)</strong>: If Assumption 1 holds, then GD, with $\eta = 1/\ell$, finds an $\epsilon$-<strong>first</strong>-order stationary point in $2\ell (f(x_0) - f^\star)/\epsilon^2$ iterations.</p>
</blockquote>
<p>In this theorem, $x_0$ is the initial point and $f^\star$ is the function value of the global minimum. The theorem says that for any gradient-Lipschitz function, a stationary point can be found by GD in $O(1/\epsilon^2)$ steps, with no explicit dependence on $d$. This is called “dimension-free optimization” in the literature; of course the cost of a gradient computation is $O(d)$, and thus the overall runtime of GD scales as $O(d)$. The linear scaling in $d$ is especially important for modern high-dimensional nonconvex problems such as deep learning.</p>
<p>We now wish to address the corresponding problem for second-order stationary points.
What is the best we can hope for? Can we also achieve</p>
<ol>
<li>A dimension-free number of iterations;</li>
<li>An $O(1/\epsilon^2)$ convergence rate;</li>
<li>The same dependence on $\ell$ and $(f(x_0) - f^\star)$ as in (Nesterov 1998)?</li>
</ol>
<p>Rather surprisingly, the answer is <em>Yes</em> to all three questions (up to small log factors).</p>
<blockquote>
<p><strong>Main Theorem</strong>: If Assumptions 1 and 2 hold, then PGD, with $\eta = O(1/\ell)$, finds an $\epsilon$-<strong>second</strong>-order stationary point in $\tilde{O}(\ell (f(x_0) - f^\star)/\epsilon^2)$ iterations with high probability.</p>
</blockquote>
<p>Here $\tilde{O}(\cdot)$ hides only logarithmic factors; indeed, the dimension dependence in our result is only $\log^4(d)$. The theorem thus asserts that a perturbed form of GD, under an additional Hessian-Lipschitz condition, <strong><em>converges to a second-order-stationary point in almost the same time required for GD to converge to a first-order-stationary point.</em></strong> In this sense, we claim that PGD can escape strict saddle points almost for free.</p>
<p>We turn to a discussion of some of the intuitions underlying these results.</p>
<h3 id="why-do-polylogd-iterations-suffice">Why do polylog(d) iterations suffice?</h3>
<p>Our strict-saddle assumption means that, in the worst case, there is only one direction out of $d$ along which we can escape. A naive search for the descent direction intuitively should take at least $\text{poly}(d)$ iterations, so why should $\text{polylog}(d)$ suffice?</p>
<p>Consider a simple case in which we assume that the function is quadratic in the neighborhood of the saddle point. That is, let the objective function be $f(x) = \frac{1}{2} x^\top H x$, with a saddle point at zero and constant Hessian $H = \text{diag}(-1, 1, \cdots, 1)$. In this case, only the first direction is an escape direction (with negative eigenvalue $-1$).</p>
<p>It is straightforward to work out the general form of the iterates in this case:</p>
<script type="math/tex; mode=display">x_t = x_{t-1} - \eta \nabla f(x_{t-1}) = (I - \eta H)x_{t-1} = (I - \eta H)^t x_0.</script>
<p>Assume that we start at the saddle point at zero, then add a perturbation so that $x_0$ is sampled uniformly from a ball $\mathcal{B}_0(1)$ centered at zero with radius one.
The decrease in the function value can be expressed as:</p>
<script type="math/tex; mode=display">f(x_t) - f(0) = x_t^\top H x_t = x_0^\top (I - \eta H)^t H (I - \eta H)^t x_0.</script>
<p>Set the step size to be $1/2$, let $\lambda_i$ denote the $i$-th eigenvalue of the Hessian $H$ and let $\alpha_i = e_i^\top x_0$ denote the component in the $i$th direction of the initial point $x_0$. We have $\sum_{i=1}^d \alpha_i^2 = \| x_0\|^2 \le 1$, thus:</p>
<script type="math/tex; mode=display">f(x_t) - f(0) = \sum_{i=1}^d \lambda_i (1-\eta\lambda_i)^{2t} \alpha_i^2 \le -1.5^{2t} \alpha_1^2 + 0.5^{2t}.</script>
<p>A simple probability argument shows that sampling uniformly in $\mathcal{B}_0(1)$ will result in at least an $\Omega(1/d)$ squared component in the first direction with high probability; that is, $\alpha^2_1 = \Omega(1/d)$. Substituting this into the above equation, we see that it takes at most $O(\log d)$ steps for the function value to decrease by a constant amount.</p>
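<p>This calculation is easy to check numerically. The sketch below (with illustrative constants) runs plain GD on the quadratic $f(x) = \frac{1}{2} x^\top H x$ from a uniformly random point in the unit ball around the saddle, and reports how many steps it takes for the function value to drop by a constant; the count grows roughly like $\log d$.</p>
<pre><code class="language-python">
import numpy as np

def escape_steps(d, eta=0.5, seed=0):
    """Steps of plain GD on f(x) = 0.5 * x^T diag(-1,1,...,1) x, started from a
    uniformly random point in the unit ball around the saddle at 0, until the
    function value has decreased by 1."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=d)
    x *= rng.uniform() ** (1.0 / d) / np.linalg.norm(x)   # uniform point in the unit ball
    lam = np.ones(d)
    lam[0] = -1.0                                         # eigenvalues of H
    for t in range(1, 10000):
        x = (1.0 - eta * lam) * x                         # one GD step, i.e. multiply by (I - eta*H)
        if 0.5 * np.sum(lam * x ** 2) <= -1.0:            # f(x_t) - f(0) has dropped by 1
            return t
    return None

for d in [10, 100, 1000, 10000]:
    print(d, escape_steps(d))                             # grows roughly like log(d)
</code></pre>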
<h3 id="pancake-shape-stuck-region-for-general-hessian">Pancake-shape stuck region for general Hessian</h3>
<p>We can conclude that for the case of a constant Hessian, only when the perturbation $x_0$ lands in the set $\{x | ~ |e_1^\top x|^2 \le O(1/d)\}$ $\cap \mathcal{B}_0 (1)$, can we take a very long time to escape the saddle point. We call this set the <strong>stuck region</strong>; in this case it is a flat disk. In general, when the Hessian is no longer constant, the stuck region becomes a non-flat pancake, depicted as a green object in the left graph. In general this region will not have an analytic expression.</p>
<p>Earlier attempts to analyze the dynamics around saddle points tried to approximate the stuck region by a flat set. This results in a requirement of an extremely small step size and a correspondingly very large runtime complexity. Our sharp rate depends on a key observation — <em>although we don’t know the shape of the stuck region, we know it is very thin</em>.</p>
<p style="text-align:center;">
<img src="/assets/saddle_eff/flow.png" width="85%" alt="Pancake" />
</p>
<p>In order to characterize the “thinness” of this pancake, we studied pairs of hypothetical perturbation points $w, u$ separated by $O(1/\sqrt{d})$ along an escaping direction. We claim that if we run GD starting at $w$ and $u$, at least one of the resulting trajectories will escape the saddle point very quickly. This implies that the thickness of the stuck region can be at most $O(1/\sqrt{d})$, so a random perturbation has very little chance to land in the stuck region.</p>
<h2 id="on-the-necessity-of-adding-perturbations">On the Necessity of Adding Perturbations</h2>
<p>We have discussed two possible ways to modify the standard gradient descent algorithm, the first by adding intermittent perturbations, and the second by relying on random initialization. Although the latter exhibits asymptotic convergence, it does not yield efficient convergence in general; in recent <a href="http://arxiv.org/abs/1705.10412">joint work with Simon Du, Jason Lee, Barnabas Poczos, and Aarti Singh</a>, we have shown that even with fairly natural random initialization schemes and non-pathological functions, <strong>GD with only random initialization can be significantly slowed by saddle points, taking exponential time to escape. The behavior of PGD is strikingly different — it can generically escape saddle points in polynomial time.</strong></p>
<p>To establish this result, we considered random initializations from a very general class including Gaussians and uniform distributions over the hypercube, and we constructed a smooth objective function that satisfies both Assumptions 1 and 2. This function is constructed such that, even with random initialization, with high probability both GD and PGD have to travel sequentially in the vicinity of $d$ strict saddle points before reaching a local minimum. All strict saddle points have only one direction of escape. (See the left graph for the case of $d=2$).</p>
<p><img src="/assets/saddle_eff/necesperturbation.png" alt="NecessityPerturbation" /></p>
<p>When GD travels in the vicinity of a sequence of saddle points, it can get closer and closer to the later saddle points, and thereby take longer and longer to escape. Indeed, the time to escape the $i$th saddle point scales as $e^{i}$. On the other hand, PGD is always able to escape any saddle point in a small number of steps independent of the history. This phenomenon is confirmed by our experiments; see, for example, an experiment with $d=10$ in the right graph.</p>
<h2 id="conclusion">Conclusion</h2>
<p>In this post, we have shown that a perturbed form of gradient descent can converge to a second-order-stationary point at almost the same rate as standard gradient descent converges to a first-order-stationary point. This implies that Hessian information is not necessary to escape saddle points efficiently, and helps to explain why basic gradient-based algorithms such as GD (and SGD) work surprisingly well in the nonconvex setting. This new line of sharp convergence results can be directly applied to nonconvex problems such as matrix sensing/completion to establish efficient global convergence rates.</p>
<p>There are of course still many open problems in general nonconvex optimization. To name a few: will adding momentum improve the convergence rate to a second-order stationary point? What types of local minima are tractable, and are there useful structural assumptions we can impose so as to avoid bad local minima efficiently? We are making slow but steady progress on nonconvex optimization, and there is the hope that at some point we will transition from “black art” to “science”.</p>
Wed, 19 Jul 2017 03:00:00 -0700
http://offconvex.github.io/2017/07/19/saddle-efficiency/
Do GANs actually do distribution learning?<p>This post is about our new paper, which presents empirical evidence that current GANs (Generative Adversarial Nets) are quite far from learning the target distribution. Previous posts had <a href="http://www.offconvex.org/2017/03/15/GANs/">introduced GANs</a> and described <a href="http://www.offconvex.org/2017/03/30/GANs2/">new theoretical analysis of GANs</a> from <a href="https://arxiv.org/abs/1703.00573">our ICML17 paper</a>. One notable implication of our theoretical analysis was that when the discriminator size is bounded, then GANs training could appear to succeed (i.e., training objective reaches its optimum value) even if the generated distribution is discrete and has very small support — in other words, the training objective is unable to prevent even extreme <em>mode collapse</em>.</p>
<p>That paper led us (especially Sanjeev) into spirited discussions with colleagues, who wondered if this is <em>just</em> a theoretical result about potential misbehavior rather than a prediction about real-life training. After all, we’ve all seen the great pictures that GANs produce in real life, right? (Note that the theoretical result only describes a possible near-equilibrium that can arise with a certain mix of hyperparameters, and conceivably real-life training avoids that by suitable hyperparameter tuning.)</p>
<p>Our new empirical paper <a href="https://arxiv.org/abs/1706.08224v1">Do GANs actually learn the distribution? An empirical study</a> puts the issue to the test. We present empirical evidence that well-known GANs approaches do end up learning distributions of fairly low support, and thus presumably are not learning the target distribution.</p>
<p>Let’s start by imagining how large the support must be for the target distribution. For example, if the distribution is the set of all possible images of human faces (real or imagined), then these must involve all combinations of hair color/style, facial features, complexion, expression, pose, lighting, race, etc., and thus the possible set of images of faces that <em>humans will consider to be distinct</em> approaches infinity. (After all, there are billions of distinct people living on earth right now.)
GANs are trying to learn this full distribution using a finite sample of images, say <a href="http://mmlab.ie.cuhk.edu.hk/projects/CelebA.html">CelebA</a> which has $200,000$ images of celebrity faces.</p>
<p>Thus a simple sanity check for whether a GAN has truly come close to learning this distribution is to estimate how many “distinct” images it can produce. At first glance, such an estimation seems very difficult. After all, automated/heuristic measures of image similarity can be easily fooled, and we humans surely don’t have enough time to go through millions or billions of images, right?</p>
<p>Luckily, a crude estimate is possible using the simple birthday paradox, a staple of undergrad discrete math.</p>
<h2 id="birthday-paradox-test-for-size-of-the-support">Birthday paradox test for size of the support</h2>
<p>Imagine for argument’s sake that the human race were limited to a genetic diversity of a million — nature’s laws only allow this many distinct humans. How would this hard limit manifest itself in our day to day life? The birthday paradox says that if we take a random sample of a thousand people — note that most of us get to know this many people easily in our lifetimes — we’d see many <a href="https://en.wikipedia.org/wiki/Doppelg%C3%A4nger">doppelgangers</a>. Of course, in practice the only doppelgangers we encounter happen to be identical twins.</p>
<p>Formally, the birthday paradox says that if a discrete distribution has support size $N$, then a random sample of size about
$\sqrt{N}$ would be quite likely to contain a duplicate. (The name comes from its implication that if you put $23 \approx \sqrt{365}$ random people in a room, the chance that two of them have the same birthday is about $1/2$.)</p>
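<p>The underlying calculation is elementary; the following sketch computes the probability of seeing at least one duplicate among $s$ samples drawn uniformly from a support of size $N$ (uniformity is, of course, a simplifying assumption):</p>
<pre><code class="language-python">
import numpy as np

def collision_prob(N, s):
    """P(at least one duplicate) when drawing s i.i.d. samples uniformly from N items."""
    # 1 - (N/N) * ((N-1)/N) * ... * ((N-s+1)/N), computed in log space for stability
    log_no_collision = np.sum(np.log1p(-np.arange(s) / N))
    return 1.0 - np.exp(log_no_collision)

print(collision_prob(365, 23))        # ~0.507, the classical birthday paradox
print(collision_prob(400 ** 2, 400))  # ~0.39: a batch of ~sqrt(N) gives a fair chance of a duplicate
</code></pre>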
<p>In the GAN setting, the distribution is continuous, not discrete. Thus our proposed birthday paradox test for GANs is as follows.</p>
<p>(a) Pick a sample of size $s$ from the generated distribution. (b) Use an automated measure of image similarity to flag the $20$ (say) most similar pairs in the sample. (c) Visually inspect the flagged pairs and check for images that a human would consider near-duplicates. (d) Repeat.</p>
<p>If this test reveals that samples of size $s$ have duplicate images with good probability, then suspect that the distribution has support size about $s^2$.</p>
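<p>Steps (a) and (b) are straightforward to automate. Here is a small sketch that flags the most similar pairs in a batch under Euclidean distance in pixel space (the heuristic we use for faces below); the random array is only a stand-in for a batch of generated samples:</p>
<pre><code class="language-python">
import numpy as np

def most_similar_pairs(samples, k=20):
    """Return the k pairs (i, j) with smallest pairwise Euclidean distance.
    samples: array of shape (s, D), one flattened image per row."""
    sq = np.sum(samples ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * samples @ samples.T   # squared distances
    iu, ju = np.triu_indices(samples.shape[0], k=1)              # each unordered pair once
    order = np.argsort(d2[iu, ju])[:k]
    return list(zip(iu[order], ju[order]))

batch = np.random.rand(400, 64 * 64 * 3)      # stand-in for 400 generated 64x64x3 images
print(most_similar_pairs(batch, k=5))         # pairs to inspect visually in step (c)
</code></pre>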
<p>Note that the test is not definitive, because the distribution could assign, say, probability $10\%$ to a single image, and be uniform on a huge number of other images. Then the test would be quite likely to find a duplicate even with $20$ samples, even though the true support size is huge. But such nonuniformity (a lot of probability being assigned to a few images) is the only failure mode of the birthday paradox test calculation, and such nonuniformity would itself be considered a failure mode of GANs training. The CIFAR-10 samples below show that such nonuniformity can be severe in practice: the generator outputs one particular image of an automobile with fairly high probability. On CIFAR-10, this failure mode is also observed in the frog and cat classes.</p>
<h2 id="experimental-results">Experimental results.</h2>
<p>Our test was done using two datasets, CelebA (faces) and CIFAR-10.</p>
<p>For faces, we found Euclidean distance in pixel space works well as a heuristic similarity measure, probably because the samples are centered and aligned. For CIFAR-10, we pre-train a discriminative Convolutional Neural Net for the full classification problem, and use the top layer representation as an embedding of the image. Heuristic similarity is then measured as the Euclidean distance in the embedding space. Possibly these similarity measures are crude, but note that improving them can only <em>lower</em> our estimate of the support size of the distribution, since a better similarity measure can only increase the number of duplicates found. Thus our estimates below should be considered as upper bounds on the support size of the distribution.</p>
<h2 id="results--on-celeba-dataset">Results on CelebA dataset</h2>
<p>We tested the following methods, doing the birthday paradox test with Euclidean distance in pixel space as the heuristic similarity measure.</p>
<ul>
<li>DCGAN — unconditional, as described in <a href="https://arxiv.org/abs/1406.2661">Goodfellow et al. 2014</a> and <a href="https://arxiv.org/abs/1511.06434">Radford et al. 2015</a></li>
<li>MIX+GAN protocol introduced in <a href="https://arxiv.org/abs/1703.00573">Arora et al.</a>, specifically, MIX+DCGAN with $3$ mixture components.</li>
<li><a href="https://arxiv.org/abs/1606.00704">Adversarily Learned Inference (ALI)</a> (or equivalently <a href="https://arxiv.org/abs/1605.09782">BiGANs</a>). (ALI is probabilistic version of BiGANs, but their architectures are equivalent. So we only tested ALI in our experiments.)</li>
</ul>
<p>We find that with probability $\geq 50\%$, a batch of about $400$ samples contains at least one pair of duplicates for both DCGAN and MIX+DCGAN. The figure below gives example duplicates together with the nearest neighbors we could find for them in the training set. These results suggest that the support size of the distribution is less than $400^2\approx 160000$, which is actually lower than the diversity of the training set, even though (as we see below) the distribution is not simply memorizing the training set.</p>
<p>ALI (or BiGANs) appears to be somewhat more diverse: collisions appear with $50\%$ probability only at a batch size of about $1000$, implying a support size of about a million. This is $5$x the training set, but still much smaller than the diversity one would expect among human faces. (After all, doppelgangers don’t appear in samples of a few thousand people in real life.) For a fair comparison, we set the discriminator of ALI (or BiGANs) to be roughly the same size as that of the DCGAN model, since the results below suggest that the discriminator size has a strong effect on the diversity of the learnt distribution. Nevertheless, these tests do support the suggestion that the bidirectional structure prevents some of the mode collapse observed in usual GANs.</p>
<p><img src="https://www.dropbox.com/s/7v2qbs4i82cczsy/similar_face_pairs.png?dl=1" alt="similar_face_pairs" /></p>
<h2 id="diversity-vs-discriminator-size">Diversity vs Discriminator Size</h2>
<p>The analysis of <a href="https://arxiv.org/abs/1703.00573">Arora et al.</a> suggested that the support size could be as low as near-linear in the capacity of the discriminator; in other words, there is a near-equilibrium in which a distribution of such a small support could suffice to fool the best discriminator. So it is worth investigating whether training in real life allows generator nets to exploit this “loophole” in the training that we now know is in principle available to them.</p>
<p>We built DCGANs with increasingly larger discriminators while fixing the other hyper-parameters. The discriminator used here is a 5-layer Convolutional Neural Network whose layers have $1\times, 2\times, 4\times, 8\times \textit{dim}$ output channels, where $\textit{dim}$ is chosen from $16,32,48,64,80,96,112,128$. Thus the discriminator size should be proportional to $\textit{dim}^2$. The figure below suggests that in this simple setup the diversity of the learnt distribution does indeed grow near-linearly with the discriminator size. (Note the diversity is seen to plateau, possibly because one needs to change other parameters like depth to meaningfully add more capacity to the discriminator.)</p>
<p><img src="https://www.dropbox.com/s/zmhprwu2w2rddep/diversity_vs_size.png?dl=1" alt="diversity_vs_size" /></p>
<h2 id="results-for-cifar-10">Results for CIFAR-10</h2>
<p>On CIFAR-10, as mentioned earlier, we use a heuristic image similarity computed with a convolutional neural net with 3 convolutional layers, 2 fully-connected layers and a 10-class soft-max output, pretrained with a multi-class classification objective. Specifically, the top-layer features are viewed as embeddings for the similarity test using Euclidean distance.
We found that this heuristic similarity test quickly becomes useless if the samples display noise artifacts, and thus it was effective only on the very best GANs that generate the most real-looking images. For CIFAR-10
this led us to <a href="https://arxiv.org/abs/1612.04357">Stacked GAN</a>, currently believed to be the best generative model on CIFAR-10 (Inception Score $8.59$). Since this model is trained by conditioning on the class label, we measure its diversity within each class separately.</p>
<p>The training set for each class has $10k$ images, but since the generator is allowed to learn from all classes, presumably it can mix and match (especially background, lighting, landscape etc.) between classes and learn a fairly rich set of images.</p>
<p>Now we list the batch sizes needed for duplicates to appear.</p>
<p><img src="https://www.dropbox.com/s/bumdhzlcrk1z97b/cifar_diversity_table.png?dl=1" alt="cifar_diversity_table" /></p>
<p>As before, we show duplicate samples as well as the nearest neighbor to the samples in training set (identified by using heuristic similarity measure to flag possibilities and confirming visually).</p>
<p><img src="https://www.dropbox.com/s/8itrpjngrc13eam/selected_similar_cifar_samples.png?dl=1" alt="similar_cifar_samples" /></p>
<p>We find that the closest image in the training set is quite different from the duplicates detected, which suggests that the issue with GANs is indeed lack of diversity (low support size) rather than memorization of the training set. (See <a href="https://arxiv.org/abs/1706.08224v1">the paper</a> for more examples.)</p>
<p>Note that by and large the diversity of the learnt distribution is higher than that of the training set, but still not as high as one would expect in terms of all possible combinations.</p>
<h1 id="birthday-paradox-test-for-vaes">Birthday paradox test for VAEs</h1>
<p><img src="https://www.dropbox.com/s/p1qlgr66rnufnal/vae_collisions.png?dl=1" alt="vae_collisions" /></p>
<p>Given these findings, it is natural to wonder about the diversity of distributions learned using earlier methods such as <a href="https://arxiv.org/abs/1312.6114">Variational Auto-Encoders</a> (VAEs). Instead of using feedback from the discriminator, these methods train the generator net using feedback from an approximate perplexity calculation. Thus the analysis of <a href="https://arxiv.org/abs/1703.00573">Arora et al.</a> does not apply as is to such methods and it is conceivable they exhibit higher diversity. However, we found the birthday paradox test difficult to run since samples from a VAE trained on CelebA were not realistic or sharp enough for a human to definitively conclude whether or not two images were almost the same. The figure above shows examples of collision candidates found in batches of 400 samples; clearly some indicative parts (hair, eyes, mouth, etc.) are quite blurry in VAE samples.</p>
<h2 id="conclusions">Conclusions</h2>
<p>Our new birthday paradox test seems to suggest that some well-regarded GANs are currently learning distributions with rather low support (i.e., they suffer mode collapse). The possibility of such a scenario was anticipated in the theoretical analysis of <a href="https://arxiv.org/abs/1703.00573">Arora et al.</a> reported in an earlier post.</p>
<p>This combination of theory and empirics raises the open problem of how to change GANs training to avoid such mode collapse. Possibly ALI/BiGANs point in the right direction, since they exhibit somewhat better diversity in our experiments. One should also try tuning hyperparameters/architectures in current methods, now that the birthday paradox test gives a concrete way to quantify mode collapse.</p>
<p>Finally, we should consider the possibility that the best use of GANs and related techniques could be feature learning or some other goal, as opposed to distribution learning. This needs further theoretical and empirical exploration.</p>
Thu, 06 Jul 2017 23:00:00 -0700
http://offconvex.github.io/2017/07/06/GANs3/
Unsupervised learning, one notion or many?<p><em>Unsupervised learning</em>, as the name suggests, is the science of learning from unlabeled data. A look at the <a href="https://en.wikipedia.org/wiki/Unsupervised_learning">wikipedia page</a> shows that this term has many interpretations:</p>
<p><strong>(Task A)</strong> <em>Learning a distribution from samples.</em> (Examples: gaussian mixtures, topic models, variational autoencoders,..)</p>
<p><strong>(Task B)</strong> <em>Understanding latent structure in the data.</em> This is not the same as (A); for example principal component analysis, clustering, manifold learning etc. identify latent structure but don’t learn a distribution per se.</p>
<p><strong>(Task C)</strong> <em>Feature Learning.</em> Learn a mapping from <em>datapoint</em> $\rightarrow$ <em>feature vector</em> such that classification tasks are easier to carry out on feature vectors rather than datapoints. For example, unsupervised feature learning could help lower the amount of <em>labeled</em> samples needed for learning a classifier, or be useful for <a href="https://en.wikipedia.org/wiki/Domain_adaptation"><em>domain adaptation</em></a>.</p>
<p>Task B is often a subcase of Task C, as the intended users of “structure found in data” are humans (scientists) who pore over the representation of data to gain some intuition about its properties, and these “properties” can often be phrased as a classification task.</p>
<p>This post explains the relationship between Tasks A and C, and why they get mixed up in students’ minds. We hope there is also some food for thought here for experts, namely, our discussion about the fragility of the usual “perplexity” definition of unsupervised learning. It explains why Task A doesn’t in practice lead to a good enough solution for Task C. For example, it has been believed for many years that for deep learning, unsupervised pretraining should help supervised training, but this has been hard to show in practice.</p>
<h2 id="the-common-theme-high-level-representations">The common theme: high level representations.</h2>
<p>If $x$ is a datapoint, each of these methods seeks to map it to a new “high level” representation $h$ that captures its “essence.”
This is why it helps to have access to $h$ when performing machine learning tasks on $x$ (e.g. classification).
The difficulty of course is that “high-level representation” is not uniquely defined. For example, $x$ may be an image, and $h$ may contain the information that it contains a person and a dog. But another $h$ may say that it shows a poodle and a person wearing pyjamas standing on the beach. This nonuniqueness seems inherent.</p>
<p>Unsupervised learning tries to learn high-level representation using unlabeled data.
Each method makes an implicit assumption about how the hidden $h$ relates to the visible $x$. For example, in k-means clustering the hidden $h$ consists of labeling the datapoint with the index of the cluster it belongs to.
Clearly, such a simple clustering-based representation has rather limited expressive power, since it groups datapoints into disjoint classes: this limits its application to complicated settings. For example, if one clusters images according to the labels “human”, “animal”, “plant” etc., then which cluster should contain an image showing a man and a dog standing in front of a tree?</p>
<p>The search for a descriptive language for talking about the possible relationships of representations and data leads us naturally to Bayesian models. (Note that these are viewed with some skepticism in machine learning theory – compared to assumptionless models like PAC learning, online learning, etc. – but we do not know of another suitable vocabulary in this setting.)</p>
<h2 id="a-bayesian-view">A Bayesian view</h2>
<p>Bayesian approaches capture the relationship between the “high level” representation $h$ and the datapoint $x$ by postulating a <em>joint distribution</em> $p_{\theta}(x, h)$ of the data $x$ and representation $h$, such that $p_{\theta}(h)$ and the posterior $p_{\theta}(x \mid h)$ have a simple form as a function of the parameters $\theta$. These are also called <em>latent variable</em> probabilistic models, since $h$ is a latent (hidden) variable.</p>
<p>The standard goal in distribution learning is to find the $\theta$ that “best explains” the data (what we called Task A above). This is formalized using maximum-likelihood estimation, going back to Fisher (~1910-1920): find the $\theta$ that maximizes the <em>log probability</em> of the training data. Mathematically, indexing the samples with $t$, we can write this as</p>
<script type="math/tex; mode=display">\max_{\theta} \sum_{t} \log p_{\theta}(x_t) \qquad (1)</script>
<p>where
<script type="math/tex">p_{\theta}(x_t) = \sum_{h_t}p_{\theta}(x_t, h_t).</script></p>
<p>(Note that $-\sum_{t} \log p_{\theta}(x_t)$ is also the empirical estimate of the <em>cross-entropy</em>
$-E_{x}[\log p_{\theta}(x)]$ of the distribution $p_{\theta}$, where $x$ is distributed according to $p^*$, the true distribution of the data. Thus the above method looks for the distribution with the best (lowest) cross-entropy on the empirical data, and this cross-entropy is also the log of the <a href="https://en.wikipedia.org/wiki/Perplexity"><em>perplexity</em></a> of $p_{\theta}$.)</p>
<p>In the limit of $t \to \infty$, this estimator is <em>consistent</em> (converges in probability to the ground-truth value) and <em>efficient</em> (has lowest asymptotic mean-square-error among all consistent estimators). See the <a href="https://en.wikipedia.org/wiki/Maximum_likelihood_estimation">Wikipedia page</a>. (Aside: maximum likelihood estimation is often NP-hard, which is one of the reasons for the renaissance of the method-of-moments and tensor decomposition algorithms in learning latent variable models, which <a href="http://www.offconvex.org/2015/12/17/tensor-decompositions/">Rong wrote about some time ago</a>.)</p>
<h3 id="toward-task-c-representations-arise-from-the-posterior-distribution">Toward task C: Representations arise from the posterior distribution</h3>
<p>Simply learning the distribution $p_{\theta}(x, h)$ does not yield a representation <em>per se.</em> To get a representation of $x$, we need access to the posterior $p_{\theta}(h \mid x)$: then a sample from this posterior can be used as a “representation” of a data-point $x$. (Aside: sometimes, in settings when $p_{\theta}(h \mid x)$ has a simple description, that description itself can be viewed as the representation of $x$.)</p>
<p>Thus solving Task C requires learning distribution parameters $\theta$ <em>and</em> figuring out how to efficiently sample from the posterior distribution.</p>
<p>Note that the sampling problem for the posterior can be #P-hard for very simple families. The reason is that by Bayes’ law, $p_{\theta}(h \mid x) = \frac{p_{\theta}(h) p_{\theta}(x \mid h)}{p_{\theta}(x)}$. Even if the numerator is easy to calculate, as is the case for simple families, the denominator $p_{\theta}(x)$ involves a big summation (or integral) and is often hard to calculate.</p>
<p>Note that max-likelihood parameter estimation (Task A) and approximating the posterior distributions $p(h \mid x)$ (Task C) can have radically different complexities: sometimes A is easy but C is NP-hard (example: topic modeling with “nice” topic-word matrices, but short documents; see also <a href="https://arxiv.org/abs/1411.6156">Bresler 2015</a>); or vice versa (example: topic modeling with long documents, but worst-case chosen topic matrices, <a href="https://arxiv.org/abs/1111.0952">Arora et al. 2011</a>).</p>
<p>Of course, one may hope (as usual) that computational complexity is a worst-case notion and may not apply in practice. But there is a bigger issue with this setup, having to do with accuracy.</p>
<h2 id="why-the-above-reasoning-is-fragile-need-for-high-accuracy">Why the above reasoning is fragile: Need for high accuracy</h2>
<p>The above description assumes that the parametric model $p_{\theta}(x, h)$ for the data was <em>exact</em>, whereas one imagines it is only <em>approximate</em> (i.e., suffers from modeling error). Furthermore, computational difficulties may restrict us to approximate inference even if the model were exact. So in practice, we may only have an <em>approximation</em> $q(h\mid x)$ to
the posterior distribution $p_{\theta}(h \mid x)$. (Below we describe a popular method to compute such approximations.)</p>
<blockquote>
<p><em>How good of an approximation</em> to the true posterior do we need?</p>
</blockquote>
<p>Recall, we are trying to answer this question through the lens of Task C, solving some classification task. We take the following point of view:</p>
<blockquote>
<p>For $t=1, 2,\ldots,$ nature picked some $(h_t, x_t)$ from the joint distribution and presented us $x_t$. The true label $y_t$ of $x_t$ is $\mathcal{C}(h_t)$ where $\mathcal{C}$ is an unknown classifier. Our goal is to classify according to these labels.</p>
</blockquote>
<p>To simplify notation, assume the output of $\mathcal{C}$ is binary. If we wish to use
$q(h \mid x)$ as a surrogate for the true posterior $p_{\theta}(h \mid x)$, we need
$\Pr_{x_t, h_t \sim q(\cdot \mid x_t)} [\mathcal{C}(h_t) \neq y_t]$ to be small as well.</p>
<p>How close must $q(h \mid x)$ and $p(h \mid x)$ be to let us conclude this? We will use KL divergence as “distance” between the distributions, for reasons that will become apparent in the following section. We claim the following:</p>
<blockquote>
<p>CLAIM: The probability of obtaining different answers on classification tasks done using the ground truth $h$ versus the representations obtained using $q(h_t \mid x_t)$ is less than $\epsilon$ if $KL(q(h_t \mid x_t) \parallel p(h_t \mid x_t)) \leq 2\epsilon^2.$</p>
</blockquote>
<p>Here’s a proof sketch. The natural distance between these two distributions $q(h \mid x)$ and $p(h \mid x)$ with respect to accuracy of classification tasks is the <em>total variation (TV)</em> distance. Indeed, if the TV distance between $q(h\mid x)$ and $p(h \mid x)$ is bounded by $\epsilon$, this implies that for any event $\Omega$,</p>
<script type="math/tex; mode=display">\left|\Pr_{h_t \sim p(\cdot \mid x_t)}[\Omega] - \Pr_{h_t \sim q(\cdot \mid x_t)}[\Omega]\right| \leq \epsilon .</script>
<p>The CLAIM now follows by instantiating this with the event $\Omega = $ “Classifier $\mathcal{C}$ outputs a different answer from $y_t$ given representation $h_t$ for input $x_t$”, and relating TV distance to KL divergence using <a href="https://en.wikipedia.org/wiki/Pinsker%27s_inequality">Pinsker’s inequality</a>, which gives</p>
<script type="math/tex; mode=display">\mbox{TV}(q(h_t \mid x_t),p(h_t \mid x_t)) \leq \sqrt{\frac{1}{2} KL(q(h_t \mid x_t) \parallel p(h_t \mid x_t))} \leq \epsilon</script>
<p>as we needed. This observation explains why solving Task A in practice does not automatically lead to very useful representations for classification tasks (Task C): the posterior distribution has to be learnt extremely accurately, which probably didn’t happen (either due to model mismatch or computational complexity).</p>
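<p>(As a quick numerical sanity check of Pinsker’s inequality, rather than a proof, one can compare the two sides for a pair of toy posteriors on a binary $h$:)</p>
<pre><code class="language-python">
import numpy as np

q = np.array([0.30, 0.70])   # approximate posterior over two values of h
p = np.array([0.25, 0.75])   # true posterior
tv = 0.5 * np.abs(q - p).sum()
kl = np.sum(q * np.log(q / p))
print(tv, np.sqrt(kl / 2.0), tv <= np.sqrt(kl / 2.0))   # Pinsker: TV bounded by sqrt(KL/2)
</code></pre>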
<h2 id="the--link-between-tasks-a-and-c-variational-methods">The link between Tasks A and C: variational methods</h2>
<p>As noted, distribution learning (Task A) via cross-entropy/maximum-likelihood fitting, and representation learning (Task C) via sampling the posterior are fairly distinct. Why do students often conflate the two? Because in practice the most frequent way to solve Task A does implicitly compute posteriors and thus also solves Task C.</p>
<p>The generic way to learn latent variable models involves variational methods, which can be viewed as a generalization of the famous EM algorithm (<a href="http://web.mit.edu/6.435/www/Dempster77.pdf">Dempster et al. 1977</a>).</p>
<p>Variational methods maintain at all times a <em>proposed distribution</em> $q(h \mid x)$ (called the <em>variational distribution</em>). The methods rely on the observation that for every such $q(h \mid x)$ the following lower bound holds:</p>
<script type="math/tex; mode=display">\log p(x) \geq E_{q(h \mid x)} \log p(x,h) + H(q(h\mid x)) \qquad (2)</script>
<p>where $H$ denotes Shannon entropy (or differential entropy, depending on whether $h$ is discrete or continuous). The RHS above is often called the <em>ELBO</em> (Evidence Lower BOund). This inequality follows from a bit of algebra using non-negativity of KL divergence, applied to the distributions $q(h \mid x)$ and $p(h\mid x)$. More concretely, the chain of inequalities is as follows,</p>
<script type="math/tex; mode=display">KL(q(h\mid x) \parallel p(h \mid x)) \geq 0 \Leftrightarrow E_{q(h|x)} \log \frac{q(h|x)}{p(h|x)} \geq 0</script>
<script type="math/tex; mode=display">\Leftrightarrow E_{q(h|x)} \log \frac{q(h|x)}{p(x,h)} + \log p(x) \geq 0</script>
<script type="math/tex; mode=display">\Leftrightarrow \log p(x) \geq E_{q(h|x)} \log p(x,h) + H(q(h\mid x))</script>
<p>Furthermore, <em>equality</em> is achieved if $q(h\mid x) = p(h\mid x)$. (This can be viewed as some kind of “duality” theorem for distributions, and dates all the way back to Gibbs. )</p>
<p>Algorithmically, observation (2) is used by forgoing the maximum-likelihood optimization (1), and solving instead</p>
<script type="math/tex; mode=display">\max_{\theta, q(h_t|x_t)} \sum_{t} E_{q(h_t\mid x_t)} \log p(x_t,h_t) + H(q(h_t\mid x_t))</script>
<p>Since the variables are naturally divided into two blocks: the model parameters $\theta$, and the variational distributions $q(h_t\mid x_t)$, a natural way to optimize the above is to <em>alternate</em> optimizing over each group, while keeping the other fixed. (This meta-algorithm is often called variational EM for obvious reasons.)</p>
<p>Of course, optimizing over all possible distributions $q$ is an ill-defined problem, so $q$ is constrained to lie in some parametric family (e.g., “standard Gaussian transformed by depth-$4$ neural nets of a certain size and architecture”) such that the above objective can at least be evaluated easily (typically it has a closed-form expression).</p>
<p>Clearly, if the parametric family of distributions is expressive enough, and the (non-convex) optimization problem doesn’t get stuck in bad local minima, then the variational EM algorithm will give us not only values of the parameters $\theta$ which are close to the ground-truth ones, but also variational distributions $q(h\mid x)$ which accurately track $p(h\mid x)$. But as we saw above, this accuracy would need to be very high to get meaningful representations.</p>
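<p>To make the ELBO bound (2) concrete, here is a tiny numerical sketch with a discrete latent variable: a two-component Gaussian mixture with made-up parameters. For any choice of $q$ the ELBO stays below $\log p(x)$, with equality exactly when $q$ is the true posterior.</p>
<pre><code class="language-python">
import numpy as np

# model: h in {0, 1} with prior pi, and x | h ~ N(mu_h, 1); parameters are made up
pi, mu, x = np.array([0.4, 0.6]), np.array([-2.0, 2.0]), 0.5

log_joint = np.log(pi) - 0.5 * np.log(2 * np.pi) - 0.5 * (x - mu) ** 2   # log p(x, h)
log_px = np.logaddexp(*log_joint)                                        # log p(x)
post = np.exp(log_joint - log_px)                                        # true posterior p(h | x)

def elbo(q):
    """E_q[log p(x,h)] + H(q) for a distribution q over the latent h."""
    return np.sum(q * log_joint) - np.sum(q * np.log(q))

# ELBO is below log p(x) for a generic q, and matches it at q = posterior
print(log_px, elbo(np.array([0.5, 0.5])), elbo(post))
</code></pre>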
<h2 id="next-post">Next Post</h2>
<p>In the next post, we will describe our recent work further clarifying this issue of representation learning via a Bayesian viewpoint.</p>
Mon, 26 Jun 2017 21:00:00 -0700
http://offconvex.github.io/2017/06/26/unsupervised1/
Generalization and Equilibrium in Generative Adversarial Networks (GANs)<p>The <a href="http://www.offconvex.org/2017/03/15/GANs/">previous post</a> described Generative Adversarial Networks (GANs), a technique for training generative models for image distributions (and other complicated distributions) via a 2-party game between a generator deep net and a discriminator deep net. This post describes <a href="https://arxiv.org/abs/1703.00573">my new paper</a> with Rong Ge, Yingyu Liang, Tengyu Ma, and Yi Zhang. We address some fundamental issues about generalization in GANs that have been debated since the beginning; e.g., in what sense is the learnt distribution close to the target distribution, and what kind of equilibrium exists between generator and discriminator.</p>
<p>The usual analysis of GANs, sketched in the previous post, assumes “sufficiently large number of samples and sufficiently large discriminator nets” to conclude that at the end of training the learnt distribution should be close to the target distribution. Our new analysis, which accounts for the finite capacity of the discriminator net, calls this into question.</p>
<p>Readers looking for new implementation ideas can skip ahead to the section below on our <em>Mix + GAN</em> protocol. It takes other GANs codes as black box and (by adding extra capacity and corresponding training time) often improves the learnt distribution in qualitative and quantitative measures. Our testing suggests that it works well out of the box.</p>
<p><strong>Notation</strong> Assume images are represented as vectors in $\Re^d$. Typically $d$ would be $1000$ or much higher. The <em>capacity</em> of the discriminator, namely, the number of trainable parameters, is denoted $n$. The distribution on all real-life images is denoted $P_{real}$. We assume that the number of distinct images in
$P_{real}$ —regardless of how one defines “distinct”—is enormous compared to all these parameters.</p>
<p>Recall that the discriminator $D$ is trained to distinguish between samples from $P_{real}$ and samples from the generator’s distribution
$P_{synth}$. This can be formalized using different measures (leading to different GANs objectives) and for simplicity our exposition here uses the <em>distinguishing probability</em> which is used in <a href="https://arxiv.org/abs/1701.07875">Wasserstein GAN</a> objective:</p>
<script type="math/tex; mode=display">|~E_{x \in P_{real}}[D(x)] - E_{x \in P_{synth}}[D(x)]|\qquad (1).</script>
<p>(Readers with a background in theoretical CS and cryptography will be reminded of similar definitions in theory of pseudorandom generators.)</p>
<h2 id="finite-discriminators-have-limited-power">Finite Discriminators have limited power</h2>
<p>The following simple fact shows why ignoring the discriminator’s capacity constraint can lead to grossly incorrect intuition.
(The constant $C$ is fairly small and explained in the paper.)</p>
<blockquote>
<p><strong>Theorem 1</strong> Suppose the discriminator has capacity $n$. Then expression (1) is less than $\epsilon$ when $P_{synth}$ is the following: uniform distribution on a random sample of $C n/\epsilon^2 \log n$ images from $P_{real}$.</p>
</blockquote>
<p>Note that this theorem is not a formalization of the usual failure mode discussed in GANs literature, whereby the generator simply memorizes the training images. The theorem still applies if we allow the discriminator to use a large set of held out images from $P_{real}$, which are completely different than images in $P_{synth}$. Or, if the training set of images is much larger than $C n/\epsilon^2 \log n$ images. Furthermore, common measures of diversity/novelty used in research papers (e.g., pick a random image from $P_{synth}$ and check the “distance” to the nearest neighbor among the training set) are not guaranteed to get around the problem raised by Theorem 1.</p>
<p>Since $Cn/\epsilon^2\log n$ is rather small, this theorem says that a finite-capacity discriminator is unable to even enforce that $P_{synth}$ has large <em>diversity</em>, let alone enforce that $P_{synth}\approx P_{real}$. The theorem does not imply that existing GANs do not <em>in practice</em> generate diverse distributions; merely that the current analyses give no reason to believe that they do so.</p>
<p>The proof of the Theorem is a standard sampling argument from learning theory: take an $\epsilon$-net in the continuum of all deep nets of capacity $n$ and a fixed architecture, and do a union bound. Please see the paper for details. (Aside: the “$\epsilon^2$” term in Theorem 1 arises from this argument, and is ubiquitous in ML theory.)</p>
<p>Motivated by this theorem, we argue in the paper that the correct way to think about generalization for GANs is not the usual distance functions between distributions such as Jensen-Shannon or Wasserstein, but a new distance we define called <em>Neural net distance</em>. The neural net distance measures the ability of finite-capacity deep nets to distinguish the distributions. It can be small even when the other distances are large (as illustrated in the above theorem).</p>
<h3 id="corollary-larger-training-sets-have-limited-utility">Corollary: Larger training sets have limited utility</h3>
<p>In fact Theorem 1 has the following rephrasing. Suppose we have a very large training set of images. If the discriminator has capacity $n$, then it suffices to take a subsample of size $C n/\epsilon^2 \log n$ from this training set, and we are guaranteed that GANs training using this subsample is capable of achieving a training objective that is within $\epsilon$ of the best achieved with the full training set. (Any more samples can only improve the training objective by at most $\epsilon$.)</p>
<h2 id="existence-of-equilibrium">Existence of Equilibrium</h2>
<p>Let’s recall the objective used in GAN (for simplicity, we again stick with the Wasserstein GAN):</p>
<p><script type="math/tex">\min_{G} \max_{D}~~E_{x\sim P_{real}}[D(x)] - E_{h}[D(G(h))] \qquad (1)</script>
where $G$ is the generator net, and $P_{synth}$ is the distribution of $G(h)$ where $h$ is a random seed. Researchers have noted that this is implicitly a $2$-person game and it may not have an equilibrium; e.g., see the discussion around Figure 22 in <a href="https://arxiv.org/pdf/1701.00160.pdf">Goodfellow’s survey</a>. An <em>equilibrium</em> corresponds to a $G$ and a $D$ such that the pair are still a solution if we switch the order of min and max in (1). (That is, $G$ doesn’t have incentive to switch in response to $D$, and vice versa.) Lack of equilibrium may be a cause of oscillatory behavior observed in training.</p>
<p>But ideally we wish to show something stronger than mere <em>existence</em> of equilibrium: we wish to exhibit an equilibrium where the generator <em>wins</em>, with the objective above at zero or close to zero (in other words, discriminator is unable to distinguish between the two distributions).</p>
<p>We will prove existence of an $\epsilon$-<em>approximate equilibrium</em>, whereby switching the order of $G, D$ affects the expression by at most $\epsilon$. (That is, $G$ has only limited incentive to switch in response to $D$, and vice versa.) Naively one would imagine that proving such a result involves some insight into the distribution $P_{real}$, but surprisingly none is needed.</p>
<blockquote>
<p><strong>Theorem 2</strong> If a generator net of capacity $T$ is able to generate a Gaussian distribution in $\Re^d$, then there exists an $\epsilon$-approximate equilibrium in the game where the generator has capacity $O(n T\log n/\epsilon^2 )$.</p>
</blockquote>
<p><em>Proof sketch:</em> A classical result in nonparametric statistics states that $P_{real}$ can be well-approximated by an <em>infinite</em> mixture of standard Gaussians. Now take a sample of size $O(n\log n/\epsilon^2)$ from this infinite mixture, and let $G$ be a uniform mixture on this finite sample of Gaussians. By an argument similar to Theorem 1, the distribution output by $G$ will be indistinguishable from $P_{real}$ by every deep net of capacity $n$. Finally, fold in this mixture of $O(n\log n/\epsilon^2)$ Gaussians into a single generator by using a small “selector” circuit that selects between these with the correct probability.</p>
<p>This theorem only shows <em>existence</em> of a particular equilibrium. What a GAN may actually find in practice using backpropagation is not addressed.</p>
<p>Finally, if we are interested in objectives other than Wasserstein GAN, then a similar proof can show the existence of an
$\epsilon$-approximate <em>mixed</em> equilibrium, namely, where the discriminator and generator are themselves small mixtures of
deep nets.</p>
<p><em>Aside:</em> The sampling idea in this proof goes back to <a href="http://dl.acm.org/citation.cfm?id=195447">Lipton and Young 1994</a>.
Similar ideas have also appeared in study of pseudorandomness (see <a href="http://ieeexplore.ieee.org/document/5231258/citations">Trevisan et al 2009</a>) and model criticism (see <a href="http://www.jmlr.org/papers/volume13/gretton12a/gretton12a.pdf">Gretton et al 2012</a>.)</p>
<h2 id="mix--gan-protocol">MIX + GAN protocol</h2>
<p>Our theory shows that using a mixture of (not too many) generators and discriminators guarantees existence of an approximate equilibrium. This suggests that GANs training may be better and more stable if we replace the single generator and discriminator with mixtures of generators and discriminators.</p>
<p>Of course, it is impractical to use very large mixtures, so we propose <strong>MIX + GAN</strong>: use a mixture of $k$ components, where $k$ is as large as allowed by the size of GPU memory. Namely, train a mixture of $k$ generators $\{G_{u_i}\}_{i\in [k]}$ and $k$ discriminators $\{D_{v_i}\}_{i\in [k]}$. All components of the mixture share the same network architecture but have their own trainable parameters. Maintaining a mixture means of course maintaining a weight $w_{u_i}$ for the generator $G_{u_i}$, which corresponds to the probability of selecting the output of $G_{u_i}$. These weights are also updated via backpropagation. This heuristic can be combined with existing
methods like DC-GAN, W-GAN etc., giving us new training methods MIX + DC-GAN, MIX + W-GAN etc.</p>
<p>Some other experimental details: store the mixture probabilities in log-space and use <a href="https://users.soe.ucsc.edu/~manfred/pubs/J36.pdf">exponentiated gradient</a> to update them. Use an entropy regularizer to prevent collapse of the mixture onto a single component. All of these are theoretically justified if $k$ were very large, and are only heuristics when $k$ is as small as $5$.</p>
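<p>For concreteness, here is a minimal PyTorch-style sketch of how the trainable mixture weights can sit alongside the $k$ generators and discriminators. This is schematic (written for clarity, not efficiency) and not the exact training code from the paper; the architectures, $k$, and the entropy coefficient are illustrative placeholders.</p>
<pre><code>import torch
import torch.nn as nn

k, noise_dim, img_dim = 5, 100, 784          # illustrative sizes

def make_gen():
    return nn.Sequential(nn.Linear(noise_dim, 256), nn.ReLU(), nn.Linear(256, img_dim))

def make_disc():
    return nn.Sequential(nn.Linear(img_dim, 256), nn.ReLU(), nn.Linear(256, 1))

gens  = nn.ModuleList([make_gen()  for _ in range(k)])
discs = nn.ModuleList([make_disc() for _ in range(k)])

# Mixture weights are stored as logits ("using logarithm"); a softmax turns them into
# probabilities, and gradient updates on the logits act multiplicatively on the
# probabilities, in the spirit of exponentiated-gradient updates.
gen_logits  = torch.zeros(k, requires_grad=True)
disc_logits = torch.zeros(k, requires_grad=True)

def mixture_payoff(real_batch, noise_batch):
    """Wasserstein-style payoff E[D(x)] - E[D(G(h))], averaged over the mixture."""
    gw = torch.softmax(gen_logits, dim=0)
    dw = torch.softmax(disc_logits, dim=0)
    payoff = 0.0
    for j in range(k):
        fake = gens[j](noise_batch)
        for i in range(k):
            payoff = payoff + dw[i] * gw[j] * (discs[i](real_batch).mean() - discs[i](fake).mean())
    return payoff

def entropy(logits):
    p = torch.softmax(logits, dim=0)
    return -(p * torch.log(p + 1e-8)).sum()

# In each round, the discriminator side ascends (payoff + c * entropy(disc_logits)) while
# the generator side descends (payoff - c * entropy(gen_logits)), so each side is also
# rewarded for keeping its own mixture weights spread out rather than collapsing.
</code></pre>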
<p>We show that MIX+GAN can improve performance qualitatively (i.e., the images look better) and also quantitatively using popular measures such as Inception score.</p>
<div style="text-align:center;">
<img style="width:375px;" src="http://www.cs.princeton.edu/~arora/img/celeA_dcgan.png" /> $\quad$ <img style="width:375px;" src="http://www.cs.princeton.edu/~arora/img/celeA_mixgan.png" />
</div>
<p>Note that using a mixture increases the capacity of the model by a factor $k$, so it may not be entirely fair to compare the performance of MIX + X with X. On the other hand, in general it is not easy to get substantial performance benefit from increasing deep net capacity (in fact obvious ways of adding capacity that we tried actually reduced performance) whereas here the benefit happens out of the box.</p>
<p>Note that a mixture of generators or discriminators has been used in several recent works (cited in our paper), but we are not aware of any attempts to use a trainable mixture as above.</p>
<h2 id="take-away-lessons">Take-Away Lessons</h2>
<p>Complete understanding of GANs is challenging since we cannot even fully analyse simple backpropagation, let alone backpropagation combined with game-theoretic complications.</p>
<p>We therefore set aside issues of algorithmic convergence and focused on generalization and equilibrium, both of which concern the value of the training objective at the optimum. Our analysis suggests the following:</p>
<p>(a) Current GANs training uses a finite capacity deep net to distinguish between the synthetic and real distributions. This training criterion by itself seems insufficient to ensure even good <em>diversity</em> in the synthetic distribution, let alone that it is actually very close to $P_{real}$ (Theorem 1). A first step to fix this would be to focus on ways to ensure higher diversity, which is a necessary step towards ensuring $P_{synth} \approx P_{real}$.</p>
<p>(b) Our results seem to pose a conundrum about the GANs idea which I personally have not been able to resolve. Usually, we believe that adding capacity to the generator allows it to gain representational power to model more fine-grained facts about the world and thus produce more realistic and diverse distributions. The downside of adding capacity is <em>overfitting</em>, which can be mitigated using more training images. Thus one imagines that the ideal configuration is:</p>
<blockquote>
<p>Number of training images > Generator capacity > Discriminator capacity.</p>
</blockquote>
<p>Theorem 1 suggests that if the discriminator has capacity $n$, then it derives very little benefit (at least in terms of the training objective) from a training set of more than $C (n\log n)/\epsilon^2$ images. Furthermore, there exist equilibria where the generator’s distribution is not too diverse.</p>
<p>So how can we change GANs training so that it ensures that $P_{synth}$ has high diversity? Some possibilities are:
(a) cap the generator capacity to be much below the discriminator capacity. This might work, but I don’t see a mathematical reason why, and it certainly flies against the usual intuition that, so long as the training dataset is large enough, more capacity allows generators to produce more realistic images; (b) high diversity results from some as-yet unknown property of the backpropagation algorithm; (c) change the GANs setup in some other way.</p>
<p>At the very least our paper suggests that an explanation for good performance in GANs must draw upon some delicate interplay between the power of the generator, the power of the discriminator, and the backpropagation algorithm. This interplay was overlooked in previous analyses, which assumed discriminators of infinite capacity.</p>
<p>(<em>I thank Moritz Hardt, Kunal Talwar, and Luca Trevisan for their comments and help with references.</em>)</p>
Thu, 30 Mar 2017 11:00:00 -0700
http://offconvex.github.io/2017/03/30/GANs2/

Generative Adversarial Networks (GANs), Some Open Questions<p>Since the ability to generate “realistic-looking” data may be a step towards understanding its structure and exploiting it, generative models are an important component of unsupervised learning, which has been a frequent theme on this blog. Today’s post is about Generative Adversarial Networks (GANs), introduced in 2014 by <a href="https://arxiv.org/abs/1406.2661">Goodfellow et al.</a>, which have quickly become a very popular way to train generative models for complicated real-life data. It involves a game-theoretic tussle between a generator player and a discriminator player, which is very attractive and may be useful in other settings.</p>
<p>This post describes GANs and raises some open questions about them. The next post will describe <a href="https://arxiv.org/abs/1703.00573">our recent paper</a> addressing these questions.</p>
<p>A generative model $G$ can be seen as taking a random seed $h$ (say, a sample from a multivariate Normal distribution) and converting it into an output string $G(h)$ that “looks” like a real datapoint. Such models are popular in classical statistics but the simpler ones like Gaussian Mixtures or Dirichlet Processes seem insufficient for modeling complicated distributions on natural images or natural language. Generative models are also popular in statistical physics, e.g., Ising models and their cousins. These physics models migrated into machine learning and neuroscience in the 1980s and 1990s, which led to a new generative view of neural nets (e.g., Hinton’s <a href="https://en.wikipedia.org/wiki/Restricted_Boltzmann_machine">Restricted Boltzmann Machines</a>) which in turn led to multilayer generative models such as stacked denoising autoencoders and variational autoencoders. At their heart, these are nothing but multilayer neural nets that transform the random seed into an output that looks like a realistic image. The primary differences in the model concern details of training. Here is the obligatory set of generated images (source: <a href="https://openai.com/blog/generative-models/">OpenAI blog</a>)</p>
<div style="text-align:center;">
<img style="height=350px" src="https://openai.com/assets/research/generative-models/gans-2-6345b04cb02f720a95ea4cb9483e2fd5a5f6e46ec6ea5bbefadf002a010cda82.jpg" />
</div>
<h2 id="gans-the-basic-framework">GANs: The basic framework</h2>
<p>GANs also train a deep net $G$ to produce realistic images, but the new and beautiful twist lies in a novel training procedure.</p>
<p>To understand the new twist let’s first discuss what it could mean for the output to “look” realistic. A classic evaluation for generative models is <a href="https://en.wikipedia.org/wiki/Perplexity"><em>perplexity</em></a>: a measure of the amount of probability it gives to actual images. This requires that the generative model must be accompanied by an algorithm that computes the probability density function for the generated distribution (i.e., given any image, it must output an estimate of the probability that the model outputs this image.) I might do a future blog post discussing pros and cons of the perplexity measure, but today let’s instead dive straight to GANs, which sidestep the need for perplexity computations.</p>
<blockquote>
<p><strong>Idea 1:</strong> Since deep nets are good at recognizing images —e.g., distinguishing pictures of people from pictures of cats—why not let a deep net be the judge of the outputs of a generative model?</p>
</blockquote>
<p>More concretely, let $P_{real}$ be the distribution over real images, and $P_{synth}$ the one output by the model (i.e., the distribution of $G(h)$ when $h$ is a random seed). We could try to train a discriminator deep net $D$ that maps images to numbers in $[0,1]$ and tries to discriminate between these distributions in the following sense: its expected output $E_{x}[D(x)]$ should be as high as possible when $x$ is drawn from $P_{real}$ and as low as possible when $x$ is drawn from $P_{synth}$. This training can be done with the <a href="http://www.offconvex.org/2016/12/20/backprop/">usual backpropagation</a>. If the two distributions are identical then of course no such deep net can exist, and so the training will end in failure. If on the other hand we are able to train a good discriminator deep net —one whose average output is noticeably different between real and synthetic samples— then this is proof positive that the two distributions are different. (There is an in-between case, whereby the distributions are different but the discriminator net doesn’t detect a difference. This is going to be important in the story in the next post.) A natural next question is whether the ability to train such a discriminator deep net can help us improve the generative model.</p>
<blockquote>
<p><strong>Idea 2:</strong> If a good discriminator net has been trained, use it to provide “gradient feedback” that improves the generative model.</p>
</blockquote>
<p>Let $G$ denote the generator net, which means that samples in $P_{synth}$ are obtained by sampling a Gaussian seed $h$ and computing $G(h)$. The natural goal for the generator is to make $E_{h}[D(G(h))]$ as high as possible, because that means it is doing better at fooling the discriminator $D$. So if we fix $D$, the natural way to improve $G$ is to pick a few random seeds $h$, and slightly adjust the trainable parameters of $G$ to increase this objective. Note that this gradient computation involves backpropagation through the composed net $D(G(\cdot))$.</p>
<p>Of course, if we let the generator improve itself, it also makes sense to then let the discriminator improve itself too, which leads to:</p>
<blockquote>
<p><strong>Idea 3:</strong> Turn the training of the generative model into a game of many moves or alternations.</p>
</blockquote>
<p>Each move for the discriminator consists of taking a few samples from $P_{real}$ and $P_{synth}$ and improving its ability to discriminate between them. Each move for the generator consists of producing a few samples from $P_{synth}$ and updating its parameters so that $E_{h}[D(G(h))]$ goes up a bit.</p>
<p>Notice, the discriminator always uses the generator as a black box —i.e., never examines its internal parameters —whereas the generator needs the discriminator’s parameters to compute its gradient direction. Also, the generator does not ever use real images from $P_{real}$ for its computation. (Though of course it does rely on the real images indirectly since the discriminator is trained using them.)</p>
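<p>To make the alternation concrete, here is a minimal PyTorch-style sketch of these moves (purely illustrative and schematic, not any particular paper’s training script; the architectures, learning rates, and number of discriminator moves are placeholder choices). The discriminator pushes $E_x[D(x)]$ up on real images and down on synthetic ones; the generator then pushes $E_h[D(G(h))]$ up by backpropagating through $D(G(\cdot))$.</p>
<pre><code>import torch
import torch.nn as nn

noise_dim, img_dim = 100, 784          # illustrative sizes
G = nn.Sequential(nn.Linear(noise_dim, 256), nn.ReLU(), nn.Linear(256, img_dim))
D = nn.Sequential(nn.Linear(img_dim, 256), nn.ReLU(), nn.Linear(256, 1))
opt_G = torch.optim.Adam(G.parameters(), lr=1e-4)
opt_D = torch.optim.Adam(D.parameters(), lr=1e-4)

def one_round(real_batch, d_steps=5):
    # Discriminator moves: raise E[D(x)] on real samples and lower it on synthetic ones.
    # The generator is used only as a black box here (its output is detached).
    for _ in range(d_steps):
        h = torch.randn(real_batch.size(0), noise_dim)
        fake = G(h).detach()
        d_loss = -(D(real_batch).mean() - D(fake).mean())
        opt_D.zero_grad(); d_loss.backward(); opt_D.step()

    # Generator move: increase E_h[D(G(h))] by backpropagating through D(G(.)).
    h = torch.randn(real_batch.size(0), noise_dim)
    g_loss = -D(G(h)).mean()
    opt_G.zero_grad(); g_loss.backward(); opt_G.step()
</code></pre>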
<h2 id="gans-more-details">GANS: More details</h2>
<p>One can fill in the above framework in multiple ways. The most obvious is that the generator could try to maximize $E_{h}[f(D(G(h)))]$ where $f$ is some increasing function. (We call this the <em>measuring function.</em>) This has the effect of giving different importance to different samples. Goodfellow et al. originally used $f(x)=\log (x)$, which, since the derivative of $\log x$ is $1/x$, implicitly gives much more importance to synthetic data $G(h)$ where the discriminator outputs very low values $D(G(h))$. In other words, using $f(x) =\log x$ makes the training more sensitive to instances which the discriminator finds terrible than to instances which the discriminator finds so-so. By contrast, the above sketch implicitly used $f(x) =x$, which gives the same importance to all samples and appears in the recent <a href="https://arxiv.org/abs/1701.07875">Wasserstein GAN</a>.</p>
<p>The discussion thus leads to the following mathematical formulation, where $D, G$ are deep nets with specified architecture and whose number of parameters is fixed in advance by the algorithm designer.</p>
<script type="math/tex; mode=display">\min_{G} \max_{D}~~E_{x\sim P_{real}}[f(D(x))] + E_{h}[f(1-D(G(h)))]. \qquad (1)</script>
<p>There is now a big industry of improving this basic framework using various architectures and training variations, e.g. (a random sample; possibly missing some important ones): <a href="https://arxiv.org/abs/1511.06434v2">DC-GAN</a>, <a href="https://arxiv.org/abs/1612.04357">S-GAN</a>, <a href="https://arxiv.org/abs/1609.04802">SR-GAN</a>, <a href="https://arxiv.org/abs/1606.03657">INFO-GAN</a>, etc.</p>
<p>Usually, the training is continued until the generator wins, meaning the discriminator’s expected output on samples from $P_{real}$ and $P_{synth}$ becomes the same. But a serious practical difficulty is that training in practice is oscillatory, and the above objective is observed to go up and down. This is unlike usual deep net training, where training (at least in cases where it works) steadily improves the objective.</p>
<h2 id="gans-some-open-questions">GANS: Some open questions</h2>
<p>(a) <em>Does an equilibrium exist?</em></p>
<p>Since GAN training is a 2-person game, the oscillatory behavior mentioned above is not unexpected. Just as a necessary condition for gradient descent to come to a stop is that the current point is a stationary point (i.e., the gradient is zero), the corresponding condition in a 2-person game is an <em>equilibrium</em>: each player’s move happens to be its optimal response to the other’s move. In other words, switching the order of $\min$ and $\max$ in expression (1) doesn’t change the objective. The GAN formulation above needs a so-called pure equilibrium, which may not exist in general. A simple example is the classic rock/paper/scissors game. Regardless of whether one player plays rock, paper or scissors as a move, the other can counter with a move that beats it. Thus no pure equilibrium exists.</p>
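<p>A quick sanity check of the rock/paper/scissors claim (a toy sketch, separate from the GAN discussion; the payoff-matrix encoding is just one convenient choice) is to enumerate all pure strategy pairs and verify that one of the players always has a profitable deviation:</p>
<pre><code>import numpy as np

# Payoff to the row player: rows/columns are (rock, paper, scissors).
A = np.array([[ 0, -1,  1],
              [ 1,  0, -1],
              [-1,  1,  0]])

for i in range(3):
    for j in range(3):
        row_can_improve = A[:, j].max() > A[i, j]       # row player deviates
        col_can_improve = (-A[i, :]).max() > -A[i, j]   # column player deviates
        if not (row_can_improve or col_can_improve):
            print("pure equilibrium at", (i, j))
# Nothing is printed: no pure strategy pair is an equilibrium.
</code></pre>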
<p>(b) <em>Does an equilibrium exist where the generator wins, i.e. discriminator ends up unable to distinguish the two distributions on finite samples?</em></p>
<p>(c) <em>Suppose the generator wins. What does this say about whether or not</em> $P_{real}$ <em>is close to</em> $P_{synth}$ ?</p>
<p>Question (c) has dogged GANs research from the start. Has the generative model actually learned something meaningful about real life images, or is it somehow memorizing existing images and presenting trivial modifications? (Recall that $G$ is never exposed directly to real images, so any “memorizing” has to happen via the gradient propagated through the discriminator.)</p>
<p>If the generator’s win does indeed say that $P_{real}$ and $P_{synth}$ are close, then we think of the GANs training as <em>generalizing.</em> (This is by analogy to the usual notion of generalization for supervised learning.)</p>
<p>In fact, the next post will show that this issue is indeed more subtle than hitherto recognized. But to complete the backstory I will summarize how this issue has been studied so far.</p>
<h2 id="past-efforts-at-understanding-generalization">Past efforts at understanding generalization</h2>
<p>The original paper of Goodfellow et al. introduced an analysis of generalization (since adopted by other researchers) that works when deep nets are trained with “sufficiently high capacity, samples and training time” (to use their phrasing).</p>
<p>For the original objective function with $f(x) =\log x$, if the optimal discriminator is allowed to be any function at all (i.e., not just one computable by a finite capacity neural net), it can be checked that the optimal choice is $D(x) = P_{real}(x)/(P_{real}(x)+P_{synth}(x))$.
Substituting this in the GANs objective, up to linear transformation the maximum value achieved by discriminator turns out to be
equivalent to the <a href="https://en.wikipedia.org/wiki/Jensen%E2%80%93Shannon_divergence">Jensen-Shannon (JS) divergence</a> between $P_{real}$ and $P_{synth}$.
Hence if a generator wins the game against this <em>ideal</em> discriminator on a <em>very large</em> number of samples, then $P_{real}$ and $P_{synth}$ are close in JS divergence, and thus the model has learnt the true distribution.</p>
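<p>For intuition, here is a small numpy sketch (on toy discrete distributions, purely illustrative) checking that plugging the ideal discriminator into the $f(x)=\log x$ objective gives exactly $2\cdot JS(P_{real}, P_{synth}) - \log 4$, i.e., the JS divergence up to a linear transformation:</p>
<pre><code>import numpy as np

rng = np.random.default_rng(0)
p = rng.random(10) + 0.1; p /= p.sum()     # stand-in for P_real on 10 outcomes
q = rng.random(10) + 0.1; q /= q.sum()     # stand-in for P_synth

def kl(a, b):
    return np.sum(a * np.log(a / b))

m = 0.5 * (p + q)
js = 0.5 * kl(p, m) + 0.5 * kl(q, m)       # Jensen-Shannon divergence (natural log)

d_opt = p / (p + q)                        # the ideal discriminator
objective = np.sum(p * np.log(d_opt)) + np.sum(q * np.log(1.0 - d_opt))

print(np.allclose(objective, 2.0 * js - np.log(4.0)))   # True
</code></pre>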
<p>A similar analysis for <a href="https://arxiv.org/abs/1701.07875">Wasserstein GANs</a> shows that if the generator wins using the Wasserstein objective (i.e., $f(x) =x$) then the two distributions are close in <a href="https://en.wikipedia.org/wiki/Wasserstein_metric">Wasserstein or earth-mover distance</a>.</p>
<p>But we will see in the next post that these analyses can be misleading because in practice, deep nets have (very) finite capacity and
sample size. Thus even if training produces the optimal discriminator, the above analyses can be very off.</p>
<h2 id="further-resources">Further resources</h2>
<p>OpenAI has a <a href="https://openai.com/blog/generative-models/">brief survey of recent approaches</a> to generative models. The <a href="http://www.inference.vc/">inFERENCe blog</a> has many articles on GANs.</p>
<p><a href="https://arxiv.org/pdf/1701.00160.pdf">Goodfellow’s survey</a> is the most authoritative account of this burgeoning field, and gives tons of insight. The text around Figure 22 discusses oscillation and lack of equilibria.
He also discusses how GANs trained on a broad spectrum of images seem to get confused and output images that are realistic at the micro level but nonsensical overall; e.g., an animal with a leg coming out of its head. Clearly this field, despite its promise, has many open questions!</p>
Wed, 15 Mar 2017 06:00:00 -0700
http://offconvex.github.io/2017/03/15/GANs/

Back-propagation, an introduction<p>Given the sheer number of backpropagation tutorials on the internet, is there really a need for another? One of us (Sanjeev) recently taught backpropagation in <a href="https://www.cs.princeton.edu/courses/archive/fall16/cos402/">undergrad AI</a> and couldn’t find any account he was happy with. So here’s our exposition, together with some history and context, as well as a few advanced notions at the end. This article assumes the reader knows the definitions of gradients and neural networks.</p>
<h2 id="what-is-backpropagation">What is backpropagation?</h2>
<p>It is the basic algorithm in training neural nets, apparently independently rediscovered several times in the 1970-80’s (e.g., see Werbos’ <a href="https://www.researchgate.net/publication/35657389_Beyond_regression_new_tools_for_prediction_and_analysis_in_the_behavioral_sciences">Ph.D. thesis</a> and <a href="http://www.wiley.com/WileyCDA/WileyTitle/productCd-0471598976.html">book</a>, and <a href="http://www.nature.com/nature/journal/v323/n6088/abs/323533a0.html">Rumelhart et al.</a>). Some related ideas existed in control theory in the 1960s. (One reader points out another independent rediscovery, the Baur-Strassen lemma from 1983.)</p>
<p>Backpropagation gives a fast way to compute the sensitivity of the output of a neural network to all of its parameters while keeping the inputs of the network fixed: specifically it computes all partial derivatives ${\partial f}/{\partial w_i}$ where $f$ is the output and $w_i$ is the $i$th parameter. (Here <em>parameters</em> can be edge weights or biases associated with nodes or edges of the network, and the precise details of the node computations —e.g., the precise form of nonlinearity like Sigmoid or RELU— are unimportant.) Doing so gives the <em>gradient</em> $\nabla f$ of $f$ with respect to its network parameters, which allows a <em>gradient descent</em> step in the training: change all parameters simultaneously to move the vector of parameters a small amount in the direction $-\nabla f$.</p>
<p>Note that backpropagation computes the gradient exactly, but properly training neural nets needs many more tricks than just backpropagation. Understanding backpropagation is useful for appreciating some advanced tricks.</p>
<p>The importance of backpropagation derives from its efficiency. Assuming node operations take unit time, the running time is <em>linear</em>, specifically, $O(\text{Network Size}) = O(V + E)$, where $V$ is the number of nodes in the network and $E$ is the number of edges. The only technical ingredient is chain rule from calculus, but applying it naively would have resulted in quadratic running time—which would be hugely inefficient for networks with millions or even thousands of parameters.</p>
<p>Backpropagation can be efficiently implemented using highly parallel vector operations available in today’s <a href="https://en.wikipedia.org/wiki/General-purpose_computing_on_graphics_processing_units">GPUs (Graphical Processing Units)</a>, which play an important role in the recent neural nets revolution.</p>
<p><strong>Side Note:</strong> Expert readers will recognize that in the standard accounts of neural net training,
the actual quantity of interest is the gradient of the <em>training loss</em>, which happens to be a simple function of the network output. But the above phrasing is fully general since one can simply add a new output node to the network that computes the training loss from the old output. Then the quantity of interest is indeed the gradient of this new output with respect to network parameters.</p>
<h2 id="problem-setup">Problem Setup</h2>
<p>Backpropagation applies only to acyclic networks with directed edges. (Later we briefly sketch its use on networks with cycles.)</p>
<p>Without loss of generality, acyclic networks can be visualized as being structured in numbered layers, with nodes in the $t+1$th layer getting all their inputs from the outputs of nodes in layers $t$ and earlier. We use $f \in \mathbb{R}$ to denote the output of the network. In all our figures, the input of the network is at the bottom and the output on the top.</p>
<p>We start with a simple claim that reduces the problem of computing the gradient to the problem of computing partial derivatives with respect to the nodes:</p>
<blockquote>
<p><strong>Claim 1:</strong> To compute the desired gradient with respect to the parameters, it suffices to compute $\partial f/\partial u$ for every node $u$.</p>
</blockquote>
<p>Let’s be clear what $\partial f/\partial u$ means. Suppose we cut off all the incoming edges of the node $u$, and fix/clamp the current values of all network parameters. Now imagine changing $u$ from its current value. This change may affect values of nodes at higher levels that are connected to $u$, and the final output $f$ is one such node. Then $\partial f/\partial u$ denotes the rate at which $f$ will change as we vary $u$. (Aside: Readers familiar with the usual exposition of back-propagation should note that there $f$ is the training error and this $\partial f/\partial u$ turns out to be exactly the “error” propagated back to the node $u$.)</p>
<p>Claim 1 is a direct application of the chain rule, and let’s illustrate it for a simple neural net (we address more general networks later). Suppose node $u$ is a weighted sum of the nodes $z_1,\dots, z_n$ (which will be passed through a non-linear activation $\sigma$ afterwards). That is, we have $u = w_1z_1+\dots+w_nz_n$. By the chain rule, we have</p>
<script type="math/tex; mode=display">\frac{\partial f}{\partial w_1} = \frac{\partial f}{\partial u}\cdot \frac{\partial{u}}{\partial w_1} = \frac{\partial f}{\partial u}\cdot z_1.</script>
<p>Hence, we see that having computed $\partial f/\partial u$ we can compute $\partial f/\partial w_1$, and moreover this can be done locally by the endpoints of the edge where $w_1$ resides.</p>
<div style="text-align:center;">
<img style="width:500px;" src="http://ai.stanford.edu/~tengyuma/forblog/weight5.jpg" />
</div>
<h3 id="multivariate-chain-rule">Multivariate Chain Rule</h3>
<p>Towards computing the derivatives with respect to the nodes, we first recall the multivariate Chain rule, which handily describes the relationships between these partial derivatives (depending on the graph structure).</p>
<p>Suppose a variable $f$ is a function of variables $u_1,\dots, u_n$, which in turn depend on the variable $z$. Then, multivariate Chain rule says that</p>
<script type="math/tex; mode=display">\frac{\partial f}{\partial z} = \sum_{j=1}^n \frac{\partial f}{\partial u_j}\cdot \frac{\partial u_j}{\partial z}~.</script>
<p>This is a direct generalization of eqn. (2) and a sub-case of eqn. (11) in this <a href="http://mathworld.wolfram.com/ChainRule.html">description of chain rule</a>.</p>
<p>This formula is perfectly suitable for our cases. Below is the same example as we used before but with a different focus and numbering of the nodes.</p>
<div style="text-align:center;">
<img style="width:500px;" src="http://ai.stanford.edu/~tengyuma/forblog/chain_rule_5.jpg" />
</div>
<p>We see that given that we’ve computed the derivatives with respect to all the nodes above the node $z$, we can compute the derivative with respect to the node $z$ via a weighted sum, where the weights involve the local derivatives ${\partial u_j}/{\partial z}$ that are often easy to compute. This brings us to the question of how we measure running time. For book-keeping, we assume that</p>
<blockquote>
<p><strong>Basic assumption:</strong> If $u$ is a node at level $t+1$ and $z$ is any node at level $\leq t$ whose output is an input to $u$, then computing $\frac{\partial u}{\partial z}$ takes unit time on our computer.</p>
</blockquote>
<h3 id="naive-feedforward-algorithm-not-efficient">Naive feedforward algorithm (not efficient!)</h3>
<p>It is useful to first point out the naive quadratic time algorithm implied by the chain rule. Most authors skip this trivial version, which we think is analogous to teaching sorting using only quicksort, and skipping over the less efficient bubblesort.</p>
<p>The naive algorithm is to compute $\partial u_i/\partial u_j$ for every pair of nodes where $u_i$ is at a higher level than $u_j$. Of course, among these $V^2$ values (where $V$ is the number of nodes) are also the desired ${\partial f}/{\partial u_i}$ for all $i$ since $f$ is itself the value of the output node.</p>
<p>This computation can be done in feedforward fashion. If these values have been obtained for every $u_j$ at levels up to and including level $t$, then one can express (by inspecting the multivariate chain rule) the value $\partial u_{\ell}/\partial u_j$ for some $u_{\ell}$ at level $t+1$ as a weighted combination of the values $\partial u_{i}/\partial u_j$ for each $u_i$ that is a direct input to $u_{\ell}$. This description shows that the amount of computation for a fixed $j$ is proportional to the number of edges $E$. This amount of work happens for all $V$ values of $j$, letting us conclude that the total work in the algorithm is $O(VE)$.</p>
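<p>A minimal sketch of this naive version on a toy graph (the graph encoding and the table of local derivatives are illustrative choices, not a general library):</p>
<pre><code>import numpy as np

# Nodes are numbered in topological order; node k computes a function of its
# direct inputs. Tiny example: u0, u1 are leaves; u2 = u0 * u1; u3 = u2 + u0 = f.
values = {0: 2.0, 1: 3.0}
values[2] = values[0] * values[1]
values[3] = values[2] + values[0]

# local[k] maps each direct input i of node k to the local derivative d(u_k)/d(u_i).
local = {2: {0: values[1], 1: values[0]},   # d(u0*u1)/du0 = u1, d(u0*u1)/du1 = u0
         3: {2: 1.0, 0: 1.0}}

n = 4
# deriv[l][j] holds d(u_l)/d(u_j); the naive algorithm fills the whole n x n table.
deriv = np.zeros((n, n))
for j in range(n):
    deriv[j][j] = 1.0
    for l in range(j + 1, n):
        deriv[l][j] = sum(w * deriv[i][j] for i, w in local.get(l, {}).items())

print(deriv[3])   # gradient of the output u3 w.r.t. every node: [4. 2. 1. 1.]
</code></pre>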
<h2 id="backpropagation-linear-time">Backpropagation (Linear Time)</h2>
<p>The more efficient backpropagation, as the name suggests, computes the partial derivatives in the reverse direction. Messages are passed in one wave backwards from higher number layers to lower number layers. (Some presentations of the algorithm describe it as dynamic programming.)</p>
<blockquote>
<p><strong>Messaging protocol:</strong>
The node $u$ receives a message along each outgoing edge from the node at the other end of that edge. It sums these messages to get a number $S$ (if $u$ is the output of the entire net, then define $S=1$) and then it sends the following message to any node $z$ adjacent to it at a lower level:
<script type="math/tex">S \cdot \frac{\partial u}{\partial z}</script></p>
</blockquote>
<p>Clearly, the amount of work done by each node is proportional to its degree, and thus overall work is the sum of the node degrees. Summing all node degrees counts each edge twice, and thus the overall work is
$O(\text{Network Size})$.</p>
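<p>Here is the same toy graph as above processed by the backward message-passing protocol (again an illustrative sketch): each node accumulates its incoming messages into $S$ and forwards $S \cdot \partial u/\partial z$ to each lower node $z$, in a single backward sweep.</p>
<pre><code># Same tiny graph: u0, u1 are leaves; u2 = u0 * u1; u3 = u2 + u0 is the output f.
values = {0: 2.0, 1: 3.0}
values[2] = values[0] * values[1]
values[3] = values[2] + values[0]
local = {2: {0: values[1], 1: values[0]},
         3: {2: 1.0, 0: 1.0}}

n = 4
S = [0.0] * n
S[n - 1] = 1.0                      # the output node starts with S = 1
for u in range(n - 1, -1, -1):      # sweep from the output back to the leaves
    for z, dudz in local.get(u, {}).items():
        S[z] += S[u] * dudz         # message sent from u to the lower node z

print(S)   # [4.0, 2.0, 1.0, 1.0] -- each S[z] equals df/du_z, computed in linear time
</code></pre>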
<p>To prove correctness, we prove the following:</p>
<blockquote>
<p><strong>Main Claim</strong>: At each node $z$, the value $S$ is exactly ${\partial f}/{\partial z}$.</p>
</blockquote>
<p><em>Base Case</em>: At the output layer this is true, since ${\partial f}/{\partial f} =1$.</p>
<p><em>Inductive case</em>: Suppose the claim is true for layers $t+1$ and higher, and suppose $z$ is at layer $t$, with outgoing edges going to some nodes $u_1, u_2, \ldots, u_m$ at levels $t+1$ or higher. By the inductive hypothesis, node $z$ indeed receives $ \frac{\partial f}{\partial u_j}\times \frac{\partial u_j}{\partial z}$ from each of the $u_j$. Thus by the chain rule,
<script type="math/tex">S= \sum_{i =1}^m \frac{\partial f}{\partial u_i}\frac{\partial u_i}{\partial z}=\frac{\partial f}{\partial z}.</script>
This completes the induction and proves the Main Claim.</p>
<h2 id="auto-differentiation">Auto-differentiation</h2>
<p>Since the exposition above used almost no details about the network and the operations that the nodes perform, it extends to every computation that can be organized as an acyclic graph in which each node computes a differentiable function of its incoming neighbors. This observation underlies many auto-differentiation packages such as <a href="https://github.com/HIPS/autograd">autograd</a> or <a href="https://www.tensorflow.org/">tensorflow</a>: they allow computing the gradient of the output of such a computation with respect to the network parameters.</p>
<p>We first observe that Claim 1 continues to hold in this very general setting. This is because, without loss of generality, we can view the parameters associated with the edges as also sitting on the nodes (actually, leaf nodes). This can be done via a simple transformation to the network; for a single node it is shown in the picture below, and one would continue this transformation in the rest of the network feeding into $u_1, u_2, \ldots$ from below.</p>
<div style="text-align:center;">
<img style="width:800px;" src="http://ai.stanford.edu/~tengyuma/forblog/change_view" />
</div>
<p>Then, we can use the messaging protocol to compute the derivatives with respect to the nodes, as long as the local partial derivative can be computed efficiently. We note that the algorithm can be implemented in a fairly modular manner: For every node $u$, it suffices to specify (a) how it depends on the incoming nodes, say, $z_1,\dots, z_n$ and (b) how to compute the partial derivative times $S$, that is, $S \cdot \frac{\partial u}{\partial z_j}$.</p>
<p><em>Extension to vector messages</em>: In fact (b) can be done efficiently in more general settings where we allow the output of each node in the network to be a vector (or even a matrix/tensor) instead of only a real number. Here we need to replace $\frac{\partial u}{\partial z_j}\cdot S$ by $\frac{\partial u}{\partial z_j}[S]$, which denotes the result of applying the operator $\frac{\partial u}{\partial z_j}$ to $S$. We note that to be consistent with the convention in the usual exposition of backpropagation, when $y\in \mathbb{R}^{p}$ is a function of $x\in \mathbb{R}^q$, we use $\frac{\partial y}{\partial x}$ to denote the $q\times p$ dimensional matrix with $\partial y_j/\partial x_i$ as the $(i,j)$-th entry. Readers might notice that this is the transpose of the usual Jacobian matrix defined in mathematics. Thus $\frac{\partial y}{\partial x}$ is an operator that maps $\mathbb{R}^p$ to $\mathbb{R}^q$, and we can verify that $S$ has the same dimension as $u$ and $\frac{\partial u}{\partial z_j}[S]$ has the same dimension as $z_j$.</p>
<p>For example, as illustrated below, suppose the node $U\in \mathbb{R}^{d_1\times d_3}$ is the product $U = WZ$ of two matrices $W\in \mathbb{R}^{d_1\times d_2}$ and $Z\in \mathbb{R}^{d_2\times d_3}$. Then $\partial U/\partial Z$ is a linear operator that maps $\mathbb{R}^{d_1\times d_3}$ to $\mathbb{R}^{d_2\times d_3}$, which naively requires a matrix representation of dimension $d_2d_3\times d_1d_3$. However, the computation (b) can be done efficiently because
<script type="math/tex">\frac{\partial U}{\partial Z}[S]= W^{\top}S.</script></p>
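<p>A quick numerical check of this identity using an autodiff package (PyTorch here, purely as an illustration; the dimensions are arbitrary):</p>
<pre><code>import torch

d1, d2, d3 = 4, 5, 6
W = torch.randn(d1, d2)
Z = torch.randn(d2, d3, requires_grad=True)
S = torch.randn(d1, d3)                 # the incoming message S has the same shape as U

U = W @ Z                               # the node's forward computation
U.backward(S)                           # vector-Jacobian product with respect to Z

print(torch.allclose(Z.grad, W.t() @ S))   # True: the message sent down to Z is W^T S
</code></pre>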
<p>Such vector operations can also be implemented efficiently using today’s GPUs.</p>
<div style="text-align:center;">
<img style="width:200px;" src="http://ai.stanford.edu/~tengyuma/forblog/mult.jpg" />
</div>
<h2 id="notable-extensions">Notable Extensions</h2>
<p>1) <em>Allowing weight tying.</em> In many neural architectures, the designer wants to force many network units such as edges or nodes to share the same parameter. For example, in <a href="https://en.wikipedia.org/wiki/Convolutional_neural_network"><em>convolutional neural nets</em></a>, the same filter has to be applied all over the image, which implies reusing the same parameter for a large set of edges between the two layers.</p>
<p>For simplicity, suppose two parameters $a$ and $b$ are supposed to share the same value. This is equivalent to adding a new node $u$ and connecting $u$ to both $a$ and $b$ with the operation $a = u$ and $b=u$. Thus, by chain rule, <script type="math/tex">\frac{\partial f}{\partial u} = \frac{\partial f}{\partial a}\cdot \frac{\partial a}{\partial u}+\frac{\partial f}{\partial b}\cdot \frac{\partial b}{\partial u} = \frac{\partial f}{\partial a}+ \frac{\partial f}{\partial b}.</script> Hence, equivalently, the gradient with respect to a shared parameter is the sum of the gradients with respect to individual occurrences.</p>
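<p>A tiny numerical check of this fact (illustrative, using an autodiff package and an arbitrary differentiable function of the two occurrences):</p>
<pre><code>import torch

def f(a, b):
    return a * a * b + 3.0 * b          # any differentiable function of the two occurrences

# Tied version: both occurrences are the same variable u.
u = torch.tensor(1.5, requires_grad=True)
f(u, u).backward()
tied_grad = u.grad.item()

# Untied version: gradients with respect to the individual occurrences, evaluated at
# the same point, should sum to the tied gradient.
a = torch.tensor(1.5, requires_grad=True)
b = torch.tensor(1.5, requires_grad=True)
f(a, b).backward()
print(abs(tied_grad - (a.grad.item() + b.grad.item())) < 1e-6)   # True
</code></pre>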
<p>2) <em>Backpropagation on networks with loops.</em> The above exposition assumed the network is acyclic. Many cutting-edge applications such as machine translation and language understanding use networks with directed loops (e.g., recurrent neural networks). These architectures —all examples of the “differentiable computing” paradigm below—can get complicated and may involve operations on a separate memory as well as mechanisms to shift attention to different parts of data and memory.</p>
<p>Networks with loops are trained using gradient descent as well, using <a href="https://en.wikipedia.org/wiki/Backpropagation_through_time">back-propagation through time</a>, which consists of expanding the network through a finite number of time steps into an acyclic graph, with replicated copies of the same
network. These replicas share the weights (weight tying!) so the gradient can be computed. In practice an issue may arise with <a href="https://en.wikipedia.org/wiki/Vanishing_gradient_problem">exploding or vanishing gradients</a> which impact convergence. Such issues can be carefully addressed in practice by clipping the gradient or re-parameterization techniques such as <a href="https://en.wikipedia.org/wiki/Long_short-term_memory">long short-term memory</a>.</p>
<p>The fact that the gradient can be computed efficiently for such general networks with loops has motivated neural net models with memory or even data structures (see for example <a href="https://en.wikipedia.org/wiki/Neural_Turing_machine">neural Turing machines</a> and <a href="https://en.wikipedia.org/wiki/Differentiable_neural_computer">differentiable neural computer</a>). Using gradient descent, one can optimize over a family of parameterized networks with loops to find the best one that solves a certain computational task (on the training examples). The limits of these ideas are still being explored.</p>
<p>3) <em>Hessian-vector product in linear time.</em> It is possible to generalize backprop to enable 2nd order optimization in “near-linear” time, not just gradient descent,
as shown in recent independent manuscripts of <a href="https://arxiv.org/pdf/1611.00756.pdf">Carmon et al.</a> and <a href="https://arxiv.org/pdf/1611.01146.pdf">Agarwal et al.</a> (NB: Tengyu is a coauthor on this one.). One essential step is to
compute the product of the <a href="https://en.wikipedia.org/wiki/Hessian_matrix">Hessian matrix</a> and a vector, for which <a href="http://www.bcl.hamilton.ie/~barak/papers/nc-hessian.pdf">Pearlmutter’93</a> gave an efficient algorithm. Here we show how to do this in $O(\mbox{Network size})$ using the ideas above. We need a slightly stronger version of the back-propagation result than the one in the previous subsection:</p>
<blockquote>
<p><strong>Claim (informal):</strong> Suppose an acyclic network with $V$ nodes and $E$ edges has output $f$ and leaves $z_1,\dots, z_m$. Then there exists a network of size $O(V+E)$ that has $z_1,\dots, z_m$ as input nodes and $\frac{\partial f}{\partial z_1},\dots, \frac{\partial f}{\partial z_m}$ as output nodes.</p>
</blockquote>
<p>The proof of the Claim follows in straightforward fashion from implementing the message passing protocol as an acyclic circuit.</p>
<p>Next we show how to compute $\nabla^2 f(z)\cdot v$ where $v$ is a given fixed vector. Let $g(z)= \langle \nabla f(z),v\rangle$ be a function from $\mathbb{R}^d\rightarrow \mathbb{R}$. Then by the Claim above, $g(z)$ can be computed by a network of size $O(V+E)$. Now applying the Claim again to $g(z)$, we obtain that $\nabla g(z)$ can also be computed by a network of size $O(V+E)$.</p>
<p>Note that by construction,
<script type="math/tex">\nabla g(z) = \nabla^2 f(z)\cdot v.</script>
Hence we have computed the Hessian vector product in network size time.</p>
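<p>Here is a minimal sketch of this double-backprop recipe using an autodiff package (illustrative; the scalar function $f$ is arbitrary, and the explicit Hessian is formed only to check the answer in a tiny dimension):</p>
<pre><code>import torch

d = 5
z = torch.randn(d, requires_grad=True)
v = torch.randn(d)

def f(z):
    return (z ** 3).sum() + (z[0] * z[1]) ** 2    # an arbitrary smooth scalar function

# First backward pass: build g(z) = &lt;grad f(z), v&gt; as a differentiable expression.
grad_f = torch.autograd.grad(f(z), z, create_graph=True)[0]
g = (grad_f * v).sum()

# Second backward pass: grad g(z) equals the Hessian-vector product H(z) v.
hvp = torch.autograd.grad(g, z)[0]

# Check against the explicitly formed Hessian (only feasible in tiny dimensions).
H = torch.autograd.functional.hessian(f, z.detach())
print(torch.allclose(hvp, H @ v, atol=1e-5))      # True
</code></pre>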
<h2 id="thats-all">That’s all!</h2>
<p>Please write your comments on this exposition and whether it can be improved.</p>
Tue, 20 Dec 2016 09:00:00 -0800
http://offconvex.github.io/2016/12/20/backprop/