Off the convex pathAlgorithms off the convex path.
http://offconvex.github.io/
Generalization and Equilibrium in Generative Adversarial Networks (GANs)
<p>The <a href="http://www.offconvex.org/2017/03/15/GANs/">previous post</a> described Generative Adversarial Networks (GANs), a technique for training generative models for image distributions (and other complicated distributions) via a 2-party game between a generator deep net and a discriminator deep net. This post describes <a href="https://arxiv.org/abs/1703.00573">my new paper</a> with Rong Ge, Yingyu Liang, Tengyu Ma, and Yi Zhang. We address some fundamental issues about generalization in GANs that have been debated since the beginning; e.g., in what sense is the learnt distribution close to the target distribution, and what kind of equilibrium exists between generator and discriminator.</p>
<p>The usual analysis of GANs, sketched in the previous post, assumes “sufficiently large number of samples and sufficiently large discriminator nets” to conclude that at the end of training the learnt distribution should be close to the target distribution. Our new analysis, which accounts for the finite capacity of the discriminator net, calls this into question.</p>
<p>Readers looking for new implementation ideas can skip ahead to the section below on our <em>MIX + GAN</em> protocol. It takes existing GAN code as a black box and (by adding extra capacity and corresponding training time) often improves the learnt distribution in qualitative and quantitative measures. Our testing suggests that it works well out of the box.</p>
<p><strong>Notation</strong> Assume images are represented as vectors in $\Re^d$. Typically $d$ would be $1000$ or much higher. The <em>capacity</em> of the discriminator, namely, the number of trainable parameters, is denoted $n$. The distribution on all real-life images is denoted $P_{real}$. We assume that the number of distinct images in
$P_{real}$ —regardless of how one defines “distinct”—is enormous compared to all these parameters.</p>
<p>Recall that the discriminator $D$ is trained to distinguish between samples from $P_{real}$ and samples from the generator’s distribution
$P_{synth}$. This can be formalized using different measures (leading to different GANs objectives). For simplicity our exposition here uses the <em>distinguishing probability</em>, which is the measure used in the <a href="https://arxiv.org/abs/1701.07875">Wasserstein GAN</a> objective:</p>
<script type="math/tex; mode=display">|~E_{x \in P_{real}}[D(x)] - E_{x \in P_{synth}}[D(x)]|\qquad (1).</script>
<p>(Readers with a background in theoretical CS and cryptography will be reminded of similar definitions in the theory of pseudorandom generators.)</p>
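<p>To make expression (1) concrete, here is a minimal sketch of how one might estimate the distinguishing probability from samples. Everything specific in it is an assumption made for illustration: the "real" and "synthetic" distributions are Gaussians, and the discriminator is a fixed logistic unit rather than a trained deep net.</p>

```python
import numpy as np

def distinguishing_gap(D, real_samples, synth_samples):
    """Monte Carlo estimate of |E_real[D(x)] - E_synth[D(x)]|."""
    return abs(np.mean(D(real_samples)) - np.mean(D(synth_samples)))

rng = np.random.default_rng(0)
d = 5
# Hypothetical stand-ins: "real" and "synthetic" data are shifted Gaussians.
real = rng.normal(loc=0.0, size=(20000, d))
synth = rng.normal(loc=0.5, size=(20000, d))

# Hypothetical discriminator: a fixed logistic unit mapping R^d to [0, 1].
w = -np.ones(d)

def D(x):
    return 1.0 / (1.0 + np.exp(-(x @ w)))

# Gap when the two distributions differ, and when both samples come from P_real.
gap_different = distinguishing_gap(D, real, synth)
gap_same = distinguishing_gap(D, real, rng.normal(loc=0.0, size=(20000, d)))
```

<p>The second estimate is nonzero only because of sampling noise, which is the quantity that the $\epsilon$ in the results below controls.</p>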
<h2 id="finite-discriminators-have-limited-power">Finite Discriminators have limited power</h2>
<p>The following simple fact shows why ignoring the discriminator’s capacity constraint can lead to grossly incorrect intuition.
(The constant $C$ is fairly small and explained in the paper.)</p>
<blockquote>
<p><strong>Theorem 1</strong> Suppose the discriminator has capacity $n$. Then expression (1) is less than $\epsilon$ when $P_{synth}$ is the following: the uniform distribution on a random sample of $C n\log n/\epsilon^2$ images from $P_{real}$.</p>
</blockquote>
<p>Note that this theorem is not a formalization of the usual failure mode discussed in GANs literature, whereby the generator simply memorizes the training images. The theorem still applies if we allow the discriminator to use a large set of held out images from $P_{real}$, which are completely different from the images in $P_{synth}$. It also applies if the training set of images is much larger than $C n\log n/\epsilon^2$ images. Furthermore, common measures of diversity/novelty used in research papers (e.g., pick a random image from $P_{synth}$ and check the “distance” to the nearest neighbor among the training set) are not guaranteed to get around the problem raised by Theorem 1.</p>
<p>Since $Cn\log n/\epsilon^2$ is rather small, this theorem says that a finite-capacity discriminator is unable to even enforce that $P_{synth}$ has large <em>diversity</em>, let alone enforce that $P_{synth}\approx P_{real}$. The theorem does not imply that existing GANs do not <em>in practice</em> generate diverse distributions; merely that the current analyses give no reason to believe that they do so.</p>
<p>The proof of the Theorem is a standard sampling argument from learning theory: take an $\epsilon$-net in the continuum of all deep nets of capacity $n$ and a fixed architecture, and do a union bound. Please see the paper for details. (Aside: the “$\epsilon^2$” term in Theorem 1 arises from this argument, and is ubiquitous in ML theory.)</p>
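<p>The sampling argument can be illustrated numerically. The sketch below is a toy under stated assumptions, not the setting of the theorem: $P_{real}$ is a standard Gaussian, and a finite family of random logistic units plays the role of the $\epsilon$-net of discriminators. It measures the largest distinguishing gap over the family when $P_{synth}$ is uniform on just $m$ samples from $P_{real}$, comparing against a large held-out sample.</p>

```python
import numpy as np

rng = np.random.default_rng(0)
d, num_discriminators = 20, 100

# Hypothetical finite family of discriminators (a stand-in for an epsilon-net):
# each is a logistic unit x -> sigmoid(w . x) with a random weight vector w.
W = rng.normal(size=(num_discriminators, d))

def outputs(x):
    """Outputs of every discriminator on every row of x; shape (len(x), num_discriminators)."""
    return 1.0 / (1.0 + np.exp(-(x @ W.T)))

# Reference estimate of E_{P_real}[D(x)] from a large held-out sample.
held_out = rng.normal(size=(50000, d))
real_means = outputs(held_out).mean(axis=0)

def worst_case_gap(m):
    """Largest gap over the family when P_synth is uniform on m fresh samples from P_real."""
    support = rng.normal(size=(m, d))
    return np.max(np.abs(outputs(support).mean(axis=0) - real_means))

gaps = {m: worst_case_gap(m) for m in (10, 100, 1000, 10000)}
```

<p>In this toy the worst-case gap shrinks as $m$ grows even though $P_{synth}$ is supported on only $m$ points, which is the phenomenon the theorem describes.</p>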
<p>Motivated by this theorem, we argue in the paper that the correct way to think about generalization for GANs is not the usual distance functions between distributions such as Jensen-Shannon or Wasserstein, but a new distance we define called <em>Neural net distance</em>. The neural net distance measures the ability of finite-capacity deep nets to distinguish the distributions. It can be small even when the other distances are large (as illustrated in the above theorem).</p>
<h3 id="corollary-larger-training-sets-have-limited-utility">Corollary: Larger training sets have limited utility</h3>
<p>In fact Theorem 1 has the following rephrasing. Suppose we have a very large training set of images. If the discriminator has capacity $n$, then it suffices to take a subsample of size $C n\log n/\epsilon^2$ from this training set, and we are guaranteed that GANs training using this subsample is capable of achieving a training objective that is within $\epsilon$ of the best achieved with the full training set. (Any more samples can only improve the training objective by at most $\epsilon$.)</p>
<h2 id="existence-of-equilibrium">Existence of Equilibrium</h2>
<p>Let’s recall the objective used in GAN (for simplicity, we again stick with the Wasserstein GAN):</p>
<p><script type="math/tex">\min_{G} \max_{D}~~E_{x\sim P_{real}}[D(x)] - E_{h}[D(G(h))] \qquad (2)</script>
where $G$ is the generator net, and $P_{synth}$ is the distribution of $G(h)$ where $h$ is a random seed. Researchers have noted that this is implicitly a $2$-person game and it may not have an equilibrium; e.g., see the discussion around Figure 22 in <a href="https://arxiv.org/pdf/1701.00160.pdf">Goodfellow’s survey</a>. An <em>equilibrium</em> corresponds to a $G$ and a $D$ such that the pair are still a solution if we switch the order of min and max in (2). (That is, $G$ doesn’t have incentive to switch in response to $D$, and vice versa.) Lack of equilibrium may be a cause of oscillatory behavior observed in training.</p>
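<p>The role of the order of min and max is easiest to see in a tiny matrix game. The sketch below uses rock/paper/scissors as a hypothetical stand-in for the GAN game (it is not the GAN objective): with pure strategies the two orders give different values, so no pure equilibrium exists, whereas the uniform mixture leaves neither player an incentive to switch.</p>

```python
import numpy as np

# Payoff to the maximizing (row) player in rock/paper/scissors;
# the column player minimizes the same quantity.
A = np.array([[ 0, -1,  1],
              [ 1,  0, -1],
              [-1,  1,  0]], dtype=float)

# Pure strategies: the value depends on who commits first.
max_min = A.min(axis=1).max()   # row player commits first
min_max = A.max(axis=0).min()   # column player commits first
pure_gap = min_max - max_min    # zero would mean a pure equilibrium exists

# Mixed strategies: against the uniform mixture every pure response earns 0,
# so the pair (uniform, uniform) is an equilibrium with value 0.
uniform = np.ones(3) / 3
row_payoffs_vs_uniform = A @ uniform   # each pure row against the uniform column mix
col_payoffs_vs_uniform = uniform @ A   # each pure column against the uniform row mix
```

<p>An $\epsilon$-approximate equilibrium relaxes this picture by allowing the two orders to differ by at most $\epsilon$.</p>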
<p>But ideally we wish to show something stronger than mere <em>existence</em> of equilibrium: we wish to exhibit an equilibrium where the generator <em>wins</em>, with the objective above at zero or close to zero (in other words, discriminator is unable to distinguish between the two distributions).</p>
<p>We will prove existence of an $\epsilon$-<em>approximate equilibrium</em>, whereby switching the order of $G, D$ affects the expression by at most $\epsilon$. (That is, $G$ has only limited incentive to switch in response to $D$ and vice versa.) Naively one would imagine that proving such a result involves some insight into the distribution $P_{real}$ but surprisingly none is needed.</p>
<blockquote>
<p><strong>Theorem 2</strong> If a generator net of capacity $T$ is able to generate a Gaussian distribution in $\Re^d$, then there exists an $\epsilon$-approximate equilibrium in the game where the generator has capacity $O(n T\log n/\epsilon^2 )$.</p>
</blockquote>
<p><em>Proof sketch:</em> A classical result in nonparametric statistics states that $P_{real}$ can be well-approximated by an <em>infinite</em> mixture of standard Gaussians. Now take a sample of size $O(n\log n/\epsilon^2)$ from this infinite mixture, and let $G$ be a uniform mixture on this finite sample of Gaussians. By an argument similar to Theorem 1, the distribution output by $G$ will be indistinguishable from $P_{real}$ by every deep net of capacity $n$. Finally, fold in this mixture of $O(n\log n/\epsilon^2)$ Gaussians into a single generator by using a small “selector” circuit that selects between these with the correct probability.</p>
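<p>The last step of the proof sketch, folding a finite mixture into one generator, can be pictured as follows. This is only an illustration under assumptions of my choosing: each Gaussian has identity covariance and its own mean, the means (<code>centers</code>) are made up, and the "selector" is implemented by reading one coordinate of the random seed.</p>

```python
import numpy as np

def make_mixture_generator(centers):
    """Fold a uniform mixture of unit-covariance Gaussians into one generator G(h)."""
    k, d = centers.shape

    def G(h):
        # h is the random seed: column 0 (uniform in [0, 1)) drives the selector,
        # the remaining d columns (standard normal) drive the selected Gaussian.
        selector = np.minimum((h[:, 0] * k).astype(int), k - 1)
        return centers[selector] + h[:, 1:]

    return G

rng = np.random.default_rng(0)
k, d, num_samples = 8, 2, 50000
centers = rng.normal(scale=3.0, size=(k, d))   # hypothetical sampled Gaussian means
G = make_mixture_generator(centers)

h = np.column_stack([rng.uniform(size=num_samples),
                     rng.normal(size=(num_samples, d))])
samples = G(h)
```

<p>The selector costs very little extra capacity, which is why the generator capacity in Theorem 2 is essentially the number of mixture components times $T$.</p>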
<p>This theorem only shows <em>existence</em> of a particular equilibrium. What a GAN may actually find in practice using backpropagation is not addressed.</p>
<p>Finally, if we are interested in objectives other than Wasserstein GAN, then a similar proof can show the existence of an
$\epsilon$-approximate <em>mixed</em> equilibrium, namely, where the discriminator and generator are themselves small mixtures of
deep nets.</p>
<p><em>Aside:</em> The sampling idea in this proof goes back to <a href="http://dl.acm.org/citation.cfm?id=195447">Lipton and Young 1994</a>.
Similar ideas have also appeared in study of pseudorandomness (see <a href="http://ieeexplore.ieee.org/document/5231258/citations">Trevisan et al 2009</a>) and model criticism (see <a href="http://www.jmlr.org/papers/volume13/gretton12a/gretton12a.pdf">Gretton et al 2012</a>.)</p>
<h2 id="mix--gan-protocol">MIX + GAN protocol</h2>
<p>Our theory shows that using a mixture of (not too many) generators and discriminators guarantees existence of approximate equilibrium. This suggests that GANs training may be better and more stable if we replace the simple generator and discriminator with mixtures of generators and discriminators.</p>
<p>Of course, it is impractical to use very large mixtures, so we propose <strong>MIX + GAN</strong>: use a mixture of $k$ components, where $k$ is as large as allowed by size of GPU memory. Namely, train a mixture of $k$ generators $\{G_{u_i}, i\in [k]\}$ and $k$ discriminators $\{D_{v_i}, i\in [k]\}$. All components of the mixture share the same network architecture but have their own trainable parameters. Maintaining a mixture means of course maintaining a weight $w_{u_i}$ for the generator $G_{u_i}$ which corresponds to the probability of selecting the output of $G_{u_i}$. These weights are also updated via backpropagation. This heuristic can be combined with existing
methods like DC-GAN, W-GAN etc., giving us new training methods MIX + DC-GAN, MIX + W-GAN etc.</p>
<p>Some other experimental details: store the logarithms of the mixture probabilities and update them using <a href="https://users.soe.ucsc.edu/~manfred/pubs/J36.pdf">exponentiated gradient</a>. Use an entropy regularizer to prevent the mixture from collapsing to a single component. All of these are theoretically justified if $k$ were very large, and are only heuristic when $k$ is as small as $5$.</p>
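<p>As a rough sketch of these details (not the code used in our experiments), the snippet below keeps mixture weights in log space, applies an exponentiated-gradient step, and optionally adds an entropy regularizer. The function names, learning rate, regularization strength, and the fixed gradient vector are all hypothetical choices for illustration.</p>

```python
import numpy as np

def normalize_log_weights(log_w):
    """Shift log-weights so that the weights exp(log_w) sum to 1."""
    m = log_w.max()
    return log_w - (m + np.log(np.sum(np.exp(log_w - m))))

def entropy(log_w):
    """Entropy of the mixture distribution given its (normalized) log-weights."""
    return -np.sum(np.exp(log_w) * log_w)

def exponentiated_gradient_step(log_w, grad_loss, lr=0.1, ent_coef=0.0):
    """One exponentiated-gradient step on loss(w) - ent_coef * entropy(w)."""
    # The derivative of -entropy(w) with respect to w_i is log(w_i) + 1.
    grad = grad_loss + ent_coef * (log_w + 1.0)
    # Multiplicative update: w_i is scaled by exp(-lr * grad_i), done in log space.
    return normalize_log_weights(log_w - lr * grad)

k = 5
log_w = normalize_log_weights(np.zeros(k))          # start from the uniform mixture
grad_loss = np.array([-1.0, 0.0, 0.0, 0.0, 0.0])    # hypothetical: component 0 always looks best

log_w_plain, log_w_reg = log_w.copy(), log_w.copy()
for _ in range(200):
    log_w_plain = exponentiated_gradient_step(log_w_plain, grad_loss)
    log_w_reg = exponentiated_gradient_step(log_w_reg, grad_loss, ent_coef=0.5)
```

<p>In this toy the unregularized weights collapse onto component 0, while the regularized weights settle at a mixture that still gives every component noticeable probability.</p>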
<p>We show that MIX+GAN can improve performance qualitatively (i.e., the images look better) and also quantitatively using popular measures such as Inception score.</p>
<div style="text-align:center;">
<img style="width:375px;" src="http://www.cs.princeton.edu/~arora/img/celeA_dcgan.png" /> $\quad$ <img style="width:375px;" src="http://www.cs.princeton.edu/~arora/img/celeA_mixgan.png" />
</div>
<p>Note that using a mixture increases the capacity of the model by a factor $k$, so it may not be entirely fair to compare the performance of MIX + X with X. On the other hand, in general it is not easy to get substantial performance benefit from increasing deep net capacity (in fact obvious ways of adding capacity that we tried actually reduced performance) whereas here the benefit happens out of the box.</p>
<p>Note that a mixture of generators or discriminators has been used in several recent works (cited in our paper), but we are not aware of any attempts to use a trainable mixture as above.</p>
<h2 id="take-away-lessons">Take-Away Lessons</h2>
<p>Complete understanding of GANs is challenging since we cannot even fully analyse simple backpropagation, let alone backpropagation combined with game-theoretic complications.</p>
<p>We therefore set aside issues of algorithmic convergence and focused on generalization and equilibrium, which concern the maximum value of the objective. Our analysis suggests the following:</p>
<p>(a) Current GANs training uses a finite capacity deep net to distinguish between synthetic and real distributions. This training criterion by itself seems insufficient to ensure even good <em>diversity</em> in the synthetic distribution, let alone that it is actually very close to $P_{real}$. (Theorem 1) A first step to fix this would be to focus on ways to ensure higher diversity, which is a necessary step towards ensuring $P_{synth} \approx P_{real}$.</p>
<p>(b) Our results seem to pose a conundrum about the GANs idea which I personally have not been able to resolve. Usually, we believe that adding capacity to the generator allows it to gain representation power to model more fine-grained facts about the world and thus produce more realistic and diverse distributions. The downside to adding capacity is <em>overfitting</em>, which can be mitigated using more training images. Thus one imagines that the ideal configuration is:</p>
<blockquote>
<p>Number of training images > Generator capacity > Discriminator capacity.</p>
</blockquote>
<p>Theorem 1 suggests that if the discriminator has capacity $n$ then it seems to derive very little benefit (at least in terms of the training objective) from a training set of more than $C (n\log n)/\epsilon^2$ images. Furthermore, there exist equilibria where the generator’s distribution is not too diverse.</p>
<p>So how can we change GANs training so that it ensures that $P_{synth}$ has high diversity? Some possibilities are:
(a) Cap the generator capacity to be much below the discriminator capacity. This might work but I don’t see a mathematical reason why. It certainly flies in the face of the usual intuition that, so long as the training dataset is large enough, more capacity allows generators to produce more realistic images. (b) High diversity results from some as-yet unknown property of the backpropagation algorithm. (c) Change the GANs setup in some other way.</p>
<p>At the very least our paper suggests that an explanation for good performance in GANs must draw upon some delicate interplay of the power of generator vs discriminator and the backpropagation algorithm. This fact was overlooked in previous analyses which assumed discriminators of infinite capacity.</p>
<p>(<em>I thank Moritz Hardt, Kunal Talwar, and Luca Trevisan for their comments and help with references.</em>)</p>
Thu, 30 Mar 2017 11:00:00 -0700
http://offconvex.github.io/2017/03/30/GANs2/
Generative Adversarial Networks (GANs), Some Open Questions
<p>Since the ability to generate “realistic-looking” data may be a step towards understanding its structure and exploiting it, generative models are an important component of unsupervised learning, which has been a frequent theme on this blog. Today’s post is about Generative Adversarial Networks (GANs), introduced in 2014 by <a href="https://arxiv.org/abs/1406.2661">Goodfellow et al.</a>, which have quickly become a very popular way to train generative models for complicated real-life data. It involves a game-theoretic tussle between a generator player and a discriminator player, which is very attractive and may be useful in other settings.</p>
<p>This post describes GANs and raises some open questions about them. The next post will describe <a href="https://arxiv.org/abs/1703.00573">our recent paper</a> addressing these questions.</p>
<p>A generative model $G$ can be seen as taking a random seed $h$ (say, a sample from a multivariate Normal distribution) and converting it into an output string $G(h)$ that “looks” like a real datapoint. Such models are popular in classical statistics but the simpler ones like Gaussian Mixtures or Dirichlet Processes seem insufficient for modeling complicated distributions on natural images or natural language. Generative models are also popular in statistical physics, e.g., Ising models and their cousins. These physics models migrated into machine learning and neuroscience in the 1980s and 1990s, which led to a new generative view of neural nets (e.g., Hinton’s <a href="https://en.wikipedia.org/wiki/Restricted_Boltzmann_machine">Restricted Boltzmann Machines</a>) which in turn led to multilayer generative models such as stacked denoising autoencoders and variational autoencoders. At their heart, these are nothing but multilayer neural nets that transform the random seed into an output that looks like a realistic image. The primary differences in the model concern details of training. Here is the obligatory set of generated images (source: <a href="https://openai.com/blog/generative-models/">OpenAI blog</a>)</p>
<div style="text-align:center;">
<img style="height:350px;" src="https://openai.com/assets/research/generative-models/gans-2-6345b04cb02f720a95ea4cb9483e2fd5a5f6e46ec6ea5bbefadf002a010cda82.jpg" />
</div>
<h2 id="gans-the-basic-framework">GANs: The basic framework</h2>
<p>GANs also train a deep net $G$ to produce realistic images, but the new and beautiful twist lies in a novel training procedure.</p>
<p>To understand the new twist let’s first discuss what it could mean for the output to “look” realistic. A classic evaluation for generative models is <a href="https://en.wikipedia.org/wiki/Perplexity"><em>perplexity</em></a>: a measure of the amount of probability it gives to actual images. This requires that the generative model must be accompanied by an algorithm that computes the probability density function for the generated distribution (i.e., given any image, it must output an estimate of the probability that the model outputs this image.) I might do a future blog post discussing pros and cons of the perplexity measure, but today let’s instead dive straight to GANs, which sidestep the need for perplexity computations.</p>
<blockquote>
<p><strong>Idea 1:</strong> Since deep nets are good at recognizing images —e.g., distinguishing pictures of people from pictures of cats—why not let a deep net be the judge of the outputs of a generative model?</p>
</blockquote>
<p>More concretely, let $P_{real}$ be the distribution over real images, and $P_{synth}$ the one output by the model (i.e., the distribution of $G(h)$ when $h$ is a random seed). We could try to train a discriminator deep net $D$ that maps images to numbers in $[0,1]$ and tries to discriminate between these distributions in the following sense: its
expected output $E_{x}[D(x)]$ should be as high as possible when $x$ is drawn from $P_{real}$ and
as low as possible when $x$ is drawn from $P_{synth}$. This training can be done with the <a href="http://www.offconvex.org/2016/12/20/backprop/">usual backpropagation</a>. If the two distributions are identical then of course no such deep net can exist, and so the training will end in failure. If on the other hand we are able to train a good discriminator deep net, one whose average output is noticeably different between real and synthetic samples, then this is proof positive that the two distributions are different. (There is an in-between case, whereby the distributions are different but the discriminator net doesn’t detect a difference. This is going to be important in the story in the next post.) A natural next question is whether the ability to train such a discriminator deep net can help us improve the generative model.</p>
<blockquote>
<p><strong>Idea 2:</strong> If a good discriminator net has been trained, use it to provide “gradient feedback” that improves the generative model.</p>
</blockquote>
<p>Let $G$ denote the Generator net, which means that samples in $P_{synth}$ are obtained by sampling a random Gaussian seed $h$ and computing $G(h)$. The natural goal for the generator is to make $E_{h}[D(G(h))]$ as high as possible, because that means it is doing better at fooling the discriminator $D$. So if we fix $D$ the natural way to improve $G$ is to pick a few random seeds $h$, and slightly adjust the trainable parameters of $G$ to increase this objective. Note that this gradient computation involves backpropagation through the composed net $D(G(\cdot))$.</p>
<p>Of course, if we let the generator improve itself, it also makes sense to then let the discriminator improve itself too, which leads to:</p>
<blockquote>
<p><strong>Idea 3:</strong> Turn the training of the generative model into a game of many moves or alternations.</p>
</blockquote>
<p>Each move for the discriminator consists of taking a few samples from $P_{real}$ and $P_{synth}$ and improving its ability to discriminate between them. Each move for the generator consists of producing a few samples from $P_{synth}$ and updating its parameters so that $E_{h}[D(G(h))]$ goes up a bit.</p>
<p>Notice, the discriminator always uses the generator as a black box —i.e., never examines its internal parameters —whereas the generator needs the discriminator’s parameters to compute its gradient direction. Also, the generator does not ever use real images from $P_{real}$ for its computation. (Though of course it does rely on the real images indirectly since the discriminator is trained using them.)</p>
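<p>The alternation can be sketched in a few lines for a toy problem. The sketch below makes several simplifying assumptions that are not part of the framework: data is one-dimensional, the generator is an affine map of the seed, the discriminator is a single logistic unit, and gradients are written out by hand instead of using backpropagation software. It is meant to show the structure of the moves, not to claim convergence.</p>

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def D(x, w, c):
    """Hypothetical discriminator: a single logistic unit with outputs in [0, 1]."""
    return sigmoid(w * x + c)

def G(h, a, b):
    """Hypothetical generator: an affine map of the random seed h."""
    return a * h + b

def discriminator_step(x_real, x_synth, w, c, lr):
    """Gradient ascent on mean D(x_real) - mean D(x_synth), with the generator fixed."""
    s_r, s_s = D(x_real, w, c), D(x_synth, w, c)
    g_r, g_s = s_r * (1 - s_r), s_s * (1 - s_s)   # sigmoid derivative at each sample
    grad_w = np.mean(g_r * x_real) - np.mean(g_s * x_synth)
    grad_c = np.mean(g_r) - np.mean(g_s)
    return w + lr * grad_w, c + lr * grad_c

def generator_step(h, a, b, w, c, lr):
    """Gradient ascent on mean D(G(h)), with the discriminator fixed."""
    s = D(G(h, a, b), w, c)
    g = s * (1 - s) * w        # chain rule through the composed net D(G(.))
    return a + lr * np.mean(g * h), b + lr * np.mean(g)

rng = np.random.default_rng(0)
a, b, w, c = 1.0, 0.0, 0.0, 0.0
for _ in range(200):
    x_real = rng.normal(loc=3.0, size=64)    # stand-in for samples from P_real
    h = rng.normal(size=64)                  # random seeds
    w, c = discriminator_step(x_real, G(h, a, b), w, c, lr=0.05)
    a, b = generator_step(h, a, b, w, c, lr=0.05)
```

<p>Note that <code>generator_step</code> receives the discriminator’s parameters but never sees <code>x_real</code>, while <code>discriminator_step</code> only sees the generator’s outputs.</p>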
<h2 id="gans-more-details">GANS: More details</h2>
<p>One can fill in the above framework in multiple ways. The most obvious is that the generator could try to maximize $E_{h}[f(D(G(h)))]$ where $f$ is some increasing function. (We call this the <em>measuring function.</em>) This has the effect of giving different importance to different samples. Goodfellow et al. originally used $f(x)=\log (x)$, which, since the derivative of $\log x$ is $1/x$, implicitly gives much more importance to synthetic data $G(h)$ where the discriminator outputs very low values $D(G(h))$. In other words, using $f(x) =\log x$ makes the training more sensitive to instances which the discriminator finds terrible than to instances which the discriminator finds so-so. By contrast, the above sketch implicitly used $f(x) =x$, which gives the same importance to all samples and appears in the recent <a href="https://arxiv.org/abs/1701.07875">Wasserstein GAN</a>.</p>
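<p>A small numerical illustration of this weighting: by the chain rule, each sample’s contribution to the generator’s gradient is scaled by $f'(D(G(h)))$. The discriminator outputs below are made-up numbers chosen only to show the effect.</p>

```python
import numpy as np

# Hypothetical discriminator outputs D(G(h)) on four synthetic samples,
# ranging from "terrible" (0.01) to "quite convincing" (0.9).
d_out = np.array([0.01, 0.1, 0.5, 0.9])

# Per-sample weight on the gradient of D(G(h)) is f'(D(G(h))).
weight_linear = np.ones_like(d_out)   # f(x) = x      gives f'(x) = 1
weight_log = 1.0 / d_out              # f(x) = log x  gives f'(x) = 1/x

# Relative importance of each sample under the two measuring functions.
share_linear = weight_linear / weight_linear.sum()
share_log = weight_log / weight_log.sum()
```

<p>With $f(x)=\log x$ the sample with $D(G(h))=0.01$ carries most of the total weight, whereas with $f(x)=x$ all four samples count equally.</p>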
<p>The discussion thus leads to the following mathematical formulation, where $D, G$ are deep nets with specified architecture and whose number of parameters is fixed in advance by the algorithm designer.</p>
<script type="math/tex; mode=display">\min_{G} \max_{D}~~E_{x\sim P_{real}}[f(D(x))] + E_{h}[f(1-D(G(h)))]. \qquad (1)</script>
<p>There is now a big industry of improving this basic framework using various architectures and training variations, e.g. (a random sample; possibly missing some important ones): <a href="https://arxiv.org/abs/1511.06434v2">DC-GAN</a>, <a href="https://arxiv.org/abs/1612.04357">S-GAN</a>, <a href="https://arxiv.org/abs/1609.04802">SR-GAN</a>, <a href="https://arxiv.org/abs/1606.03657">INFO-GAN</a>, etc.</p>
<p>Usually, the training is continued until the generator wins, meaning the discriminator’s expected output on samples from $P_{real}$ and $P_{synth}$ becomes the same. But a serious practical difficulty is that training in practice is oscillatory, and the above objective is observed to go up and down. This is unlike usual deep net training, where training (at least in cases where it works) steadily improves the objective.</p>
<h2 id="gans-some-open-questions">GANS: Some open questions</h2>
<p>(a) <em>Does an equilibrium exist?</em></p>
<p>Since GAN is a 2-person game, the oscillatory behavior mentioned above is not unexpected. Just as a necessary condition for gradient descent to come to a stop is that the current point is a stationary point (i.e., the gradient is zero), the corresponding situation in a 2-person game is an <em>equilibrium</em>: each player’s move happens to be its optimal response to the other’s move. In other words, switching the order of $\min$ and $\max$ in expression (1) doesn’t change the objective. The GAN formulation above needs a so-called pure equilibrium, which may not exist in general. A simple example is the classic rock/paper/scissors game. Regardless of whether one player plays rock, paper or scissors as a move, the other can counter with a move that beats it. Thus no pure equilibrium exists.</p>
<p>(b) <em>Does an equilibrium exist where the generator wins, i.e. discriminator ends up unable to distinguish the two distributions on finite samples?</em></p>
<p>(c) <em>Suppose the generator wins. What does this say about whether or not</em> $P_{real}$ <em>is close to</em> $P_{synth}$ ?</p>
<p>Question (c) has dogged GANs research from the start. Has the generative model actually learned something meaningful about real life images, or is it somehow memorizing existing images and presenting trivial modifications? (Recall that $G$ is never exposed directly to real images, so any “memorizing” has to happen via the gradient propagated through the discriminator.)</p>
<p>If the generator’s win does indeed say that $P_{real}$ and $P_{synth}$ are close then we think of the GANs training as <em>generalizing.</em> (This is by analogy to the usual notion of generalization for supervised learning.)</p>
<p>In fact, the next post will show that this issue is indeed more subtle than hitherto recognized. But to complete the backstory I will summarize how this issue has been studied so far.</p>
<h2 id="past-efforts-at-understanding-generalization">Past efforts at understanding generalization</h2>
<p>The original paper of Goodfellow et al. introduced an analysis of generalization (adopted since by other researchers) that works when deep nets are trained with “sufficiently high capacity, samples and training time” (to use their phrasing).</p>
<p>For the original objective function with $f(x) =\log x$, if the optimal discriminator is allowed to be any function at all (i.e., not just one computable by a finite capacity neural net) it can be checked that the optimal choice is $D(x) = P_{real}(x)/(P_{real}(x)+P_{synth}(x))$.
Substituting this in the GANs objective, up to linear transformation the maximum value achieved by discriminator turns out to be
equivalent to the <a href="https://en.wikipedia.org/wiki/Jensen%E2%80%93Shannon_divergence">Jensen-Shannon (JS) divergence</a> between $P_{real}$ and $P_{synth}$.
Hence if a generator wins the game against this <em>ideal</em> discriminator on a <em>very large</em> number of samples, then $P_{real}$ and $P_{synth}$ are close in JS divergence, and thus the model has learnt the true distribution.</p>
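<p>This calculation can be checked numerically on a small discrete domain. The two distributions below are arbitrary choices for illustration. With natural logarithms, the objective at the ideal discriminator equals $2\,JS(P_{real},P_{synth}) - 2\log 2$, which is the linear transformation referred to above.</p>

```python
import numpy as np

def kl(p, q):
    """KL divergence between discrete distributions with full support (natural log)."""
    return np.sum(p * np.log(p / q))

def js(p, q):
    """Jensen-Shannon divergence between discrete distributions."""
    m = 0.5 * (p + q)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def objective(D, p_real, p_synth):
    """E_{P_real}[log D(x)] + E_{P_synth}[log(1 - D(x))] on a discrete domain."""
    return np.sum(p_real * np.log(D)) + np.sum(p_synth * np.log(1.0 - D))

# Hypothetical distributions on a 4-point domain.
p_real = np.array([0.1, 0.2, 0.3, 0.4])
p_synth = np.array([0.4, 0.3, 0.2, 0.1])

D_star = p_real / (p_real + p_synth)
value_at_optimum = objective(D_star, p_real, p_synth)
js_from_objective = 0.5 * (value_at_optimum + 2.0 * np.log(2.0))

# Randomly chosen discriminators never beat D_star on this objective.
rng = np.random.default_rng(0)
random_values = [objective(rng.uniform(0.01, 0.99, size=4), p_real, p_synth)
                 for _ in range(1000)]
```

<p>Here <code>js_from_objective</code> agrees with a direct computation of the JS divergence, and no random discriminator exceeds <code>value_at_optimum</code>.</p>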
<p>A similar analysis for <a href="https://arxiv.org/abs/1701.07875">Wasserstein GANs</a> shows that if the generator wins using the Wasserstein objective (i.e., $f(x) =x$) then the two distributions are close in <a href="https://en.wikipedia.org/wiki/Wasserstein_metric">Wasserstein or earth-mover distance</a>.</p>
<p>But we will see in the next post that these analyses can be misleading because in practice, deep nets have (very) finite capacity and
sample size. Thus even if training produces the optimal discriminator, the above analyses can be very off.</p>
<h2 id="further-resources">Further resources</h2>
<p>OpenAI has a <a href="https://openai.com/blog/generative-models/">brief survey of recent approaches</a> to generative models. The <a href="http://www.inference.vc/">inFERENCe blog</a> has many articles on GANs.</p>
<p><a href="https://arxiv.org/pdf/1701.00160.pdf">Goodfellow’s survey</a> is the most authoritative account of this burgeoning field, and gives tons of insight. The text around Figure 22 discusses oscillation and lack of equilibria.
He also discusses how GANs trained on a broad spectrum of images seem to get confused and output images that are realistic at the micro level but nonsensical overall; e.g., an animal with a leg coming out of its head. Clearly this field, despite its promise, has many open questions!</p>
Wed, 15 Mar 2017 06:00:00 -0700
http://offconvex.github.io/2017/03/15/GANs/
Back-propagation, an introduction
<p>Given the sheer number of backpropagation tutorials on the internet, is there really a need for another? One of us (Sanjeev) recently taught backpropagation in <a href="https://www.cs.princeton.edu/courses/archive/fall16/cos402/">undergrad AI</a> and couldn’t find any account he was happy with. So here’s our exposition, together with some history and context, as well as a few advanced notions at the end. This article assumes the reader knows the definitions of gradients and neural networks.</p>
<h2 id="what-is-backpropagation">What is backpropagation?</h2>
<p>It is the basic algorithm in training neural nets, apparently independently rediscovered several times in the 1970-80’s (e.g., see Werbos’ <a href="https://www.researchgate.net/publication/35657389_Beyond_regression_new_tools_for_prediction_and_analysis_in_the_behavioral_sciences">Ph.D. thesis</a> and <a href="http://www.wiley.com/WileyCDA/WileyTitle/productCd-0471598976.html">book</a>, and <a href="http://www.nature.com/nature/journal/v323/n6088/abs/323533a0.html">Rumelhart et al.</a>). Some related ideas existed in control theory in the 1960s. (One reader points out another independent rediscovery, the Baur-Strassen lemma from 1983.)</p>
<p>Backpropagation gives a fast way to compute the sensitivity of the output of a neural network to all of its parameters while keeping the inputs of the network fixed: specifically it computes all partial derivatives ${\partial f}/{\partial w_i}$ where $f$ is the output and $w_i$ is the $i$th parameter. (Here <em>parameters</em> can be edge weights or biases associated with nodes or edges of the network, and the precise details of the node computations —e.g., the precise form of nonlinearity like Sigmoid or RELU— are unimportant.) Doing so gives the <em>gradient</em> $\nabla f$ of $f$ with respect to its network parameters, which allows a <em>gradient descent</em> step in the training: change all parameters simultaneously to move the vector of parameters a small amount in the direction $-\nabla f$.</p>
<p>Note that backpropagation computes the gradient exactly, but properly training neural nets needs many more tricks than just backpropagation. Understanding backpropagation is useful for appreciating some advanced tricks.</p>
<p>The importance of backpropagation derives from its efficiency. Assuming node operations take unit time, the running time is <em>linear</em>, specifically, $O(\text{Network Size}) = O(V + E)$, where $V$ is the number of nodes in the network and $E$ is the number of edges. The only technical ingredient is the chain rule from calculus, but applying it naively would have resulted in quadratic running time, which would be hugely inefficient for networks with millions or even thousands of parameters.</p>
<p>Backpropagation can be efficiently implemented using highly parallel vector operations available in today’s <a href="https://en.wikipedia.org/wiki/General-purpose_computing_on_graphics_processing_units">GPUs (Graphics Processing Units)</a>, which play an important role in the recent neural nets revolution.</p>
<p><strong>Side Note:</strong> Expert readers will recognize that in the standard accounts of neural net training,
the actual quantity of interest is the gradient of the <em>training loss</em>, which happens to be a simple function of the network output. But the above phrasing is fully general since one can simply add a new output node to the network that computes the training loss from the old output. Then the quantity of interest is indeed the gradient of this new output with respect to network parameters.</p>
<h2 id="problem-setup">Problem Setup</h2>
<p>Backpropagation applies only to acyclic networks with directed edges. (Later we briefly sketch its use on networks with cycles.)</p>
<p>Without loss of generality, acyclic networks can be visualized as being structured in numbered layers, with nodes in the $t+1$th layer getting all their inputs from the outputs of nodes in layers $t$ and earlier. We use $f \in \mathbb{R}$ to denote the output of the network. In all our figures, the input of the network is at the bottom and the output on the top.</p>
<p>We start with a simple claim that reduces the problem of computing the gradient to the problem of computing partial derivatives with respect to the nodes:</p>
<blockquote>
<p><strong>Claim 1:</strong> To compute the desired gradient with respect to the parameters, it suffices to compute $\partial f/\partial u$ for every node $u$.</p>
</blockquote>
<p>Let’s be clear what $\partial f/\partial u$ means. Suppose we cut off all the incoming edges of the node $u$, and fix/clamp the current values of all network parameters. Now imagine changing $u$ from its current value. This change may affect values of nodes at higher levels that are connected to $u$, and the final output $f$ is one such node. Then $\partial f/\partial u$ denotes the rate at which $f$ will change as we vary $u$. (Aside: Readers familiar with the usual exposition of back-propagation should note that there $f$ is the training error and this $\partial f/\partial u$ turns out to be exactly the “error” propagated back to the node $u$.)</p>
<p>Claim 1 is a direct application of the chain rule; let’s illustrate it for simple neural nets (we address more general networks later). Suppose node $u$ is a weighted sum of the nodes $z_1,\dots, z_m$ (which will be passed through a non-linear activation $\sigma$ afterwards). That is, we have $u = w_1z_1+\dots+w_mz_m$. By the Chain rule, we have</p>
<script type="math/tex; mode=display">\frac{\partial f}{\partial w_1} = \frac{\partial f}{\partial u}\cdot \frac{\partial{u}}{\partial w_1} = \frac{\partial f}{\partial u}\cdot z_1.</script>
<p>Hence, we see that having computed $\partial f/\partial u$ we can compute $\partial f/\partial w_1$, and moreover this can be done locally by the endpoints of the edge where $w_1$ resides.</p>
<div style="text-align:center;">
<img style="width:500px;" src="http://www.cs.princeton.edu/~tengyu/forblog/weight5.jpg" />
</div>
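<p>A minimal numeric sketch of Claim 1 for a single weighted-sum node. The choice $f = \tanh(u)$ and the particular values of $w$ and $z$ are assumptions made only for illustration; the check compares the local formula $\partial f/\partial w_1 = \partial f/\partial u \cdot z_1$ with a finite-difference estimate.</p>

```python
import math

def forward(w, z):
    """Tiny net (illustrative): u = w . z, then f = tanh(u). Returns (u, f)."""
    u = sum(wi * zi for wi, zi in zip(w, z))
    return u, math.tanh(u)

w = [0.5, -0.3, 0.8]   # hypothetical weights
z = [1.0, 2.0, -1.5]   # hypothetical values of the input nodes
u, f = forward(w, z)

# Derivative with respect to the node u: d tanh(u)/du = 1 - tanh(u)^2.
df_du = 1.0 - f * f

# Claim 1: each df/dw_i is obtained locally as df/du * z_i.
grad_w = [df_du * zi for zi in z]

# Central finite-difference estimate of df/dw_1 for comparison.
eps = 1e-6
w_plus = [w[0] + eps] + w[1:]
w_minus = [w[0] - eps] + w[1:]
fd = (forward(w_plus, z)[1] - forward(w_minus, z)[1]) / (2 * eps)

print(grad_w[0], fd)
```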
<h3 id="multivariate-chain-rule">Multivariate Chain Rule</h3>
<p>Towards computing the derivatives with respect to the nodes, we first recall the multivariate Chain rule, which handily describes the relationships between these partial derivatives (depending on the graph structure).</p>
<p>Suppose a variable $f$ is a function of variables $u_1,\dots, u_n$, which in turn depend on the variable $z$. Then, multivariate Chain rule says that</p>
<script type="math/tex; mode=display">\frac{\partial f}{\partial z} = \sum_{j=1}^n \frac{\partial f}{\partial u_j}\cdot \frac{\partial u_j}{\partial z}~.</script>
<p>This is a direct generalization of eqn. (2) and a sub-case of eqn. (11) in this <a href="http://mathworld.wolfram.com/ChainRule.html">description of chain rule</a>.</p>
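<p>As a quick sanity check of the formula, here is a small sketch with two intermediate variables. The functions $u_1 = z^2$, $u_2 = \sin z$ and $f = u_1 u_2$ are hypothetical choices, not taken from the figures in this post.</p>

```python
import math

def u_nodes(z):
    """Intermediate variables (illustrative): u_1 = z^2, u_2 = sin(z)."""
    return [z * z, math.sin(z)]

def f_of_u(u):
    """Output as a function of the intermediate variables: f = u_1 * u_2."""
    return u[0] * u[1]

z = 0.7
u = u_nodes(z)
df_du = [u[1], u[0]]              # partial f / partial u_j
du_dz = [2 * z, math.cos(z)]      # partial u_j / partial z

# Multivariate chain rule: sum over j of (df/du_j) * (du_j/dz).
df_dz = sum(a * b for a, b in zip(df_du, du_dz))

# Central finite-difference estimate for comparison.
eps = 1e-6
fd = (f_of_u(u_nodes(z + eps)) - f_of_u(u_nodes(z - eps))) / (2 * eps)

print(df_dz, fd)
```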
<p>This formula is perfectly suitable for our cases. Below is the same example as we used before but with a different focus and numbering of the nodes.</p>
<div style="text-align:center;">
<img style="width:500px;" src="http://www.cs.princeton.edu/~tengyu/forblog/chain_rule_5.jpg" />
</div>
<p>We see that once we have computed the derivatives with respect to all the nodes that are above the node $z$, we can compute the derivative with respect to the node $z$ via a weighted sum, where the weights involve the local derivative ${\partial u_j}/{\partial z}$ that is often easy to compute. This brings us to the question of how we measure running time. For book-keeping, we assume that</p>
<blockquote>
<p><strong>Basic assumption:</strong> If $u$ is a node at level $t+1$ and $z$ is any node at level $\leq t$ whose output is an input to $u$, then computing $\frac{\partial u}{\partial z}$ takes unit time on our computer.</p>
</blockquote>
<h3 id="naive-feedforward-algorithm-not-efficient">Naive feedforward algorithm (not efficient!)</h3>
<p>It is useful to first point out the naive quadratic time algorithm implied by the chain rule. Most authors skip this trivial version, which we think is analogous to teaching sorting using only quicksort, and skipping over the less efficient bubblesort.</p>
<p>The naive algorithm is to compute $\partial u_i/\partial u_j$ for every pair of nodes where $u_i$ is at a higher level than $u_j$. Of course, among these $V^2$ values (where $V$ is the number of nodes) are also the desired ${\partial f}/{\partial u_i}$ for all $i$ since $f$ is itself the value of the output node.</p>
<p>This computation can be done in feedforward fashion. If such value has been obtained for every $u_j$ on the level up to and including level $t$, then one can express (by inspecting the multivariate chain rule) the value $\partial u_{\ell}/\partial u_j$ for some $u_{\ell}$ at level $t+1$ as a weighted combination of values $\partial u_{i}/\partial u_j$ for each $u_i$ that is a direct input to $u_{\ell}$. This description shows that the amount of computation for a fixed $j$ is proportional to the number of edges $E$. This amount of work happens for all $V$ values of $j$, letting us conclude that the total work in the algorithm is $O(VE)$.</p>
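<p>The naive algorithm can be sketched as follows. The graph encoding (a list <code>inputs</code> holding, for each node, its direct inputs together with the already-evaluated local derivatives) and the four-node example are assumptions made for illustration.</p>

```python
def naive_all_pairs(num_nodes, inputs):
    """inputs[l] lists (i, d u_l / d u_i) for every direct input i of node l.
    Nodes are numbered in topological order (inputs have smaller indices).
    Returns D with D[l][j] = d u_l / d u_j for all pairs."""
    D = [[0.0] * num_nodes for _ in range(num_nodes)]
    for j in range(num_nodes):                 # V choices of j ...
        D[j][j] = 1.0
        for l in range(j + 1, num_nodes):      # ... each pass touches every edge at most once
            D[l][j] = sum(local * D[i][j] for i, local in inputs[l])
    return D

# Illustrative graph evaluated at u0 = 2:
#   u1 = u0^2, u2 = 3*u0, u3 = u1*u2 (so f = u3 = 3*u0^3).
inputs = [
    [],                      # u0 is the input
    [(0, 4.0)],              # d u1 / d u0 = 2*u0 = 4
    [(0, 3.0)],              # d u2 / d u0 = 3
    [(1, 6.0), (2, 4.0)],    # d u3 / d u1 = u2 = 6, d u3 / d u2 = u1 = 4
]
D = naive_all_pairs(4, inputs)
print(D[3])   # last row holds d f / d u_j for every node j
```

<p>The last row of <code>D</code> contains the desired derivatives of the output, but the algorithm also fills in all the other rows, which is where the extra factor of $V$ comes from.</p>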
<h2 id="backpropagation-linear-time">Backpropagation (Linear Time)</h2>
<p>The more efficient backpropagation, as the name suggests, computes the partial derivatives in the reverse direction. Messages are passed in one wave backwards from higher-numbered layers to lower-numbered layers. (Some presentations of the algorithm describe it as dynamic programming.)</p>
<blockquote>
<p><strong>Messaging protocol:</strong>
The node $u$ receives a message along each outgoing edge from the node at the other end of that edge. It sums these messages to get a number $S$ (if $u$ is the output of the entire net, then define $S=1$) and then it sends the following message to any node $z$ adjacent to it at a lower level:
<script type="math/tex">S \cdot \frac{\partial u}{\partial z}</script></p>
</blockquote>
<p>Clearly, the amount of work done by each node is proportional to its degree, and thus overall work is the sum of the node degrees. Summing all node degrees counts each edge twice, and thus the overall work is
$O(\text{Network Size})$.</p>
<p>To prove correctness, we prove the following:</p>
<blockquote>
<p><strong>Main Claim</strong>: At each node $z$, the value $S$ is exactly ${\partial f}/{\partial z}$.</p>
</blockquote>
<p><em>Base Case</em>: At the output layer this is true, since ${\partial f}/{\partial f} =1$.</p>
<p><em>Inductive case</em>: Suppose the claim is true for layers $t+1$ and higher, and $z$ is at layer $t$, with outgoing edges going to some nodes $u_1, u_2, \ldots, u_m$ at levels $t+1$ or higher. By the inductive hypothesis, node $z$ indeed receives $ \frac{\partial f}{\partial u_j}\times \frac{\partial u_j}{\partial z}$ from each $u_j$. Thus by the Chain rule,
<script type="math/tex">S= \sum_{i =1}^m \frac{\partial f}{\partial u_i}\frac{\partial u_i}{\partial z}=\frac{\partial f}{\partial z}.</script>
This completes the induction and proves the Main Claim.</p>
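<p>A minimal sketch of the messaging protocol on a small graph. The encoding of the graph (for each node, its direct inputs paired with evaluated local derivatives) and the example values are assumptions for illustration. Each edge is visited exactly once in the backward wave.</p>

```python
def backprop(num_nodes, inputs):
    """inputs[u] lists (z, d u / d z) for every direct input z of node u.
    Nodes are numbered in topological order and the last node is the output f.
    Returns S with S[z] = d f / d u_z."""
    S = [0.0] * num_nodes
    S[num_nodes - 1] = 1.0                       # output node: S = 1
    for u in range(num_nodes - 1, -1, -1):       # one wave from high to low
        for z, du_dz in inputs[u]:
            S[z] += S[u] * du_dz                 # message S * (du/dz) sent to z
    return S

# Illustrative graph evaluated at u0 = 2:
#   u1 = u0^2, u2 = 3*u0, u3 = u1*u2 (so f = u3 = 3*u0^3 and df/du0 = 9*u0^2).
inputs = [
    [],
    [(0, 4.0)],
    [(0, 3.0)],
    [(1, 6.0), (2, 4.0)],
]
S = backprop(4, inputs)
print(S)
```

<p>Because nodes are processed in reverse topological order, every node has received all of its messages before it sends its own, which is exactly what the induction uses.</p>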
<h2 id="auto-differentiation">Auto-differentiation</h2>
<p>Since the exposition above used almost no details about the network and the operations that the nodes perform, it extends to every computation that can be organized as an acyclic graph in which each node computes a differentiable function of its incoming neighbors. This observation underlies many auto-differentiation packages such as <a href="https://github.com/HIPS/autograd">autograd</a> or <a href="https://www.tensorflow.org/">tensorflow</a>: they allow computing the gradient of the output of such a computation with respect to the network parameters.</p>
<p>We first observe that Claim 1 continues to hold in this very general setting. No generality is lost, because we can view the parameters associated with the edges as also sitting on nodes (actually, leaf nodes). This can be done via a simple transformation of the network; for a single node it is shown in the picture below, and one would need to continue this transformation in the rest of the network feeding into $u_1, u_2, \dots$ from below.</p>
<div style="text-align:center;">
<img style="width:800px;" src="http://www.cs.princeton.edu/~tengyu/forblog/change_view" />
</div>
<p>Then, we can use the messaging protocol to compute the derivatives with respect to the nodes, as long as the local partial derivative can be computed efficiently. We note that the algorithm can be implemented in a fairly modular manner: For every node $u$, it suffices to specify (a) how it depends on the incoming nodes, say, $z_1,\dots, z_n$ and (b) how to compute the partial derivative times $S$, that is, $S \cdot \frac{\partial u}{\partial z_j}$.</p>
<p><em>Extension to vector messages</em>: In fact (b) can be done efficiently in more general settings where we allow the output of each node in the network to be a vector (or even matrix/tensor) instead of only a real number. Here we need to replace $\frac{\partial u}{\partial z_j}\cdot S$ by $\frac{\partial u}{\partial z_j}[S]$, which denotes the result of applying the operator $\frac{\partial u}{\partial z_j}$ to $S$. We note that to be consistent with the convention in the usual exposition of backpropagation, when $y\in \mathbb{R}^{p}$ is a function of $x\in \mathbb{R}^q$, we use $\frac{\partial y}{\partial x}$ to denote the $q\times p$ matrix with $\partial y_j/\partial x_i$ as the $(i,j)$-th entry. Readers might notice that this is the transpose of the usual Jacobian matrix defined in mathematics. Thus $\frac{\partial y}{\partial x}$ is an operator that maps $\mathbb{R}^p$ to $\mathbb{R}^q$, and we can verify that $S$ has the same dimension as $u$ and $\frac{\partial u}{\partial z_j}[S]$ has the same dimension as $z_j$.</p>
<p>For example, as illustrated below, suppose the node $U\in \mathbb{R}^{d_1\times d_3}$ is the product $U = WZ$ of two matrices $W\in \mathbb{R}^{d_1\times d_2}$ and $Z\in \mathbb{R}^{d_2\times d_3}$. Then $\partial U/\partial Z$ is a linear operator that maps $\mathbb{R}^{d_1\times d_3}$ to $\mathbb{R}^{d_2\times d_3}$, which naively requires a matrix representation of dimension $d_2d_3\times d_1d_3$. However, the computation (b) can be done efficiently because
<script type="math/tex">\frac{\partial U}{\partial Z}[S]= W^{\top}S.</script></p>
<p>Such vector operations can also be implemented efficiently using today’s GPUs.</p>
<div style="text-align:center;">
<img style="width:200px;" src="http://www.cs.princeton.edu/~tengyu/forblog/mult.jpg" />
</div>
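<p>A small pure-Python check of the matrix example, under the assumption that the node is $U = WZ$ with $W$ of shape $d_1\times d_2$ and $Z$ of shape $d_2\times d_3$, so that the incoming message $S$ has the shape of $U$. The scalar $f = \sum_{ij} S_{ij}U_{ij}$ is introduced only so that a finite-difference estimate can be compared with one entry of $W^{\top}S$.</p>

```python
def matmul(A, B):
    """Plain nested-list matrix product."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def transpose(A):
    return [list(row) for row in zip(*A)]

# Illustrative shapes: d1 = 3, d2 = 2, d3 = 4.
W = [[1.0, 2.0], [0.0, -1.0], [3.0, 0.5]]
Z = [[0.5, -1.0, 2.0, 0.0], [1.5, 0.0, -0.5, 1.0]]
S = [[1.0, 0.0, 2.0, -1.0], [0.5, 1.0, 0.0, 0.0], [0.0, -2.0, 1.0, 3.0]]

def f(Zmat):
    """Scalar whose derivative with respect to U is exactly S."""
    U = matmul(W, Zmat)
    return sum(S[i][j] * U[i][j] for i in range(3) for j in range(4))

# The message sent to Z: no d2*d3 x d1*d3 matrix is ever formed.
grad_Z = matmul(transpose(W), S)

# Finite-difference check of a single entry, Z[1][2].
eps = 1e-4
Zp = [row[:] for row in Z]
Zm = [row[:] for row in Z]
Zp[1][2] += eps
Zm[1][2] -= eps
fd = (f(Zp) - f(Zm)) / (2 * eps)

print(grad_Z[1][2], fd)
```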
<h2 id="notable-extensions">Notable Extensions</h2>
<p>1) <em>Allowing weight tying.</em> In many neural architectures, the designer wants to force many network units such as edges or nodes to share the same parameter. For example, in <a href="https://en.wikipedia.org/wiki/Convolutional_neural_network"><em>convolutional neural nets</em></a>, the same filter has to be applied all over the image, which implies reusing the same parameter for a large set of edges between the two layers.</p>
<p>For simplicity, suppose two parameters $a$ and $b$ are supposed to share the same value. This is equivalent to adding a new node $u$ and connecting $u$ to both $a$ and $b$ with the operation $a = u$ and $b=u$. Thus, by chain rule, <script type="math/tex">\frac{\partial f}{\partial u} = \frac{\partial f}{\partial a}\cdot \frac{\partial a}{\partial u}+\frac{\partial f}{\partial b}\cdot \frac{\partial b}{\partial u} = \frac{\partial f}{\partial a}+ \frac{\partial f}{\partial b}.</script> Hence, equivalently, the gradient with respect to a shared parameter is the sum of the gradients with respect to individual occurrences.</p>
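<p>A quick numeric illustration of the summation rule for tied parameters. The function <code>f</code> and its constants are hypothetical; the point is only that the derivative with respect to the shared value equals the sum of the two individual partial derivatives.</p>

```python
def f(a, b, x1=2.0, x2=3.0):
    """Illustrative function with two parameter slots a and b."""
    return a * x1 + b * b * x2

u = 1.5                      # shared value: a = u and b = u

df_da = 2.0                  # partial f / partial a = x1
df_db = 2 * u * 3.0          # partial f / partial b = 2*b*x2 at b = u
tied_grad = df_da + df_db    # gradient with respect to the shared parameter

# Finite-difference derivative of u -> f(u, u) for comparison.
eps = 1e-6
fd = (f(u + eps, u + eps) - f(u - eps, u - eps)) / (2 * eps)

print(tied_grad, fd)
```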
<p>2) <em>Backpropagation on networks with loops.</em> The above exposition assumed the network is acyclic. Many cutting-edge applications such as machine translation and language understanding use networks with directed loops (e.g., recurrent neural networks). These architectures —all examples of the “differentiable computing” paradigm below—can get complicated and may involve operations on a separate memory as well as mechanisms to shift attention to different parts of data and memory.</p>
<p>Networks with loops are trained using gradient descent as well, using <a href="https://en.wikipedia.org/wiki/Backpropagation_through_time">back-propagation through time</a>, which consists of expanding the network through a finite number of time steps into an acyclic graph, with replicated copies of the same
network. These replicas share the weights (weight tying!) so the gradient can be computed. In practice an issue may arise with <a href="https://en.wikipedia.org/wiki/Vanishing_gradient_problem">exploding or vanishing gradients</a> which impact convergence. Such issues can be carefully addressed in practice by clipping the gradient or re-parameterization techniques such as <a href="https://en.wikipedia.org/wiki/Long_short-term_memory">long short-term memory</a>.</p>
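<p>A minimal sketch of back-propagation through time on a scalar recurrence. The model $h_t = \tanh(w h_{t-1} + x_t)$ with output $f = h_T$ is an assumption chosen for brevity, not a model discussed in this post. The unrolled graph has one copy of $w$ per time step, and the gradient is the sum of the contributions of the copies.</p>

```python
import math

def unrolled_forward(w, xs, h0=0.0):
    """Unroll h_t = tanh(w * h_{t-1} + x_t) and return [h_0, ..., h_T]."""
    hs = [h0]
    for x in xs:
        hs.append(math.tanh(w * hs[-1] + x))
    return hs

def bptt_grad(w, xs, h0=0.0):
    """Gradient of f = h_T with respect to the shared weight w."""
    hs = unrolled_forward(w, xs, h0)
    grad_w = 0.0
    S = 1.0                                  # d f / d h_T
    for t in range(len(xs), 0, -1):
        local = 1.0 - hs[t] ** 2             # derivative of tanh at step t
        grad_w += S * local * hs[t - 1]      # contribution of the copy of w used at step t
        S = S * local * w                    # message passed back to h_{t-1}
    return grad_w

w = 0.8
xs = [0.5, -0.3, 0.9, 0.1]
g = bptt_grad(w, xs)

# Finite-difference check on the recurrent computation itself.
eps = 1e-6
fd = (unrolled_forward(w + eps, xs)[-1] - unrolled_forward(w - eps, xs)[-1]) / (2 * eps)

print(g, fd)
```

<p>The repeated multiplication of <code>S</code> by <code>local * w</code> is also where exploding or vanishing gradients come from when the sequence is long.</p>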
<p>The fact that the gradient can be computed efficiently for such general networks with loops has motivated neural net models with memory or even data structures (see for example <a href="https://en.wikipedia.org/wiki/Neural_Turing_machine">neural Turing machines</a> and <a href="https://en.wikipedia.org/wiki/Differentiable_neural_computer">differentiable neural computer</a>). Using gradient descent, one can optimize over a family of parameterized networks with loops to find the best one that solves a certain computational task (on the training examples). The limits of these ideas are still being explored.</p>
<p>3) <em>Hessian-vector product in linear time.</em> It is possible to generalize backprop to enable 2nd order optimization in “near-linear” time, not just gradient descent,
as shown in recent independent manuscripts of <a href="https://arxiv.org/pdf/1611.00756.pdf">Carmon et al.</a> and <a href="https://arxiv.org/pdf/1611.01146.pdf">Agarwal et al.</a> (NB: Tengyu is a coauthor on this one.). One essential step is to
compute the product of the <a href="https://en.wikipedia.org/wiki/Hessian_matrix">Hessian matrix</a> and a vector, for which <a href="http://www.bcl.hamilton.ie/~barak/papers/nc-hessian.pdf">Pearlmutter’93</a> gave an efficient algorithm. Here we show how to do this in $O(\mbox{Network size})$ using the ideas above. We need a slightly stronger version of the back-propagation result than the one in the previous subsection:</p>
<blockquote>
<p><strong>Claim (informal):</strong> Suppose an acyclic network with $V$ nodes and $E$ edges has output $f$ and leaves $z_1,\dots, z_m$. Then there exists a network of size $O(V+E)$ that has $z_1,\dots, z_m$ as input nodes and $\frac{\partial f}{\partial z_1},\dots, \frac{\partial f}{\partial z_m}$ as output nodes.</p>
</blockquote>
<p>The proof of the Claim follows in straightforward fashion from implementing the message passing protocol as an acyclic circuit.</p>
<p>Next we show how to compute $\nabla^2 f(z)\cdot v$ where $v$ is a given fixed vector. Let $g(z)= \langle \nabla f(z),v\rangle$ be a function from $\mathbb{R}^m$ to $\mathbb{R}$, where $m$ is the number of leaves. By the Claim above, $\nabla f(z)$, and hence $g(z)$, can be computed by a network of size $O(V+E)$. Applying the Claim again to $g$, we obtain that $\nabla g(z)$ can also be computed by a network of size $O(V+E)$.</p>
<p>Note that by construction,
<script type="math/tex">\nabla g(z) = \nabla^2 f(z)\cdot v.</script>
Hence we have computed the Hessian vector product in network size time.</p>
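<p>The two applications of the Claim can be sketched with a toy reverse-mode differentiator whose backward messages are themselves graph nodes, so that the gradient is again a network that can be differentiated. The class <code>Var</code>, the helper names and the example function are all hypothetical and support only addition and multiplication; this is an illustration of the idea, not how any particular package implements it.</p>

```python
class Var:
    """A node of the computation graph; parents holds (input, local derivative)."""
    def __init__(self, value, parents=()):
        self.value = value
        self.parents = parents

    def __add__(self, other):
        other = lift(other)
        return Var(self.value + other.value, ((self, Var(1.0)), (other, Var(1.0))))

    def __mul__(self, other):
        other = lift(other)
        return Var(self.value * other.value, ((self, other), (other, self)))

    __radd__ = __add__
    __rmul__ = __mul__


def lift(x):
    """Wrap plain numbers as constant nodes."""
    return x if isinstance(x, Var) else Var(float(x))


def grad(f, leaves):
    """Backward message passing; messages are Vars, so the result is a graph."""
    order, seen = [], set()

    def visit(v):                      # post-order DFS gives a topological order
        if id(v) not in seen:
            seen.add(id(v))
            for p, _ in v.parents:
                visit(p)
            order.append(v)

    visit(f)
    adj = {id(f): Var(1.0)}            # S = 1 at the output
    for v in reversed(order):          # outputs first, inputs last
        for p, local in v.parents:
            msg = adj[id(v)] * local
            adj[id(p)] = adj[id(p)] + msg if id(p) in adj else msg
    return [adj.get(id(z), Var(0.0)) for z in leaves]


def hessian_vector_product(f_builder, z_values, v):
    z = [Var(val) for val in z_values]
    first = grad(f_builder(z), z)                               # network for grad f
    g = sum((gi * vi for gi, vi in zip(first, v)), Var(0.0))    # g = <grad f, v>
    return [h.value for h in grad(g, z)]                        # grad g = Hessian * v


def f_example(z):
    """Illustrative function f(z) = z1^2 * z2 + z2^3."""
    return z[0] * z[0] * z[1] + z[1] * z[1] * z[1]


hv = hessian_vector_product(f_example, [1.0, 2.0], [3.0, -1.0])
print(hv)
```

<p>For this example the Hessian at $(1,2)$ is $\begin{pmatrix}4 & 2\\ 2 & 12\end{pmatrix}$, so the product with $v=(3,-1)$ can be checked by hand.</p>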
<h2 id="thats-all">That’s all!</h2>
<p>Please write your comments on this exposition and whether it can be improved.</p>
Tue, 20 Dec 2016 09:00:00 -0800
http://offconvex.github.io/2016/12/20/backprop/
The search for biologically plausible neural computation: The conventional approach
<p>Inventors of the original artificial neural networks (NNs) derived their inspiration from biology.
However, as artificial NNs progressed, their design was less guided by neuroscience facts.
Meanwhile, progress in neuroscience has altered our conceptual understanding of neurons.
Consequently, we believe that many successful artificial NNs resemble natural NNs only
superficially, violating fundamental constraints imposed by biological hardware.</p>
<p>The wide gap between the artificial and natural NN designs raises intriguing questions: What
algorithms underlie natural NNs? Can insights from biology help build better artificial
NNs?</p>
<p>This is the first of a series of posts aimed at explaining recent progress made
by my collaborators and myself towards biologically plausible NNs. Such networks can
serve both as models of natural NNs and as general
purpose artificial NNs. We have found that respecting biological constraints
actually helps development of artificial NNs by guiding design decisions.</p>
<p>In this post, I cover the background material, going back several decades.
I sketch a biological neuron, introduce primary biological constraints, and discuss
the conventional approach to deriving artificial NNs. I will show that while
the conventional approach generates a reasonable algorithmic model of a <em>single</em>
biological neuron, multi-neuron networks violate biological constraints. In future
posts we will see how to fix that.</p>
<h2 id="a-sketch-of-a-biological-neuron">A Sketch of a Biological Neuron</h2>
<p>Here is the minimum biological background needed to understand the rest of the post.</p>
<p>A biological neuron receives signals from multiple neurons, computes their weighted sum and
generates a signal transmitted to multiple neurons, Figure 1. Each neuron’s signaling activity is
quantified by the <em>firing rate</em>, which is a nonnegative real number that varies over time. Each synapse
scales the input from the corresponding upstream neuron onto the receiving neuron by its
weight. The receiving neuron sums scaled inputs, i.e. computes the inner product of the
upstream activity vector and the synaptic weight vector. The inner product passes
through a nonlinearity called the activation function and the output is transmitted
to downstream neurons.</p>
<p>Synaptic weights change over time, typically on a slower time scale than neuronal signals. The
weight depends on neuronal signals per so-called learning rules. For example, in commonly
used <a href="https://en.wikipedia.org/wiki/Hebbian_theory">Hebbian learning rules</a>, synaptic weight is
proportional to the correlation between the activities of the two neurons a synapse connects,
i.e. pre- and postsynaptic.</p>
<p><img src="https://docs.google.com/uc?export=download&id=0B5Pfjcu55RQQb0F1MllnRTdweDQ" alt="" title="Title" />
Figure 1: A biological neuron modelled by an online algorithm. Left: A biological neuron receives
inputs from the upstream neurons (green) which are scaled by the weights of corresponding
synapses (blue). The neuron (black) computes output, $y$, as a function of the weighted input sum. Right: Online algorithm outputs an activation function of the inner product of the
synaptic weight vector and an upstream activity vector. Synaptic weights are modified by
neuronal activities (dashed line) per learning rules.</p>
<h2 id="primary-biological-constraints">Primary Biological Constraints</h2>
<p>To determine which algorithmic models in this post are biologically plausible,
we can focus on a few key biological constraints.</p>
<p>Biologically plausible algorithms must be formulated in the <strong>online</strong> (or streaming), rather than
offline (or batch), setting. This means that input data are streamed to the algorithm
sequentially, one sample at a time, and the corresponding output must be computed before
the next input sample arrives. The output communicated to downstream neurons cannot be
modified in the future. A neuron cannot store individual past inputs or outputs except in a
highly compressed format limited to synaptic weights and a few state variables.</p>
<p>In biologically plausible NNs, learning rules must be <strong>local</strong>. This means that the synaptic weight
update may depend on the activities of only the two neurons a synapse connects, as for
example, in Hebbian learning. Activities of other neurons are not physically available to a
synapse and therefore including them into learning rules would be biologically
implausible. Modern artificial NNs, such as
backpropagation-based deep learning networks, rely on nonlocal learning rules.</p>
<p>Our initial focus is on <strong>unsupervised</strong> learning. This is not a hard constraint, but rather a matter of
priority. Whereas humans are clearly capable of supervised learning, most of our learning tasks
lack big labeled datasets. On the mechanistic level, most neurons lack a clear supervision signal.</p>
<h2 id="single-neuron-online-principal-component-analysis-pca">Single-neuron Online Principal Component Analysis (PCA)</h2>
<p>In 1982, <a href="https://pdfs.semanticscholar.org/38bc/c5d342accf5249514cdfdaaa40871a93252c.pdf">Oja proposed</a>
modeling a neuron by an online PCA algorithm. PCA is a workhorse of data
analysis used for dimensionality reduction, denoising, and latent factor
discovery. Therefore, Oja’s seminal paper established that biological processes in
a neuron can be viewed as the steps of an online algorithm solving a useful
computational objective.</p>
<p>Oja’s single-neuron online PCA algorithm works as follows. At each time step, $t$,
it receives an input data sample, ${\bf x}_t$, and computes and outputs
the corresponding top principal component value, $y_t$:</p>
<script type="math/tex; mode=display">y_t \leftarrow {\bf w} _{t-1}^\top {\bf x}_t. \qquad \qquad \qquad (1.1)</script>
<p>Here and below lowercase boldfaced letters designate vectors. Then the algorithm updates the
(normalized) feature vector,</p>
<script type="math/tex; mode=display">{\bf w} _t \leftarrow {\bf w} _{t-1}+ \eta ( {\bf x} _t- {\bf w} _{t-1} y_t ) y_t. \qquad \qquad (1.2)</script>
<p>The feature vector, ${\bf w}$, converges to the top eigenvector of the input covariance (the one with the largest eigenvalue) if data are drawn i.i.d. from
a stationary distribution.</p>
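<p>A small simulation sketch of updates (1.1) and (1.2). The two-dimensional Gaussian data, the step size $\eta = 0.01$, the number of samples and the initial vector are assumptions chosen for illustration; the data are built so that the top eigenvector of the covariance is known in advance.</p>

```python
import math
import random

def oja(samples, eta=0.01):
    """Single-neuron online PCA: output (1.1) followed by weight update (1.2)."""
    w = [1.0, 0.0]                                   # arbitrary unit-norm start
    for x in samples:
        y = w[0] * x[0] + w[1] * x[1]                # (1.1): y_t = w^T x_t
        w = [wi + eta * (xi - wi * y) * y            # (1.2): Hebbian term plus normalization
             for wi, xi in zip(w, x)]
    return w

random.seed(0)
theta = math.pi / 4
v = (math.cos(theta), math.sin(theta))               # top eigenvector by construction
v_perp = (-v[1], v[0])

samples = []
for _ in range(5000):
    a = random.gauss(0.0, 2.0)                       # large variance along v
    b = random.gauss(0.0, 0.5)                       # small variance along v_perp
    samples.append((a * v[0] + b * v_perp[0], a * v[1] + b * v_perp[1]))

w = oja(samples)
alignment = abs(w[0] * v[0] + w[1] * v[1])           # |cos| of the angle between w and v
norm = math.hypot(w[0], w[1])
print(alignment, norm)
```

<p>Note that the update touches each weight using only the corresponding input $x_i$, the output $y$ and the weight itself, which is the locality property discussed in this post.</p>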
<p>The steps of the Oja algorithm (1.1-1.2) correspond to the operations of the biological neuron. If the
input vector is represented by the activities of the upstream neurons, (1.1) represents weighted
summation of the inputs by the output neuron. If the activation function is linear, the output, $y_t$,
is simply the weighted sum. The update (1.2) is a local Hebbian synaptic learning rule. The
first term of the update is proportional to the correlation of the pre- and postsynaptic neurons’
activities and the second term, also local, normalizes the synaptic weight vector.</p>
<h2 id="a-normative-theory">A Normative Theory</h2>
<p>Next, we would like to build on Oja’s insightful identification of biological
processes with the steps of the online PCA algorithm by computing multiple
principal components using multi-neuron NNs and including the activation nonlinearity.</p>
<p>Instead of trying to extend the Oja model heuristically, we take a more
systematic, so-called normative approach. In this approach, a biological
model is viewed as the solution of an optimization problem. Specifically, we
postulate an objective function motivated by a computational principle,
derive an online algorithm optimizing such objective, and map the steps
of the algorithm onto biological processes.</p>
<p>Having such normative theory allows us to navigate through the space of
possible algorithmic models in a more efficient and systematic way. Mathematical
compactness of objective functions facilitates generating new models and
weeding out inconsistent ones. This is similar to the Hamiltonian approach
in physics which leverages natural symmetries and safeguards against the violation
of the first law of thermodynamics (energy conservation).</p>
<h2 id="deriving-a-single-neuron-online-pca-using-the-reconstruction-approach">Deriving a Single-neuron Online PCA using the Reconstruction Approach</h2>
<p>To build a normative theory, we first need to derive Oja’s single-neuron online algorithm by
solving an optimization problem. What objective function should we choose for online PCA?
Historically, neural activity has often been viewed as representing each data sample, ${\bf x}_t$, by the
feature vector, ${\bf w}$, scaled by the output, $y_t$ (Figure 2). Such a reconstruction approach is naturally
formalized as the minimization of the reconstruction (or coding) error:</p>
<script type="math/tex; mode=display">\min_{\| {\bf w} \|=1} {\sum \limits_{t=1}^{T} \min_{ y_t} {\| {\bf x}_t-{\bf w} y_t \| ^2_2}}. \qquad \qquad \qquad \quad (1.3)</script>
<p>In the offline setting, optimization problem (1.3) is solved by PCA: the optimum ${\bf w}$ is the
eigenvector of input covariance corresponding to the top eigenvalue and the optimum output, $y$, is the first principal component.</p>
<p><img src="https://docs.google.com/uc?export=download&id=0B5Pfjcu55RQQaUtLTmFmbjZ3eWs" alt="" /></p>
<p>Figure 2. PCA represents data samples (circles) by their projections (red) onto the top
eigenvector, ${\bf w}$. These projections constitute the top principal component. Objective (1.3)
minimizes the reconstruction error (blue).</p>
<p>In the online setting, (1.3) can be solved by alternating minimization, which has been a subject
of <a href="http://www.offconvex.org/2016/05/08/almostconvexitySATM/">recent analysis</a>. After the arrival of each data point, <script type="math/tex">{\bf x}_t</script>, the algorithm computes optimum
output, $y_t$, while keeping the feature vector, ${\bf w}_{t-1}$, computed at the previous time step, fixed.
By using calculus, one finds that the optimum output is given by (1.1). Then, the algorithm
minimizes the total reconstruction error with respect to the feature vector while keeping all the
outputs fixed. Again, resorting to calculus, one finds (1.2).</p>
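<p>A sketch of the calculus behind these two steps, under the assumption (not spelled out above) that the second step is taken as a single gradient step on the current sample’s error, with the factor of 2 absorbed into the step size $\eta$. Setting the derivative with respect to $y_t$ to zero and using $\| {\bf w}_{t-1} \| = 1$ gives (1.1):</p>
<script type="math/tex; mode=display">\frac{\partial}{\partial y_t} \| {\bf x}_t - {\bf w}_{t-1} y_t \|^2_2 = -2\, {\bf w}_{t-1}^\top ( {\bf x}_t - {\bf w}_{t-1} y_t ) = 0 \quad \Longrightarrow \quad y_t = {\bf w}_{t-1}^\top {\bf x}_t.</script>
<p>Stepping against the gradient with respect to the feature vector, evaluated at ${\bf w}_{t-1}$, then gives the increment in (1.2):</p>
<script type="math/tex; mode=display">-\frac{1}{2} \nabla_{\bf w} \| {\bf x}_t - {\bf w} y_t \|^2_2 \, \Big|_{ {\bf w} = {\bf w}_{t-1} } = ( {\bf x}_t - {\bf w}_{t-1} y_t )\, y_t.</script>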
<p>Thus, the single-neuron online PCA algorithm may be derived using the reconstruction approach. To
compute multiple principal components, we need to extend this success to multi-neuron
networks.</p>
<h2 id="the-reconstruction-approach-fails-for-multi-neuron-networks">The Reconstruction Approach Fails for Multi-neuron Networks</h2>
<p>Though the reconstruction approach yields a multi-component online PCA
algorithm, the corresponding NNs are <em>not</em> biologically plausible.</p>
<p>Extension of the reconstruction error objective from single to multiple output components is
straightforward - each scalar, $y_t$, is replaced by a vector, ${\bf y}_t$:</p>
<script type="math/tex; mode=display">\min_{\rm diag({\bf W}^\top {\bf W})={\bf I}} {\sum \limits_{t=1}^{T} \min_{ {\bf y}_t} {\| {\bf x}_t-{\bf W} {\bf y}_t \| ^2_2}}. \qquad \qquad \qquad \quad (1.4)</script>
<p>Here matrix ${\bf W}$ comprises column-vectors corresponding to different features. As in the
single-neuron case, this objective can be optimized online by alternating minimization. After the arrival
of data sample, ${\bf x}_t$, the feature vectors are kept fixed while the objective (1.4) is minimized
with respect to the principal components by iterating the following update until convergence:</p>
<script type="math/tex; mode=display">{\bf y}_t \leftarrow {\bf W} _{t-1}^\top {\bf x}_t - {\bf W} _{t-1}^\top {\bf W} _{t-1} {\bf y}_t . \qquad \qquad \qquad (1.5)</script>
<p>Minimizing the total objective with respect to the feature vectors for fixed principal
components yields the following update:</p>
<script type="math/tex; mode=display">{\bf W} _t \leftarrow {\bf W} _{t-1}+ \eta ( {\bf x} _t- {\bf W} _{t-1} {\bf y}_t ) {\bf y}_t^\top . \qquad \qquad (1.6)</script>
<p>As before, in NN implementations of algorithm (1.5-1.6), feature vectors are represented by
synaptic weights and principal components by the activities of output neurons. Then (1.5) can
be implemented by a single-layer NN, Figure 3, in which activity dynamics converges
faster than the time interval between the arrival of successive data samples.</p>
<p>However, implementing update (1.6) in the single-layer NN architecture, Figure 3, requires
nonlocal learning rules making it biologically implausible. Indeed, the last term in (1.6) implies
that updating the weight of a synapse requires the knowledge of output activities of all other
neurons which are not available to the synapse. Moreover, the matrix of lateral connection
weights, $- {\bf W} _{t-1}^\top {\bf W} _{t-1}$, in the last term of (1.5) is computed as a Gramian of feedforward weights,
clearly a nonlocal operation. This problem is not limited to PCA and arises in networks of
nonlinear neurons as well.</p>
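<p>The nonlocality of update (1.6) can be seen in a tiny numeric sketch. The matrices, activities and step size below are hypothetical; the point is that the change of a single synaptic weight depends on the activity of an output neuron the synapse does not touch.</p>

```python
def delta_W(W, x, y, eta=0.1):
    """Increment of update (1.6), eta * (x - W y) y^T, written entry by entry."""
    n, k = len(W), len(W[0])
    residual = [x[i] - sum(W[i][m] * y[m] for m in range(k)) for i in range(n)]
    return [[eta * residual[i] * y[j] for j in range(k)] for i in range(n)]

W = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]   # 3 inputs, 2 output neurons
x = [1.0, 2.0, 3.0]

d_a = delta_W(W, x, [1.0, 1.0])
d_b = delta_W(W, x, [1.0, 5.0])            # only output neuron 1 changed its activity

# Synapse from input 2 to output neuron 0: its own pre- and postsynaptic
# activities (x[2] and y[0]) are identical in both cases, yet its update differs.
print(d_a[2][0], d_b[2][0])
```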
<p>Rather than deriving learning rules from a principled objective, many authors
constructed biologically plausible single-layer networks using local learning
rules, Hebbian for feedforward connections and anti-Hebbian for lateral connections (meaning there is a minus sign
in front of the correlation-based synaptic weight, as for the last term in (1.5)).
However, in my view, abandoning the normative approach creates more problems
than it solves.</p>
<p><img src="https://docs.google.com/uc?export=download&id=0B5Pfjcu55RQQMlRNVDVBRDJ0TEk" alt="" /></p>
<p>Figure 3. The single-layer NN implementation of the multi-neuron online PCA algorithm derived
using the reconstruction approach requires nonlocal learning rules.</p>
<p>I have outlined how the conventional reconstruction approach fails to generate
biologically plausible multi-neuron networks for online PCA. In the next post,
I will introduce an alternative approach that overcomes this limitation.
Moreover, this approach suggests a novel view of neural computation leading to
many interesting extensions.</p>
<p>(<em>Acknowledgement: I am grateful to Sanjeev Arora for his support and encouragement as well as to Cengiz Pehlevan, Leo Shklovskii, Emily Singer, and Thomas Lin for their comments on the earlier versions.</em>)</p>
Thu, 03 Nov 2016 03:00:00 -0700
http://offconvex.github.io/2016/11/03/MityaNN1/
Gradient Descent Learns Linear Dynamical Systems
<p>From text translation to video captioning, learning to map one sequence to another is an increasingly active research area in machine learning. Fueled by the success of recurrent neural networks in their many variants, the field has seen rapid advances over the last few years. Recurrent neural networks are typically trained using some form of stochastic gradient descent combined with backpropagation for computing derivatives. The fact that gradient descent finds a useful set of parameters is by no means obvious. The training objective is typically non-convex. The fact that the model is allowed to maintain state is an additional obstacle that makes training of recurrent neural networks challenging.</p>
<p>In this post, we take a step back to reflect on the mathematics of recurrent neural networks. Interpreting recurrent neural networks as dynamical systems, we will show that stochastic gradient descent successfully learns the parameters of an unknown <em>linear</em> dynamical system even though the training objective is non-convex. Along the way, we’ll discuss several useful concepts from control theory, a field that has studied linear dynamical systems for decades. Investigating stochastic gradient descent for learning linear dynamical systems not only bears out interesting connections between machine learning and control theory, it might also provide a useful stepping stone for a deeper understanding of recurrent neural networks more broadly.</p>
<h2 id="linear-dynamical-systems">Linear dynamical systems</h2>
<p>We focus on time-invariant single-input single-output systems. For an input sequence of real numbers $x_1,\dots, x_T\in \mathbb{R}$, the system maintains a sequence of hidden states $h_1,\dots, h_T\in \mathbb{R}^n$, and produces a sequence of outputs $y_1,\dots, y_T\in \mathbb{R}$ according to the following rules:</p>
<script type="math/tex; mode=display">h_{t+1} = Ah_t + Bx_t~~~~~~~~~~~~~~~~~~~~~</script>
<script type="math/tex; mode=display">\quad \quad\quad~y_t = Ch_t+Dx_t+\xi_t ~~~~~~~~~~~~~~~~(1)</script>
<p>Here $A,B,C,D$ are linear transformations with compatible dimensions, and $\xi_t$ is Gaussian noise added to the output at each time. In the learning problem, often called system identification in control theory, we observe samples of input-output pairs $((x_1,\dots, x_T),(y_1,\dots y_T))$ and aim to recover the parameters of the underlying linear system.</p>
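<p>A minimal simulation sketch of equation (1). The initial state $h_1 = 0$, the function name <code>simulate</code> and the example parameters are assumptions made for illustration; they are not specified above.</p>

```python
import random

def simulate(A, B, C, D, xs, noise_std=0.0, rng=None):
    """Run h_{t+1} = A h_t + B x_t and y_t = C h_t + D x_t + xi_t, with h_1 = 0 (assumed).
    A is an n x n nested list, B and C are length-n lists, D is a scalar."""
    rng = rng or random.Random(0)
    n = len(B)
    h = [0.0] * n
    ys = []
    for x in xs:
        xi = rng.gauss(0.0, noise_std) if noise_std > 0 else 0.0   # output noise
        ys.append(sum(C[i] * h[i] for i in range(n)) + D * x + xi)
        h = [sum(A[i][j] * h[j] for j in range(n)) + B[i] * x for i in range(n)]
    return ys

# Illustrative system with n = 2 hidden states, driven by a unit impulse.
A = [[0.5, 0.0], [0.0, -0.2]]
B = [1.0, 1.0]
C = [1.0, 2.0]
D = 0.5
ys = simulate(A, B, C, D, xs=[1.0, 0.0, 0.0])
print(ys)
```

<p>With an impulse input and no noise, the outputs are $D$, $CB$, $CAB$, and so on, which is the impulse response of the system.</p>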
<p>Although control theory provides a rich set of techniques for identifying and manipulating linear systems, maximum likelihood estimation with stochastic gradient descent remains a popular heuristic.</p>
<p>We denote by $\Theta = (A,B,C,D)$ the parameters of the true system. We parametrize our model with $\widehat{\Theta} = (\hat{A},\hat{B},\hat{C},\hat{D})$, and the trained model maintains hidden states $\hat{h}_t$ and outputs $\hat{y}_t$ exactly as in equation (1). For each given example $(x,y) = ((x_1,\dots,x_T), (y_1,\dots, y_T))$, the negative log-likelihood of model $\widehat{\Theta}$ under the Gaussian noise model is, up to scaling and an additive constant,
<script type="math/tex">f(\widehat{\Theta}, (x,y)) = \frac{1}{T}\sum_{t=1}^{T}\left\|y_t-\hat{y}_t\right\|^2</script>. The population risk is defined as the expected value of this loss,</p>
<script type="math/tex; mode=display">f(\widehat{\Theta}) = \mathbb{E}_{(x,y)} \left[f(\widehat{\Theta}, (x,y))\right]</script>
<p>Stochastic gradients of the population risk can be computed in time $O(Tn)$ via back-propagation given random samples. We can therefore directly minimize population risk using stochastic gradient descent. The question is just whether the algorithm actually converges. Even though the state transformations are linear, the objective function we defined is not convex. Luckily, we will see that the objective is still <em>close enough</em> to convex for stochastic gradient to make steady progress towards the global minimum.</p>
<h2 id="hair-dryers-and-quasi-convex-functions">Hair dryers and quasi-convex functions</h2>
<p>Before we go into the math, let’s illustrate the algorithm with a pressing example that we all run into every morning: hair drying. Imagine you have a hair dryer with a <em>low</em> temperature setting and a <em>high</em> temperature setting. Neither setting is ideal. So every morning you switch between the settings frantically in an attempt to modulate to the ideal temperature. Measuring the resulting temperature (red line below) as a function of the input setting (green dots below), the picture you’ll see is something like <a href="https://www.mathworks.com/help/ident/examples/estimating-simple-models-from-real-laboratory-process-data.html?prodcode=ML">this</a>:</p>
<div style="text-align:center;">
<img style="width:800px;" src="/assets/sysid/dryer/dryer-0.svg" />
</div>
<p>You can see that the output temperature is related to the inputs. If you set the temperature to high for long enough, you’ll eventually get a high output temperature. But the system has state. Briefly lowering the temperature has little effect on the outputs. Intuition suggests that these kinds of effects should be captured by a system with two or three hidden states. So, let’s see how SGD would go about finding the parameters of the system. We’ll initialize a system with three hidden states such that before training its predictions are just the inputs of the system. We then run SGD with a fixed learning rate on the same sequence for 400 steps.</p>
<p><!-- begin animation --></p>
<div style="text-align:center;">
<img style="width:800px;" id="imganim" src="/assets/sysid/dryer/dryer-1.svg" onclick="forward_image()" />
</div>
<script type="text/javascript">//<![CDATA[
var images = [
"/assets/sysid/dryer/dryer-1.svg",
"/assets/sysid/dryer/dryer-2.svg",
"/assets/sysid/dryer/dryer-3.svg",
"/assets/sysid/dryer/dryer-4.svg",
"/assets/sysid/dryer/dryer-5.svg",
"/assets/sysid/dryer/dryer-6.svg",
"/assets/sysid/dryer/dryer-7.svg",
"/assets/sysid/dryer/dryer-8.svg",
]
var iC = 0
function forward_image(){
iC = iC + 1;
document.getElementById('imganim').src = images[iC%8];
document.getElementById('counter').textContent = 50* (iC%8);
}
//]]>
</script>
<p><!-- end animation --></p>
<p><em>The blue line shows the predictions of SGD after <span style="font-family:monospace;"><span id="counter">0</span>/400</span> gradient updates. Click to advance.</em></p>
<p>Evidently, gradient descent converges just fine on this example. Let’s look at the hair dryer objective function along the line segment between two random points in the domain.</p>
<div style="text-align:center;">
<img src="/assets/sysid/dryer-segment.svg" />
</div>
<p>The function is clearly not convex, but it doesn’t look too bad either. In particular, from the picture, it could be that the objective function is <em>quasi-convex</em>:</p>
<blockquote>
<p><strong>Definition:</strong> For $\tau > 0$, a function $f(\theta)$ is $\tau$-quasi-convex with respect to a global minimum $\theta ^ * $ if for every $\theta$,
<script type="math/tex">\langle \nabla f(\theta), \theta - \theta^* \rangle \ge \tau (f(\theta)-f(\theta^*)).</script></p>
</blockquote>
<p>Intuitively, quasi-convexity states that the descent direction $-\nabla f(\theta)$ is positively correlated with the ideal moving direction $\theta^* -\theta$. This implies that the potential function $\left\|\theta-\theta ^ * \right\|^2$ decreases in expectation at each step of stochastic gradient descent. This observation plugs nicely into the standard SGD analysis, leading to the following result:</p>
<blockquote>
<p><strong>Proposition:</strong> (informal) If the population risk $f(\theta)$ is $\tau$-quasi-convex, then stochastic gradient descent (with fresh samples at each iteration and a proper learning rate) converges to a point $\theta_K$ in $K$ iterations with error bounded by
$ f(\theta_K) - f(\theta^*) \leq O(1/(\tau \sqrt{K}))$.</p>
</blockquote>
<p>The key challenge for us is to understand under what conditions we can prove that the population risk objective is in fact quasi-convex. This requires some background.</p>
<h2 id="control-theory-polynomial-roots-and-pac-man">Control theory, polynomial roots, and Pac-Man</h2>
<p>A linear dynamical system $(A,B,C,D)$ is equivalent to the system $(TAT^{-1}, TB, CT^{-1}, D)$ for any invertible matrix $T$ in terms of the behavior of the outputs. A little thought shows therefore that in its unrestricted parameterization the objective function cannot have a unique optimum. A common way of removing this redundancy is to impose a canonical form. Almost all non-degenerate systems admit the <em>controllable canonical form</em>, defined as</p>
<script type="math/tex; mode=display">% <![CDATA[
A\; = \;
\left[ \begin{array}{ccccc} 0 & 1 & 0 & \cdots & 0 \newline 0 & 0 & 1 & \cdots & 0 \newline
\vdots & \vdots & \vdots & \ddots & \vdots \newline 0 & 0 & 0 & \cdots & 1 \newline
-a_n & -a_{n-1} & -a_{n-2} & \cdots & -a_1 \end{array} \right]
\qquad
B = \left[ \begin{array}{c} 0\newline 0 \newline\vdots \newline 0 \newline 1 \end{array} \right] %]]></script>
<script type="math/tex; mode=display">% <![CDATA[
C\;~= \;
\left[ \begin{array}{ccccc} c_1~~~~& c_2~~~~ & c_3~~~~& ~~\cdots\cdots~~~~& c_n \end{array} \right]
\qquad
D =~~ \left[ \begin{array}{c} d\end{array} \right] %]]></script>
<p>We will also parametrize our training model using these forms. One of its nice properties is that the coefficients of the characteristic polynomial of the <em>state transition matrix</em> $A$ can be read off from the last row of $A$. That is,
<script type="math/tex">\det(zI-A) = p_a(z) := z^n+a_1z^{n-1}+\dots + a_n.</script></p>
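<p>This property is easy to check numerically. The sketch below builds $A$ from a coefficient vector and compares its characteristic polynomial with $p_a$; the helper name <code>companion</code> and the example coefficients are assumptions made for illustration.</p>

```python
import numpy as np

def companion(a):
    """Controllable canonical form A for coefficients a = (a_1, ..., a_n)."""
    a = np.asarray(a, dtype=float)
    n = len(a)
    A = np.zeros((n, n))
    A[np.arange(n - 1), np.arange(1, n)] = 1.0  # ones on the superdiagonal
    A[-1, :] = -a[::-1]                          # last row: -a_n, ..., -a_1
    return A

a = [0.3, -0.2, 0.1]
A = companion(a)
# For a square matrix, np.poly returns the coefficients of det(zI - A),
# leading coefficient first, so we expect [1, a_1, a_2, a_3].
char_poly = np.poly(A)
```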
<p>Even in controllable canonical form, it still seems rather difficult to learn arbitrary linear dynamical systems. A natural restriction would be <em>stability</em>, that is, to require that the eigenvalues of $A$ are all bounded in absolute value by $1$. Equivalently, the roots of the characteristic polynomial should all be contained in the complex unit disc. Without stability, the state of the system could blow up exponentially, making robust learning difficult. But the set of all stable systems forms a non-convex domain. It seems daunting to guarantee that stochastic gradient descent would converge from an arbitrary starting point in this domain without ever leaving the domain.</p>
<p>We will therefore impose a stronger restriction on the roots of the characteristic polynomial. We call this the Pac-Man condition. You can think of it as a strengthening of stability.</p>
<blockquote>
<p><strong>Pac-Man condition</strong>: A linear dynamical system in controllable canonical form satisfies the Pac-Man condition if the coefficient vector $a$ defining the state transition matrix satisfies
<script type="math/tex">|Re(q_a(z))| > |Im(q_a(z))|</script> for all complex numbers $z$ of modulus $|z| = 1$, where $q_a(z) = p_a(z)/z^n = 1+a_1z^{-1}+\dots + a_nz^{-n}$.</p>
</blockquote>
<div style="text-align:center;">
<img style="width:350px;margin-bottom:50px;" src="/assets/sysid/pacman.png" />
<img style="width:400px;" src="/assets/sysid/trace-degree4.png" />
</div>
<p><em>Above, we illustrate this condition for a degree 4 system plotting the value of $q_a(z)$ on complex plane for all complex numbers $z$ on the unit circle.</em></p>
<p>We note that the Pac-Man condition is satisfied by vectors $a$ with $\|a\|_1\le \sqrt{2}/2$. Moreover, if $a$ is a random Gaussian vector with expected $\ell_2$ norm bounded by $o(1/\sqrt{\log n})$, then it will satisfy the Pac-Man condition with probability $1-o(1)$. Roughly speaking, the assumption requires that the roots of the characteristic polynomial $p_a(z)$ be relatively dispersed inside the unit circle.</p>
<p>The Pac-Man condition has three important implications:</p>
<ol>
<li>
<p>It implies via <a href="https://en.wikipedia.org/wiki/Rouch%C3%A9%27s_theorem">Rouché’s theorem</a> that the spectral radius of $A$ is smaller than 1 and therefore ensures stability of the system.</p>
</li>
<li>
<p>The vectors satisfying it form a convex set in $\mathbb{R}^n$.</p>
</li>
<li>
<p>Finally, it ensures that the objective function is <em>quasi-convex</em>.</p>
</li>
</ol>
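<p>The condition and its first implication can be probed numerically. The sketch below evaluates $q_a$ on a finite grid of the unit circle, so a positive answer is only evidence, not a proof; the function names and the grid size are assumptions made for illustration.</p>

```python
import numpy as np

def satisfies_pacman(a, num_points=4096):
    """Check |Re q_a(z)| > |Im q_a(z)| on a grid of points z with |z| = 1."""
    a = np.asarray(a, dtype=float)
    z = np.exp(1j * np.linspace(0.0, 2.0 * np.pi, num_points, endpoint=False))
    powers = np.arange(1, len(a) + 1)
    # q_a(z) = 1 + a_1 z^{-1} + ... + a_n z^{-n}
    q = 1.0 + (z[:, None] ** (-powers[None, :])) @ a
    return bool(np.all(np.abs(q.real) > np.abs(q.imag)))

def spectral_radius(a):
    """Largest modulus among the roots of p_a(z) = z^n + a_1 z^{n-1} + ... + a_n."""
    coeffs = np.concatenate(([1.0], np.asarray(a, dtype=float)))
    return float(np.max(np.abs(np.roots(coeffs))))

small = [0.3, -0.2, 0.1]   # l1 norm 0.6, below sqrt(2)/2
unstable = [-1.5]          # p_a(z) = z - 1.5 has its root outside the unit disc
pacman_small = satisfies_pacman(small)
pacman_unstable = satisfies_pacman(unstable)
radius_small = spectral_radius(small)
```

<p>The small-norm vector passes the grid check and has all roots inside the unit disc, while the unstable example fails the check, for instance at $z = i$.</p>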
<h2 id="main-result">Main result</h2>
<p>Relying on the Pac-Man condition, we can show:</p>
<blockquote>
<p><strong>Main theorem (Hardt, Ma, Recht, 2016)</strong>: Under the Pac-Man condition, the projected gradient descent algorithm, given $N$ sample sequences of length $T$, returns parameters $\widehat{\Theta}$ with population risk
<script type="math/tex">f(\widehat{\Theta}) \le f(\Theta) + poly(n)/\sqrt{NT}.</script></p>
</blockquote>
<p>The theorem sorts out the right dependence on $N$ and $T$. Even if there is only one sequence, we can learn the system provided that the sequence is long enough. Similarly, even if sequences are really short, we can learn provided that there are enough sequences.</p>
<h2 id="quasi-convexity-in-the-frequency-domain">Quasi-convexity in the frequency domain</h2>
<p>To establish quasi-convexity under the Pac-Man condition, we will first develop an explicit formula for the population risk in frequency domain. In doing so, we assume that $x_1,\dots, x_T$ are pairwise independent with mean 0 and variance 1. We also consider the population risk as $T\rightarrow \infty$ for simplicity in this post.</p>
<p>A simple algebraic manipulation simplifies the population risk with infinite sequence length to</p>
<script type="math/tex; mode=display">\lim_{T \rightarrow \infty} f(\widehat{\Theta}) = (\hat{D}-D)^2 + \sum_{k=0}^{\infty} (\hat{C}\hat{A}^kB-CA^k B)^2.</script>
<p>The first term, $(\hat D - D)^2$ is convex and appears nowhere else. We can safely ignore it and focus on the remaining expression instead, which we call the <em>idealized risk</em>:</p>
<script type="math/tex; mode=display">g(\widehat{\Theta}) = \sum_{k=0}^{\infty} (\hat{C}\hat{A}^kB-CA^k B)^2</script>
<p>To deal with the sequence $\hat{C}\hat{A}^kB$, we take its Fourier transform and obtain that</p>
<script type="math/tex; mode=display">\hat{C}\hat{A}^kB, k\ge 0 ~~~~\longrightarrow ~~~~~~~\widehat{G}_{\lambda} = \frac{\hat{c}_1e^{i(n-1)\lambda}+\dots+ \hat{c}_n}{e^{in\lambda} + \hat{a}_1e^{i(n-1)\lambda}+\dots+\hat{a}_n}, \lambda\in [0,2\pi]</script>
<p>Similarly we take the Fourier transform of $CA^kB$, denoted by $G_{\lambda}$. Then by Parseval’s Theorem, we obtain the following alternative representation of the idealized risk (up to a constant normalization factor),</p>
<script type="math/tex; mode=display">g(\widehat{\Theta}) = \int_{0}^{2\pi} |G_{\lambda}-\widehat{G}_{\lambda}|^2 d\lambda.</script>
<p>Mapping out $G_\lambda$ and $\widehat G_\lambda$ for all $\lambda\in [0, 2\pi]$ gives the following picture:</p>
<div style="text-align:center;">
<img style="width:400px;" src="/assets/sysid/transfer/approx-10.png" onclick="forward_transfer_image()" />
<img style="width:400px;" id="transfer-img" src="/assets/sysid/transfer/approx-00.png" onclick="forward_transfer_image()" />
</div>
<script type="text/javascript">//<![CDATA[
var transfer_images = [
"/assets/sysid/transfer/approx-00.png",
"/assets/sysid/transfer/approx-01.png",
"/assets/sysid/transfer/approx-02.png",
"/assets/sysid/transfer/approx-03.png",
"/assets/sysid/transfer/approx-04.png",
"/assets/sysid/transfer/approx-05.png",
"/assets/sysid/transfer/approx-06.png",
"/assets/sysid/transfer/approx-07.png",
"/assets/sysid/transfer/approx-08.png",
"/assets/sysid/transfer/approx-09.png",
"/assets/sysid/transfer/approx-10.png",
]
var iA = 0
function forward_transfer_image(){
iA = iA + 1;
document.getElementById('transfer-img').src = transfer_images[iA%11];
document.getElementById('transfer-counter').textContent = (iA%11);
}
//]]>
</script>
<p><em>Left: Target transfer function $G$. Right: Approximation $\widehat G$ at step <span style="font-family:monospace" id="transfer-counter">0</span>/10. Click to advance.</em></p>
<p>Given this pretty representation of the idealized risk objective, we can finally prove our main lemma.</p>
<blockquote>
<p><strong>Lemma:</strong> Suppose $\Theta$ satisfies the Pac-Man condition. Then,
for every $0\le \lambda\le 2\pi$, $|G_{\lambda}-\widehat{G}_{\lambda}|^2$,
as a function of $\hat{A},\hat{C}$ is quasi-convex in the Pac-Man region.</p>
</blockquote>
<p>The lemma reduces to the following simple claim.</p>
<blockquote>
<p><strong>Claim:</strong> The function $h(\hat{u},\hat{v}) = |\hat{u}/\hat{v} - u/v|^2$ is quasi-convex in the region where $Re(\hat{v}/v) > 0$.</p>
</blockquote>
<p>The proof simply involves computing the gradients and checking the conditions for quasi-convexity by elementary algebra. We omit a formal proof, but instead show a plot of the function $h(\hat{u}, \hat{v}) = (\hat{u}/\hat{v}- 1)^2$ over the reals:</p>
<p><!-- begin animation --></p>
<div style="text-align:center;">
<img style="height:600px" id="3dplot-img" src="/assets/sysid/3dplot/3dplot-30.jpg" onclick="forward_3dplot_image()" />
<p style="text-align:center;"> Click to rotate.</p>
</div>
<script type="text/javascript">//<![CDATA[
var plot3d_images = [
"/assets/sysid/3dplot/3dplot-0.jpg",
"/assets/sysid/3dplot/3dplot-10.jpg",
"/assets/sysid/3dplot/3dplot-20.jpg",
"/assets/sysid/3dplot/3dplot-30.jpg",
"/assets/sysid/3dplot/3dplot-40.jpg",
"/assets/sysid/3dplot/3dplot-50.jpg",
"/assets/sysid/3dplot/3dplot-60.jpg",
"/assets/sysid/3dplot/3dplot-70.jpg",
"/assets/sysid/3dplot/3dplot-80.jpg",
"/assets/sysid/3dplot/3dplot-90.jpg",
]
var iB = 3
var inc_sign = 1
function forward_3dplot_image(){
iB = iB + inc_sign;
if (iB == 9) {
inc_sign = -1;
}
if (iB == 0) {
inc_sign = 1;
}
document.getElementById('3dplot-img').src = plot3d_images[iB];
}
//]]>
</script>
<p><!-- end animation --></p>
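<p>For the real-valued function plotted above (that is, $u = v = 1$), the quasi-convexity inequality can be checked directly. Elementary algebra gives $\langle \nabla h(\hat u,\hat v), (\hat u - 1, \hat v - 1)\rangle = \frac{2}{\hat v}\, h(\hat u,\hat v)$, so the inequality holds with $\tau = 2/\hat v$ wherever $\hat v > 0$. The sketch below confirms this on random points; the sampling ranges and the helper names are assumptions made for illustration.</p>

```python
import numpy as np

def h(u_hat, v_hat):
    """h(u_hat, v_hat) = (u_hat / v_hat - 1)^2, the claim with u = v = 1."""
    return (u_hat / v_hat - 1.0) ** 2

def grad_h(u_hat, v_hat):
    """Gradient of h with respect to (u_hat, v_hat)."""
    r = u_hat / v_hat
    return np.array([2.0 * (r - 1.0) / v_hat,
                     -2.0 * (r - 1.0) * u_hat / v_hat**2])

def check_claim(num_samples=1000, seed=0):
    """Return (smallest inner product seen, largest deviation from 2*h/v_hat)."""
    rng = np.random.default_rng(seed)
    target = np.array([1.0, 1.0])        # (u, v), where h attains its minimum 0
    smallest, worst_gap = np.inf, 0.0
    for _ in range(num_samples):
        u_hat = rng.uniform(-3.0, 3.0)
        v_hat = rng.uniform(0.1, 3.0)    # stay in the region v_hat / v > 0
        inner = grad_h(u_hat, v_hat) @ (np.array([u_hat, v_hat]) - target)
        smallest = min(smallest, inner)
        worst_gap = max(worst_gap, abs(inner - 2.0 * h(u_hat, v_hat) / v_hat))
    return smallest, worst_gap

smallest, worst_gap = check_claim()
```

<p>The inner product never goes negative in this region, which is the real-valued content of the claim; for $\hat v < 0$ the same identity shows the inequality fails.</p>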
<p>To see how the lemma follows from the previous claim, we note that quasi-convexity is preserved under composition with any linear transformation. Specifically, if $h(z)$ is quasi-convex, then $h(Rz)$ is also quasi-convex for any linear map $R$. So, consider the linear map:</p>
<script type="math/tex; mode=display">(\hat{a},\hat{c})\mapsto (\hat u, \hat v) = (\hat{c}_1e^{i(n-1)\lambda}+\dots+ \hat{c}_n, e^{in\lambda}
+\hat{a}_1e^{i(n-1)\lambda}+\dots+\hat{a}_n)</script>
<p>With this linear transformation, our simple claim about a bivariate function extends to show that $|G_{\lambda}-\widehat{G}_{\lambda}|^2$ is quasi-convex when $Re(\hat{v}/v) > 0$. In particular, when $\hat{a}$ and $a$ both satisfy the Pac-Man condition, then $\hat{v}$ and $v$ both reside in the 90 degree wedge. Therefore the angle between them is smaller than 90 degrees. This implies that $Re(\hat{v}/v) > 0$.</p>
<h2 id="conclusion">Conclusion</h2>
<p>We saw conditions under which stochastic gradient descent successfully learns a linear dynamical system. In <a href="https://arxiv.org/abs/1609.05191">our paper</a>, we further show that allowing our learned system to have more parameters than the target system makes the problem dramatically easier. In particular, at the expense of slight over-parameterization we can weaken the Pac-Man condition to a mild separation condition on the roots of the characteristic polynomial. This is consistent with empirical observations both in machine learning and control theory that highlight the effectiveness of additional model parameters.</p>
<p>More broadly, we hope that our techniques will be a first stepping stone toward a better theoretical understanding of recurrent neural networks.</p>
Thu, 13 Oct 2016 03:00:00 -0700
http://offconvex.github.io/2016/10/13/gradient-descent-learns-dynamical-systems/
http://offconvex.github.io/2016/10/13/gradient-descent-learns-dynamical-systems/Linear algebraic structure of word meanings<p>Word embeddings capture the meaning of a word using a low-dimensional vector and are ubiquitous in natural language processing (NLP). (See my earlier <a href="http://www.offconvex.org/2015/12/12/word-embeddings-1/">post 1</a>
and <a href="http://www.offconvex.org/2016/02/14/word-embeddings-2/">post2</a>.) It has always been unclear how to interpret the embedding when the word in question is <em>polysemous,</em> that is, has multiple senses. For example, <em>tie</em> can mean an article of clothing, a drawn sports match, and a physical action.</p>
<p>Polysemy is an important issue in NLP and much work relies upon <a href="https://wordnet.princeton.edu/">WordNet</a>, a hand-constructed repository of word senses and their interrelationships. Unfortunately, good WordNets do not exist for most languages, and even the one in English is believed to be rather incomplete. Thus some effort has been spent on methods to find different senses of words.</p>
<p>In this post I will talk about <a href="https://arxiv.org/abs/1601.03764">my joint work with Li, Liang, Ma, Risteski</a> which shows that actually word senses are easily accessible in many current word embeddings. This goes against conventional wisdom in NLP, which is that <em>of course</em>, word embeddings do not suffice to capture polysemy since they use a single vector to represent the word, regardless of whether the word has one sense, or a dozen. Our work shows that major senses of the word lie in linear superposition within the embedding, and are extractable using sparse coding.</p>
<p>This post uses embeddings constructed using our method and the Wikipedia corpus, but similar techniques also apply (with some loss in precision) to other embeddings described in <a href="http://www.offconvex.org/2015/12/12/word-embeddings-1/">post 1</a> such as word2vec, GloVe, or even the decades-old PMI embedding.</p>
<h2 id="a-surprising-experiment">A surprising experiment</h2>
<p>Take the viewpoint –simplistic yet instructive– that a polysemous word like <em>tie</em> is a single lexical token that represents unrelated words <em>tie1</em>, <em>tie2</em>, …
Here is a surprising experiment that suggests that the embedding for <em>tie</em> should be approximately a weighted sum of the (hypothetical) embeddings of <em>tie1</em>, <em>tie2</em>, …</p>
<blockquote>
<p>Take two random words $w_1, w_2$. Combine them into an artificial polysemous word $w_{new}$ by replacing every occurrence of $w_1$ or $w_2$ in the corpus by $w_{new}.$ Next, compute an embedding for $w_{new}$ using the same embedding method while deleting embeddings for $w_1, w_2$ but preserving the embeddings for all other words. Compare the embedding $v_{w_{new}}$ to linear combinations of $v_{w_1}$ and
$v_{w_2}$.</p>
</blockquote>
<p>Repeating this experiment with a wide range of values for the ratio $r$ between the frequencies of $w_1$ and $w_2$, we find that $v_{w_{new}}$ lies close to the subspace spanned by $v_{w_1}$ and $v_{w_2}$: the cosine of its angle with the subspace is on average $0.97$ with standard deviation $0.02$. Thus $v_{w_{new}} \approx \alpha v_{w_1} + \beta v_{w_2}$.
We find that $\alpha \approx 1$ whereas $\beta \approx 1- c\log_{10} r$
for some constant $c\approx 0.5$. (Note this formula is meaningful when the frequency ratio $r$ is not too large, i.e. when $ r < 10^{1/c} \approx 100$.) Thanks to this logarithm, the infrequent sense is not swamped out in the embedding, even if it is 50 times less frequent than the dominant sense. This is an important reason behind the success of our method for extracting word senses.</p>
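<p>A quick calculation shows how slowly the weight of the rarer word decays. The sketch takes the logarithm to be base 10, which is what the threshold $10^{1/c}$ suggests, and uses the rounded constant $c = 0.5$; the function name <code>beta</code> is hypothetical.</p>

```python
import math

def beta(r, c=0.5):
    """Empirical weight of the rarer word: beta ~ 1 - c * log10(r)."""
    return 1.0 - c * math.log10(r)

# Weight of the rarer word for several frequency ratios r.
weights = {r: beta(r) for r in (1, 10, 50, 100)}
```

<p>Under these assumptions a tenfold frequency gap only halves the weight, and the weight reaches zero at a hundredfold gap.</p>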
<p>This experiment –to which we were led by our theoretical investigations– is very surprising
because the embedding is the solution to a complicated, nonconvex optimization, yet it behaves in such a striking linear way. You can read our paper for an intuitive explanation using our theoretical model from <a href="http://www.offconvex.org/2016/02/14/word-embeddings-2/">post2</a>.</p>
<h2 id="extracting-word-senses-from-embeddings">Extracting word senses from embeddings</h2>
<p>The above experiment suggests that</p>
<script type="math/tex; mode=display">v_{tie} \approx \alpha_1 \cdot v_{ tie1} + \alpha_2 \cdot v_{tie2} + \alpha_3 \cdot v_{tie3} +\cdots \qquad (1)</script>
<p>but this alone is insufficient to mathematically pin down the senses, since $v_{tie}$ can be expressed in infinitely many ways as such a combination. To pin down the senses we will interrelate the senses of different words —for example, relate the “article of clothing” sense <em>tie1</em> with <em>shoe, jacket</em> etc.</p>
<p>The word senses <em>tie1, tie2,..</em> correspond to “different things being talked about” —in other words, different word distributions occurring around <em>tie</em>.
Now remember that <a href="http://128.84.21.199/abs/1502.03520v6">our earlier paper</a> described in
<a href="http://www.offconvex.org/2016/02/14/word-embeddings-2/">post2</a> gives an interpretation of “what’s being talked about”: it is called <em>discourse</em> and
it is represented by a unit vector in the embedding space. In particular, the theoretical model
of <a href="http://www.offconvex.org/2016/02/14/word-embeddings-2/">post2</a> imagines a text corpus as being generated by a random walk on
discourse vectors. When the walk is at a discourse $c_t$ at time $t$, it outputs a few words using a loglinear distribution:</p>
<script type="math/tex; mode=display">\Pr[w~\mbox{emitted at time $t$}~|~c_t] \propto \exp(c_t\cdot v_w). \qquad (2)</script>
<p>One imagines there exists a “clothing” discourse that has high probability of outputting the <em>tie1</em> sense, and also of outputting related words such as <em>shoe, jacket,</em> etc.
Similarly there may be a “games/matches” discourse that has high probability of outputting <em>tie2</em> as well as <em>team, score</em> etc.</p>
<p>By equation (2) the probability of being output by a discourse is determined by the
inner product, so one expects that the vector for “clothing” discourse has high inner product with all of <em>shoe, jacket, tie1</em> etc., and thus can stand as surrogate for $v_{tie1}$ in expression (1)! This motivates the following global optimization:</p>
<blockquote>
<p>Given word vectors in $\Re^d$, totaling about $60,000$ in this case, a sparsity parameter $k$,
and an upper bound $m$, find a set of unit vectors $A_1, A_2, \ldots, A_m$ such that
<script type="math/tex">v_w = \sum_{j=1}^m\alpha_{w,j}A_j + \eta_w \qquad (3)</script>
where at most $k$ of the coefficients $\alpha_{w,1},\dots,\alpha_{w,m}$ are nonzero (so-called <em>hard sparsity constraint</em>), and $\eta_w$ is a noise vector.</p>
</blockquote>
<p>Here $A_1, \ldots A_m$ represent important discourses in the corpus, which
we refer to as <em>atoms of discourse.</em></p>
<p>Optimization (3) is a surrogate for the desired expansion of $v_{tie}$ in (1) because one can hope that the atoms of discourse will contain atoms corresponding to <em>clothing</em>, <em>sports matches</em> etc. that will have high inner product (close to $1$) with <em>tie1,</em> <em>tie2</em> respectively. Furthermore, restricting $m$ to be much smaller than the number of words ensures that each atom needs to be used for multiple words, e.g., reuse the “clothing” atom
for <em>shoes</em>, <em>jacket</em> etc. as well as for <em>tie</em>.</p>
<p>Both $A_j$’s and $\alpha_{w,j}$’s are unknowns in this optimization. This is nothing but <em>sparse coding,</em> useful in neuroscience, image processing, computer vision, etc. It is nonconvex and computationally NP-hard in the worst case, but can be solved quite efficiently in practice using something called the k-SVD algorithm described in <a href="http://www.cs.technion.ac.il/~elad/publications/others/PCMI2010-Elad.pdf">Elad’s survey, lecture 4</a>. We solved this problem with sparsity
$k=5$ and using $m$ about $2000$. (Experimental details are in the paper. Also, some theoretical
analysis of such an algorithm is possible; see this <a href="http://www.offconvex.org/2016/05/08/almostconvexitySATM/">earlier post</a>.)</p>
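<p>As a rough illustration of the hard sparsity constraint in (3), the sketch below solves only the easier half of the problem: the atoms are held fixed and the $k$ nonzero coefficients for one vector are found greedily by orthogonal matching pursuit. This is not the k-SVD pipeline used in the paper, and the toy dictionary with orthonormal atoms (where greedy selection is exact) is an assumption made so that the example is easy to verify.</p>

```python
import numpy as np

def omp(atoms, v, k):
    """Greedy k-sparse fit of v by columns of `atoms` (orthogonal matching pursuit)."""
    residual = v.copy()
    support = []
    coef = np.zeros(0)
    for _ in range(k):
        scores = np.abs(atoms.T @ residual)
        if support:
            scores[support] = -np.inf          # never pick an atom twice
        support.append(int(np.argmax(scores)))
        # Refit all selected coefficients jointly by least squares.
        coef, *_ = np.linalg.lstsq(atoms[:, support], v, rcond=None)
        residual = v - atoms[:, support] @ coef
    alpha = np.zeros(atoms.shape[1])
    alpha[support] = coef
    return alpha

rng = np.random.default_rng(0)
# Toy dictionary: 20 orthonormal atoms in R^50 (columns of Q from a QR factorization).
atoms, _ = np.linalg.qr(rng.standard_normal((50, 20)))
v = 2.0 * atoms[:, 3] - 1.5 * atoms[:, 7]      # a vector built from two atoms
alpha = omp(atoms, v, k=2)
```

<p>In the real problem the atoms are unknown and not orthogonal, which is why an alternating method such as k-SVD is needed to learn the atoms and the coefficients together.</p>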
<h1 id="experimental-results">Experimental Results</h1>
<p>Each discourse atom defines via (2) a distribution on words, which due to the exponential appearing in (2) strongly favors words whose embeddings have a larger inner product with it. In practice, this distribution is quite concentrated on as few as 50-100 words, and the “meaning” of a discourse atom can be roughly determined by looking at a few nearby words. This is how we visualize atoms in the figures below. The first figure gives a few representative atoms of discourse.</p>
<p style="text-align:center;">
<img src="http://www.cs.princeton.edu/~arora/pubs/discourseatoms.jpg" alt="A few of the 2000 atoms of discourse found" />
</p>
<p>And here are the discourse atoms used to represent two polysemous words, <em>tie</em> and <em>spring</em>:</p>
<p style="text-align:center;">
<img src="http://www.cs.princeton.edu/~arora/pubs/atomspolysemy.jpg" alt="Discourse atoms expressing the words tie and spring." />
</p>
<p>You can see that the discourse atoms do correspond to senses of these words.</p>
<p>Finally, we also have a technique that, given a target word, generates representative sentences according to its various senses as detected by the algorithm. Below are the sentences returned for
<em>ring.</em> (N.B. The mathematical meaning was missing in WordNet but was picked up by our method.)</p>
<p style="text-align:center;">
<img src="http://www.cs.princeton.edu/~arora/pubs/repsentences.jpg" alt="Representative sentences for different senses of the word ring." />
</p>
<h2 id="a-new-testbed-for-testing-comprehension-of-word-senses">A new testbed for testing comprehension of word senses</h2>
<p>Many tests have been proposed to test an algorithm’s grasp of word senses. They often involve
hard-to-understand metrics such as distance in WordNet, or are tied to performance on specific applications like web search.</p>
<p>We propose a new simple test –inspired by word-intrusion tests for topic coherence
due to <a href="https://www.umiacs.umd.edu/~jbg/docs/nips2009-rtl.pdf">Chang et al 2009</a>– which is easy to understand and can also be administered to humans.</p>
<p>We created a testbed using 200 polysemous words and their 704 senses according to WordNet. Each “sense” is represented by a set of 8 related words; these were collected from WordNet and online dictionaries by college students who were told to identify most relevant other words occurring in the online definitions of this word sense as well as in the accompanying illustrative sentences. These 8 words are considered as <em>ground truth</em> representation of the word sense: e.g., for the “tool/weapon” sense of <em>axe</em> they were: <em>handle, harvest, cutting, split, tool, wood, battle, chop.</em></p>
<blockquote>
<p><strong>Police line-up test for word senses:</strong> the algorithm is given a random one of these 200 polysemous words and a set of $m$ senses which contain the true senses for the word as well as some <em>distractors,</em> which are randomly picked senses from other words in the testbed. The test taker has to identify the word’s true senses among these $m$ senses.</p>
</blockquote>
<p>As usual, accuracy is measured using <em>precision</em> (what fraction of the algorithm/human’s guesses
were correct) and <em>recall</em> (what fraction of the correct senses were among the guesses).</p>
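<p>For a single test word, the two scores can be computed as in the sketch below. The function name and the sense labels are hypothetical; the paper’s evaluation code may aggregate over words differently.</p>

```python
def precision_recall(guessed, true_senses):
    """Precision and recall of a set of guessed senses against the true senses."""
    guessed, true_senses = set(guessed), set(true_senses)
    correct = len(guessed & true_senses)
    precision = correct / len(guessed) if guessed else 0.0
    recall = correct / len(true_senses) if true_senses else 0.0
    return precision, recall

# Four guesses, two of which are among the word's three true senses.
precision, recall = precision_recall({"s1", "s2", "s3", "s4"}, {"s1", "s2", "s9"})
```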
<p>For $m=20$ and $k=4$, our algorithm succeeds with precision $63\%$ and recall $70\%$, and performance remains reasonable for $m=50$. We also administered the test to a group of grad students.
Native English speakers had precision/recall scores in the $75$ to $90$ percent range.
Non-native speakers had scores roughly similar to our algorithm.</p>
<p>Our algorithm works something like this: If $w$ is the target word, then take all discourse atoms
computed for that word, and compute a certain similarity score between each atom and each of the $m$ senses, where the words in the senses are represented by their word vectors. (Details are in the paper.)</p>
<h2 id="takeaways">Takeaways</h2>
<p>Word embeddings have been useful in a host of other settings, and now it appears that
they also can easily yield different senses of a polysemous word. We have some subsequent applications of these ideas to other previously studied settings, including topic models, creating
WordNets for other languages, and understanding the semantic content of fMRI brain measurements. I’ll describe some of them in future posts.</p>
Sun, 10 Jul 2016 03:30:00 -0700
http://offconvex.github.io/2016/07/10/embeddingspolysemy/
http://offconvex.github.io/2016/07/10/embeddingspolysemy/A Framework for analysing Non-Convex Optimization<p>Previously <a href="http://www.offconvex.org/2016/03/22/saddlepoints/">Rong’s post</a> and <a href="http://www.offconvex.org/2016/03/24/saddles-again/">Ben’s post</a> show that (noisy) gradient descent can converge to a <em>local</em> minimum of a non-convex function, and in (large) polynomial time (<a href="http://arxiv.org/abs/1503.02101">Ge et al.’15</a>). This post
describes a simple framework that can sometimes be used to design/analyse algorithms that can quickly reach an approximate <em>global</em> optimum of the nonconvex function. The framework —which was used to analyse alternating minimization algorithms for sparse coding in <a href="http://arxiv.org/abs/1503.00778">our COLT’15 paper with Ge and Moitra</a>—generalizes many other sufficient conditions for convergence (usually gradient-based) that were formulated in recent papers.</p>
<h2 id="measuring-progress-a-simple-lyapunov-function">Measuring progress: a simple Lyapunov function</h2>
<p>Let $f$ be the function being optimized and suppose the algorithm produces a sequence of candidate solutions $z_1,\dots,z_k,\dots,$ via some update rule
<script type="math/tex">z_{k+1} = z_k - \eta g_k.</script></p>
<p>This can be seen as a dynamical system (see <a href="http://www.offconvex.org/2016/04/04/markov-chains-dynamical-systems/">Nisheeth’s</a> and <a href="http://www.offconvex.org/2016/03/24/saddles-again/">Ben’s</a> posts related to dynamical systems).
Our goal is to show that this sequence converges to (or gets close to) a target point $z^* $, which is a global optimum of $f$. Of course, the algorithm doesn’t know $z^*$.</p>
<p>To design a framework for proving convergence it helps to indulge in daydreaming/wishful thinking: what property would we <em>like</em> the updates to have, to simplify our job?</p>
<p>A natural idea is to define a Lyapunov function $V(z)$ and show that: (i) $V(z_k)$ decreases to $0$ (at a certain speed) as $k\rightarrow \infty$; (ii) when $V(z)$ is close to $0$, then $z$ is close to $z^* $. (Aside: One can imagine more complicated ways of proving convergence, e.g., show $V(z_k)$ ultimately goes to $0$ even though it doesn’t decrease in every step. Nesterov’s acceleration method uses such a progress measure.)</p>
<p>Consider possibly the most trivial Lyapunov function, the (squared) distance to the target point, $V(z) = \|z-z^*\|^2$. This is also used in the standard convergence proof for convex functions, since moving in the opposite direction to the gradient can be shown to reduce this measure $V()$.</p>
<p>Even when the function is nonconvex, there always <em>exist</em> update directions that reduce this $V()$ (though finding them may not be easy). Simple algebraic manipulation shows that when the learning rate $\eta$ is small enough, then for $V(z_{k+1}) \le V(z_k)$, it is necessary and sufficient to have $\langle g_k, z_k-z^* \rangle \ge 0$.</p>
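<p>To spell out the algebra (a short sketch that uses only the update rule $z_{k+1} = z_k - \eta g_k$ and the definition of $V$), expand one step:</p>
<script type="math/tex; mode=display">V(z_{k+1}) = \|z_k - z^* - \eta g_k\|^2 = V(z_k) - 2\eta \langle g_k, z_k-z^* \rangle + \eta^2 \|g_k\|^2.</script>
<p>The term linear in $\eta$ dominates for small $\eta$, so the sign of $\langle g_k, z_k-z^* \rangle$ decides whether $V$ can decrease. When the inner product is strictly positive, any $0 < \eta \le 2\langle g_k, z_k-z^* \rangle/\|g_k\|^2$ gives $V(z_{k+1}) \le V(z_k)$.</p>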
<p><img src="http://www.cs.princeton.edu/~tengyu/angle_for_blog_post.png" alt="correlation condition" style="float:left" height="240.8" width="260" /> As illustrated in the figure on the left, $z^* - z_k$ is the ideal direction in which we would like to move, and $-g_k$ is the direction in which we actually move. To establish convergence, it suffices to verify that the direction of movement is positively correlated with the desired direction.</p>
<p>To get quantitative bounds on running time, we need to ensure that $V(z_k)$ not only decreases, but does so rapidly. The next condition formalizes this: intuitively speaking it says that $-g_k$ and $z^*-z_k$ make an angle strictly less than 90 degrees.</p>
<blockquote>
<p><strong>Correlation Condition</strong>: The direction $g_k$ is $(\alpha,\beta,\epsilon_k)$-correlated with $ z^* $ if
<script type="math/tex">\langle g_k,z_k-z^* \rangle \ge \alpha \|z_k-z^*\|^2 + \beta \|g_k\|^2 -\epsilon_k</script></p>
</blockquote>
<p>This may look familiar to experts in convex optimization: as a special case, if we let the update direction $g_k$ be the gradient (so that the update moves along the negative gradient), then the condition yields familiar notions such as strong convexity and smoothness. But the condition allows $g_k$ to not be the gradient, and in addition allows the error term $\epsilon_k$, which is necessary in some applications to accommodate non-convexity and/or statistical error.</p>
<p>If the algorithm can at each step find such update directions, then the familiar convergence proof of convex optimization can be modified to show rapid convergence here as well, except the convergence is <em>approximate</em>, to some point in the neighborhood of $z^*$.</p>
<blockquote>
<p><strong>Theorem:</strong> Suppose $g_k$ satisfies the Correlation Condition above for every $k$. Then with learning rate $\eta \le 2\beta$, we have
<script type="math/tex">\| z_k-z^* \|^2 \le (1-\alpha\eta)^k\| z_0-z^* \|^2 + \max_k \epsilon_k/\alpha</script></p>
</blockquote>
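<p>As a sanity check, the theorem can be tried numerically on a toy problem. The sketch below is illustrative only: the name <code>check_theorem</code> and the test problem are assumptions and do not come from the paper. It takes $f(z)=\frac{1}{2}\sum_i d_i z_i^2$ with $z^* = 0$ and lets $g_k$ be the gradient; for curvatures $d_i \in [\mu, L]$ one can verify directly that the Correlation Condition holds with $\alpha = \mu L/(\mu+L)$, $\beta = 1/(\mu+L)$ and $\epsilon_k = 0$.</p>

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def check_theorem(diag, z0, steps):
    """Run z_{k+1} = z_k - eta * g_k on f(z) = 0.5 * sum(diag_i * z_i^2), z* = 0.

    g_k is the gradient. With curvatures in [mu, L] it satisfies the
    correlation condition with alpha = mu*L/(mu+L), beta = 1/(mu+L), eps_k = 0.
    Returns True if both the condition and the theorem's bound hold at
    every iterate z_0, ..., z_steps (up to a small floating-point tolerance).
    """
    mu, L = min(diag), max(diag)
    alpha, beta = mu * L / (mu + L), 1.0 / (mu + L)
    eta = 2.0 * beta  # largest learning rate allowed by the theorem
    z, v0 = list(z0), dot(z0, z0)
    tol = 1e-9 * (1.0 + v0)
    for k in range(steps + 1):
        g = [d * zi for d, zi in zip(diag, z)]
        # Correlation condition: <g, z - z*> >= alpha |z - z*|^2 + beta |g|^2
        if dot(g, z) < alpha * dot(z, z) + beta * dot(g, g) - tol:
            return False
        # Theorem: |z_k - z*|^2 <= (1 - alpha * eta)^k |z_0 - z*|^2
        if dot(z, z) > (1.0 - alpha * eta) ** k * v0 + tol:
            return False
        z = [zi - eta * gi for zi, gi in zip(z, g)]
    return True


print(check_theorem([0.5, 1.0, 2.0, 4.0], [1.0, -2.0, 0.5, 3.0], steps=50))
```

<p>In this toy case the iterates actually contract faster than the stated bound, which is consistent with the theorem being a sufficient condition rather than a tight rate.</p>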
<h3 id="comparison-with-related-conditions">Comparison with related conditions</h3>
<p>As mentioned, the “wishful thinking” approach has been used to identify other
conditions under which specific nonconvex optimizations can be carried out to near-optimality: (<a href="https://arxiv.org/abs/1212.0467">JNS’13</a>, <a href="http://arxiv.org/abs/1312.0925">Hardt’14</a>, <a href="http://arxiv.org/abs/1408.2156">BWY’14</a>, <a href="http://arxiv.org/abs/1407.1065">CLS’15</a>, <a href="http://arxiv.org/abs/1503.00778">AGMM’15</a>, <a href="http://arxiv.org/abs/1411.8003">SL’15</a>, <a href="http://arxiv.org/abs/1505.05114">CC’15</a>, <a href="https://papers.nips.cc/paper/5733-a-nonconvex-optimization-framework-for-low-rank-matrix-estimation">ZWL’15</a>). All of these can be seen as some weakening of convexity (with the exception of the analysis for matrix completion in <a href="http://arxiv.org/abs/1312.0925">Hardt’14</a> which views the updates as noisy power method).</p>
<p>Our condition appears to contain most if not all of these as special cases.</p>
<p>Often the update direction $g_k$ in these papers is related to the gradient. For example using the gradient instead of
$g_k$ in our correlation condition turns it into the “regularity condition” proposed by <a href="http://arxiv.org/abs/1407.1065">CLS’15</a> for analyzing Wirtinger flow algorithm for phase retrieval.
The gradient stability condition in <a href="http://arxiv.org/abs/1408.2156">BWY’14</a> is also a special case, where $g_k$ is required to be close enough to $\nabla h(z_k)$ for some convex $h$ such that $z^* $ is the optimum of $h$. Then since $\nabla h(z_k)$ has angle < 90 degrees with $z_k-z^*$ (which follows from convexity of $h$), it implies that $g_k$ also does.</p>
<p>The advantage of our framework is that it encourages one to think of algorithms where $g_k$ is not the gradient. Thus applying the framework doesn’t require understanding the behavior of the gradient on the entire landscape of the objective function; instead, one needs to understand the update direction (which is under the algorithm designer’s control) at the
sequence of points actually encountered while running the algorithm.</p>
<p>This slight change of perspective may be powerful.</p>
<h2 id="application-to-sparse-coding">Application to Sparse Coding</h2>
<p>A particularly useful situation for applying the framework above is where the objective function has two sets of arguments and it is feasible to optimize one set after fixing the other –leading to the familiar alternating minimization heuristic. Such algorithms are a good example of how one may try to do local-improvement without explicitly following the (full) gradient.
As mentioned, our framework was used to analyse
such alternating minimization for <a href="https://en.wikipedia.org/wiki/Neural_coding#Sparse_coding">sparse coding</a>.</p>
<p>In sparse coding, we are given a set of examples $Y = [y_1,\dots, y_N]\in \mathbb{R}^{d\times N}$, and are asked to find an over-complete basis $A = [a_1,\dots,a_m]$ (where “overcomplete” refers to the setting
$m > d$) so that each example $y_j$ can be expressed as a sparse linear combination of the $a_i$’s. The natural optimization problem with squared loss is therefore</p>
<script type="math/tex; mode=display">\min_{A,\, \textrm{sparse } X} f(A,X), \qquad f(A,X) = \|Y-AX\|_F^2</script>
<p>Here both the objective and the constraint set are not convex.
One could consider using $\ell_1$ regularization as a surrogate for sparsity, but the trouble is that the regularization is neither smooth nor strongly convex, and the standard techniques for dealing with an $\ell_1$ penalty term in convex optimization cannot be easily applied due to non-convexity.</p>
<p>The standard alternating minimization algorithm (a close variant of the one proposed by <a href="http://redwood.psych.cornell.edu/papers/olshausen_field_1997.pdf">Olshausen and Field 1997</a> as a neurally plausible explanation for V1, the human primary visual cortex) is as follows:</p>
<script type="math/tex; mode=display">X_{k+1} \longleftarrow \textrm{threshold}(A_k^{\top}Y)</script>
<script type="math/tex; mode=display">A_{k+1} \longleftarrow A_{k} - \eta \underbrace{\frac{\partial }{\partial A} f(A_k,X_{k+1})}_{G_k}</script>
<p>Here the update for $X$ is the projection pursuit algorithm in sparse recovery (see <a href="http://www.springer.com/us/book/9781441970107">Elad’10</a> for background), which is supposed to give an approximation of the best fit for $X$ given the current $A$.</p>
<p>Sometimes alternating minimization algorithms need careful initialization, but in practice here it suffices to initialize $A_0$ using a random sample of datapoints $y_i$’s.</p>
<p>However, it remains an open problem to analyse convergence using such random initialization; our analysis uses a special starting point $A_0$ found using spectral methods.</p>
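<p>One round of the two updates above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the algorithm as analysed in the paper: "threshold" is taken to be entrywise hard thresholding at a hypothetical level <code>tau</code>, the objective is taken to be $\|Y-AX\|_F^2$ (whose gradient in $A$ is $-2(Y-AX)X^{\top}$), and all helper names are made up for this sketch.</p>

```python
def matmul(P, Q):
    """Product of two matrices stored as lists of rows."""
    return [[sum(p * q for p, q in zip(row, col)) for col in zip(*Q)] for row in P]


def transpose(M):
    return [list(col) for col in zip(*M)]


def hard_threshold(M, tau):
    """Zero out entries with absolute value below tau (assumed thresholding rule)."""
    return [[v if abs(v) >= tau else 0.0 for v in row] for row in M]


def loss(Y, A, X):
    """Squared Frobenius norm of the residual Y - A X."""
    AX = matmul(A, X)
    return sum((y - r) ** 2 for yr, rr in zip(Y, AX) for y, r in zip(yr, rr))


def grad_A(Y, A, X):
    """Gradient of ||Y - A X||_F^2 with respect to A, i.e. -2 (Y - A X) X^T."""
    AX = matmul(A, X)
    R = [[y - r for y, r in zip(yr, rr)] for yr, rr in zip(Y, AX)]
    return [[-2.0 * v for v in row] for row in matmul(R, transpose(X))]


def alternating_step(Y, A, eta, tau):
    """X <- threshold(A^T Y), then one gradient step on A with X held fixed."""
    X = hard_threshold(matmul(transpose(A), Y), tau)
    G = grad_A(Y, A, X)
    A_next = [[a - eta * g for a, g in zip(ar, gr)] for ar, gr in zip(A, G)]
    return A_next, X
```

<p>Note that the step on $A$ only needs the current $X$; nothing in the loop refers to the unknown ground truth.</p>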
<h3 id="applying-our-framework">Applying our framework</h3>
<p>At first glance, the mysterious aspect of our framework is how the algorithm can find an update direction correlated with $z_k -z^* $ without knowing $z^* $. In the context of sparse coding, this comes about as follows: if we assume a probabilistic generative model for the observed data (namely, it was generated using some ground-truth sparse coding) then the alternating minimization automatically comes up with such update directions!</p>
<p>Specifically, we will assume that the data points $y_i$’s are generated using some ground truth dictionary $A^* $
using some ground truth $X^* $ whose columns are iid draws from some suitable distribution.
(One needs to assume some conditions on $A^* , X^* $, which are not important in the sketch below.) Note that the
entries within each column of $X^* $ are <em>not</em> mutually independent, otherwise the problem would be <a href="https://en.wikipedia.org/wiki/Independent_component_analysis">Independent Component Analysis</a>.</p>
<p>In line with our framework, we consider the Lyapunov function $V(A) = \|A-A^* \|_F^2$. Here the Frobenius norm $\|\cdot\|_F$ is also the Euclidean norm of the vectorized version of the matrix. Then our framework implies that to show quick convergence it suffices to verify the following (for some $\alpha,\beta > 0$) for the update direction $G_k$:</p>
<script type="math/tex; mode=display">\langle G_k, A_k-A^* \rangle \ge \alpha \|A_k - A^* \|_F^2 + \beta \|G_k\|_F^2 -\epsilon_k</script>
<p>In <a href="http://arxiv.org/abs/1503.00778">AGMM’15</a> we showed that under certain assumptions on the true dictionary $A^* $ and the true coefficients $X^* $, the above inequality is indeed true with small $\epsilon_k$ and some constants $\alpha,\beta > 0$. The proof is a bit technical but reasonable: the partial gradient $\frac{\partial f}{\partial A}$ has a simple form and therefore $G_k$ has a closed form in $A_k$ and $Y$. Therefore, it boils down to plugging in the form of $G_k$ into the inequality above and simplifying it appropriately. (One also needs the fact that
the starting $A_0$ obtained using spectral methods is somewhat close to $A^* $.)</p>
<p>We hope others will use our framework to analyse other nonconvex problems!</p>
<p><em>(Aside: We hope that readers will leave comments if they know of other frameworks for proving convergence that are not subcases of the above framework.)</em></p>
Sun, 08 May 2016 02:00:00 -0700
http://offconvex.github.io/2016/05/08/almostconvexitySATM/
http://offconvex.github.io/2016/05/08/almostconvexitySATM/Markov Chains Through the Lens of Dynamical Systems: The Case of Evolution<p>In this post, we will see the main technical ideas in the analysis of the mixing time of evolutionary Markov chains introduced in a previous <a href="http://www.offconvex.org/2016/03/07/evolution-markov-chains/">post</a>.
We start by introducing the notion of the <em>expected motion</em> of a stochastic process or a Markov chain.
In the case of a finite population evolutionary Markov chain, the expected motion turns out to be a dynamical system which corresponds to the infinite population evolutionary dynamics with the same parameters.
Surprisingly, we show that the limit sets of this dynamical system govern the mixing time of the Markov chain.
In particular, if the underlying dynamical system has a unique stable fixed point (as in asexual evolution), then the mixing is fast and in the case of multiple stable fixed points (as in sexual evolution), the mixing is slow.
Our viewpoint connects evolutionary Markov chains, <em>nature’s algorithms</em>, with stochastic descent methods, popular in machine learning and optimization, and the readers interested in the latter might benefit from our techniques.</p>
<h2 id="a-quick-recap">A Quick Recap</h2>
<p>Let us recall the parameters of the finite population evolutionary Markov chain (denoted by $\mathcal{M}$) we saw last time.
At any time step, the state of the Markov chain consists of a population of size $N$ where each individual could be one of $m$ types.
The mutation and the fitness matrices are denoted by $Q$ and $A$ respectively.
$X^{(t)}$ captures, after normalization by $N,$ the composition of the population at time $t$.
Thus, $X^{(t)}$ is a point in the $m$-dimensional probability simplex $\Delta_m$.
Since we assumed that $QA>0$, the Markov chain has a stationary distribution $\pi$ over its state space, denoted by $\Omega \subseteq \Delta_m$; the state space has cardinality roughly $N^m$.
Thus, $X^{(t)}$ evolves in $\Delta_m$ and, with time, its distribution converges to $\pi$.
Our goal is to bound the time it takes for this distribution to stabilize, i.e., bound the mixing time of $\mathcal{M}$.</p>
<h2 id="the-expected-motion">The Expected Motion</h2>
<p>As a first step towards understanding the mixing time, let us compute the expectation of $X^{(t+1)}$ for a given $X^{(t)}$.
This function tells us where we expect to be after one time step given the current state; in <a href="http://theory.epfl.ch/vishnoi/Publications_files/PV16.pdf">this</a> paper we refer to this as the <em>expected motion</em> of this Markov chain (and define it formally for all Markov chains towards the end of this post).
An easy calculation shows that, for $\mathcal{M}$,</p>
<script type="math/tex; mode=display">\mathbb{E} \left[ X^{(t+1)} \; \vert \; X^{(t)} \right] = \frac{QA X^{(t)}}{\|QAX^{(t)}\|_1}=: f(X^{(t)}).</script>
<p>This $f$ is the same function that was introduced in the previous post for the <em>infinite</em> population evolutionary dynamics with the same parameters!
Thus, in each time step, the expected motion of the Markov chain is governed by $f$.
Surprisingly, something stronger is true: we can prove (see Section 3.2 <a href="http://theory.epfl.ch/vishnoi/Publications_files/PSVSODA16.pdf">here</a>) that, given some $X^{(t)},$ the point $X^{(t+1)}$ can be equivalently obtained by taking $N$ i.i.d. samples from $f(X^{(t)})$.
In words,</p>
<script type="math/tex; mode=display">f \; \; \mbox{guides} \; \; \mathcal{M}.</script>
<blockquote>
<p>In fact, a moment’s thought tells us that this phenomenon transcends any specific model of evolution.
We can fix <strong>any</strong> dynamical system $g$ over the simplex and define a Markov chain guided by it as follows: If $X^{(t)}$ is the population vector at time $t$, then define $X^{(t+1)}$ as the population vector obtained by taking $N$ i.i.d. (or even correlated) copies from $g(X^{(t)})$. By design, $g$ is the expected motion of this Markov chain.</p>
</blockquote>
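<p>The construction in the box above is easy to simulate. The following is a minimal sketch, with hypothetical names (<code>expected_motion</code>, <code>guided_step</code>) and a single positive matrix <code>M</code> standing in for the product $QA$: one step of the chain draws $N$ i.i.d. types from $g(X^{(t)})$ and records their frequencies.</p>

```python
import random


def expected_motion(M, x):
    """f(x) = M x / ||M x||_1 for an entrywise positive matrix M (stand-in for QA)."""
    y = [sum(m * xi for m, xi in zip(row, x)) for row in M]
    s = sum(y)
    return [v / s for v in y]


def guided_step(g, x, N, rng):
    """One step of the chain guided by g: N i.i.d. draws from g(x), as frequencies."""
    p = g(x)
    draws = rng.choices(range(len(p)), weights=p, k=N)
    counts = [0] * len(p)
    for t in draws:
        counts[t] += 1
    return [c / N for c in counts]


if __name__ == "__main__":
    rng = random.Random(0)
    M = [[0.9, 0.2], [0.1, 0.8]]
    x = [0.5, 0.5]
    g = lambda v: expected_motion(M, v)
    trials = 2000
    avg = sum(guided_step(g, x, 100, rng)[0] for _ in range(trials)) / trials
    # The empirical mean of the next state should be close to g(x)[0].
    print(avg, g(x)[0])
```

<p>Averaging many independent one-step samples from the same state recovers $g(x)$ up to sampling error, which is the sense in which $g$ is the expected motion.</p>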
<h2 id="evolution-on-finite-populations--noisy-evolution-on-infinite-populations">Evolution on Finite Populations = Noisy Evolution on Infinite Populations</h2>
<p>The above observation allows us to view our evolutionary Markov chain as a noisy version of the deterministic, infinite population evolution.
A bit more formally, there are implicitly defined random variables $\zeta_{s}^{(t+1)}$ for $1 \leq s \leq N$ and all $t$, such that</p>
<script type="math/tex; mode=display">X^{(t+1)} = f(X^{(t)}) + \frac{1}{N} \sum_{s=1}^N \zeta_s^{(t+1)}.</script>
<p>Here, $\zeta_s^{(t+1)}$ for $1\leq s \leq N$ is a random vector that corresponds to the <em>error</em> or <em>noise</em> of sample $s$ at the $t$-th time step.
Formally, because $f$ is the expected motion of the Markov chain,
each $\zeta_s^{(t+1)}$ has expectation $0$ conditioned on $X^{(t)}$.
Further, the fact that $f$ guides $\mathcal{M}$ implies that for each $t$, when conditioned on $X^{(t)}$, the vectors $\zeta_{s}^{(t+1)}$ are i.i.d. for $1 \leq s \leq N$.
Without conditioning, we cannot say much about the $\zeta_{s}^{(t)}$s.
However, since we know that the state space of $\mathcal{M}$ lies in the simplex, we can deduce that $\Vert\zeta_s^{(t)}\Vert \leq 2$.
Since, conditioned on $X^{(t)}$, the $\zeta_s^{(t+1)}$s have zero expectation and are independent and bounded, the variance of each coordinate of $\frac{1}{N} \sum_{s=1}^N \zeta_s^{(t+1)}$
(again conditioned on the past) is roughly $1/N$.</p>
<h2 id="connections-to-stochastic-gradient-descent">Connections to Stochastic Gradient Descent</h2>
<p>Now we draw an analogy of the evolutionary Markov chain to an old idea in optimization, stochastic gradient descent or <a href="https://en.wikipedia.org/wiki/Stochastic_gradient_descent">SGD</a>.
However, we will see crucial differences that require the development of new tools.
Recall that in the SGD setting, one is given a function $F$ and the goal is to find a local minimum of $F.$
The gradient descent method moves from the current point $x^{(t)}$ to a new point $x^{(t+1)}=x^{(t)} - \eta \nabla F(x^{(t)})$ for some rate $\eta$ (which could depend on time $t$).<br />
Since the gradient may not be easy to compute, SGD substitutes the gradient at the current point by an unbiased estimator of the gradient.
Thus, the point at time $t$ becomes a random variable $X^{(t)}$. Since the estimate is unbiased, we may write it as</p>
<script type="math/tex; mode=display">\nabla F(X^{(t)}) - \zeta^{(t+1)},</script>
<p>where the expectation of $\zeta^{(t+1)}$ conditioned on $X^{(t)}$ is zero.
Thus, we can write one step of SGD as</p>
<script type="math/tex; mode=display">X^{(t+1)} = \left( X^{(t)} -\eta \nabla F(X^{(t)}) \right) +\eta \cdot \zeta^{(t+1)}.</script>
<p>Comparing it to our evolutionary Markov chain, it can be shown that $f(x)=\frac{QA x}{\Vert QA x\Vert_1}$ is a gradient system (i.e., $f=\nabla G$ for some function $G$) and we may think of the corresponding $\mathcal M$ as SGD with step-size $\eta=1/N$.</p>
<p>There is a vast literature on understanding when SGD converges to the global optimum (for convex $F$) or a local optimum (for <em>reasonable</em> non-convex $F$).
Why can’t we use techniques developed for SGD to analyze our evolutionary Markov chain?
To start with, when the step size does not go to zero with time, $X^{(t)}$ wanders around its domain $\Omega$ and will not converge to a point.
In the case when the step size is fixed, typically, the time <a href="http://arxiv.org/pdf/1306.2119.pdf">average</a> of $X^{(t)}$ is used in the hope that it will converge to a local minimum of the function.
The Ergodic Theorem of Markov chains tells us that the time average will converge to the expectation of a sample drawn from $\pi$, the steady state distribution.
This quantity is the same as the zero of $\nabla F$ <em>only when $\nabla F$ is a linear function</em> (equivalently $F$ is quadratic); <em>certainly not the case in our setting</em>.
Further, the rate of convergence to this expectation is governed by the mixing time of the Markov chain.
Thus, there is no getting around proving a bound on the mixing time.
Moreover, for biological applications (as described in our previous post), we need to know more than the expectation: we need to obtain samples from the steady state distribution $\pi$.
Finally, in several other evolutionary Markov chains of interest, the guiding dynamical system is <em>not a gradient system</em>.
Hence, the desired results in the setting of evolution seem beyond the reach of current techniques.</p>
<blockquote>
<p>The reason for taking this detour and making the connection to SGD is not only to show that completely different sounding problems and areas might be related, but also that the techniques we develop in analyzing evolutionary Markov chains find use in understanding SGD beyond the quadratic case.</p>
</blockquote>
<h2 id="the-landscape-of-the-expected-motion-governs-the-mixing-time">The Landscape of the Expected Motion Governs the Mixing Time</h2>
<p>Now we delve into our results and proof ideas.
We derive all of the information we need to bound the mixing time of $\mathcal M$ from the limit sets of $f$ which guides it. Roughly, we show that when the limit set of $f$ consists of a unique stable fixed point (which is akin to convexity) as in asexual evolution, then the mixing is fast and in the case of multiple stable fixed points (which is akin to non-convexity) as in sexual evolution, the mixing is slow.</p>
<p>We saw in our first <a href="http://www.offconvex.org/2015/12/21/dynamical-systems-1/">post</a> that the dynamical system $f(x)=\frac{QAx}{\Vert QAx\Vert_1}$ corresponding to the case of asexual evolution has exactly one fixed point in the simplex, say $ x^\star$, when $QA$ is positive.
In fact, $x^\star$ is stable and, no matter where we initiate the dynamical system, it ends up close to $x^\star$ in a small number of iterations (which does not depend on $N$).</p>
<p>Back to mixing time: a generic technique to bound the mixing time of a Markov chain employs a <em>coupling</em> of two copies of the chain $X^{(t)}$ and $Y^{(t)}$.</p>
<blockquote>
<p>A coupling of a Markov chain $\mathcal M$ is a function which takes as input $X^{(t)}$ and $Y^{(t)}$ and outputs $X^{(t+1)}$ and $Y^{(t+1)}$ such that each of $X^{(t+1)}$ and $Y^{(t+1)}$, when considered on their own, is a correct instantiation of one step of $\mathcal M$ from the states $X^{(t)}$ and $Y^{(t)}$ respectively. However, $X^{(t+1)}$ and $Y^{(t+1)}$ are allowed to be arbitrarily correlated.</p>
</blockquote>
<p>For example, we could couple $X^{(t)}$ and $Y^{(t)}$ such that if $X^{(t)} = Y^{(t)}$ then $X^{(t+1)}=Y^{(t+1)}$. More generally, we can consider the <em>distance</em> between $X^{(t)}$ and $Y^{(t)}$, and consider a coupling that contracts the distance between them. If this distance is contractive by, say, a factor of $\rho<1$ at every time step, then the number of iterations required to reduce distance below $1/N$ is about $\log_{1/\rho} N$; this roughly upper bounds the mixing time.</p>
<p style="text-align:center;">
<img src="/assets/coupling.jpg" alt="" />
</p>
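<p>The $\log_{1/\rho} N$ count can be made concrete with a deterministic caricature of the argument (the real statement is about expected distance under the coupling). The helper below is a hypothetical illustration: it counts how many multiplications by $\rho$ are needed before a starting distance <code>d0</code> drops below $1/N$, with the default <code>d0 = 2</code> being the largest possible $\ell_1$ distance between two points of the simplex.</p>

```python
def steps_to_contract(rho, N, d0=2.0):
    """Number of steps until d0 * rho**k falls strictly below 1/N (requires 0 < rho < 1)."""
    if not 0.0 < rho < 1.0:
        raise ValueError("rho must lie strictly between 0 and 1")
    d, k = d0, 0
    while d >= 1.0 / N:
        d *= rho
        k += 1
    return k


if __name__ == "__main__":
    for N in (10, 1000, 100000):
        print(N, steps_to_contract(0.5, N))
```

<p>The count grows by a constant each time $N$ is multiplied by a constant factor, which is the logarithmic dependence on $N$ claimed above.</p>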
<p>The key observation that connects the dynamical system $f$ and our Markov chain is that using the function $f$ we can construct a coupling $\mathcal{C}$ such that for all $x$,$y \in \Omega$,</p>
<script type="math/tex; mode=display">\mathbb{E}_{\mathcal{C}}\left[\|{X}^{(t+1)}-{Y}^{(t+1)}\|_1 \; | \; {X}^{(t)}=x, {Y}^{(t)}=y \right]=\|f(x)-f(y)\|_1.</script>
<p>Thus, if $ \Vert f(x)-f(y)\Vert_1 \le \rho \cdot \Vert x-y\Vert_1$ for some $\rho<1$ and <em>all</em> $x,y \in \Omega$, we would be done.
The bad news is that we can show that there are $x,y$ for which $\Vert f(x)-f(y)\Vert_1 > \Vert x-y \Vert_1$ implying that there is no contractive coupling for all $x$ and $y.$</p>
<blockquote>
<p>What about when $x$ and $y$ are close to $x^\star$?</p>
</blockquote>
<p>In this case, by a first order Taylor approximation of the dynamical system $f$, we can bound the contraction $(\rho)$ by the $1 \rightarrow 1$ norm of the <a href="https://en.wikipedia.org/wiki/Jacobian_matrix_and_determinant">Jacobian</a> of $f$ at $x^\star$.
However, this quantity is less than one <a href="http://dl.acm.org/citation.cfm?id=2722129.2722234">only</a> when $m=2$, see <a href="http://dl.acm.org/citation.cfm?id=2722129.2722234">here</a>.
For larger $m$, we have to go back to our intuition from dynamical systems and, using the fact that all trajectories of $f$ converge to $x^\star$, argue that the appropriate norm of the Jacobian of $f^k$ (i.e., $f$ applied $k$ times) is contractive.
While there are a few technical challenges, we can use $f^k$ to construct a contractive coupling. We then use concentration to handle the case when $x$,$y$ are not close to $x^\star$,
see <a href="http://theory.epfl.ch/vishnoi/Publications_files/PSVSODA16.pdf">here</a> for the details.
As a consequence, we obtain a mixing time of $O(\log N)$ (suppressing other parameters).
Thus, in the world of asexual evolution the steady state can be reached quickly!</p>
<h2 id="markov-chains-guided-by-dynamical-systems---beyond-uniqueness">Markov Chains Guided by Dynamical Systems - Beyond Uniqueness</h2>
<p>Interestingly, this proof does not use any property of $f$ other than that it has a unique fixed point which is stable.
However, in many cases, such as sexual evolution (see <a href="http://theory.epfl.ch/vishnoi/Publications_files/PV16.pdf">here</a> for the model of sexual evolution or an equivalent model for how <em>children acquire grammar</em>, see <a href="http://science.sciencemag.org/content/291/5501/114">here</a>) and <a href="http://www.sciencedirect.com/science/article/pii/S0022519303931997">here</a>, the expected motion has multiple fixed points - some stable and some unstable.
Such a dynamical system is inherently non-convex - trajectories starting at different points could converge to different points.
Further, the presence of unstable fixed points can slow down trajectories and, hence, the mixing time.
In <a href="http://theory.epfl.ch/vishnoi/Publications_files/PV16.pdf">this</a> paper, we give a comprehensive treatment about how the landscape of the limit sets determines the mixing time of evolutionary Markov chains.
In a nutshell, while the presence of unstable fixed points does not seem to affect the mixing time, the presence of two stable fixed points results in the mixing time being $\exp(N)$!</p>
<blockquote>
<p>This result allows us to prove a phase transition in the mixing time for an evolutionary Markov chain with sex, where changing the mutation parameter changes the geometry of the limit sets of the expected motion from multiple stable fixed points to a unique stable fixed point.</p>
</blockquote>
<h2 id="evolution-on-structured-populations">Evolution on Structured Populations?</h2>
<p>A challenging problem left open by our work is to try to estimate the mixing time of evolutionary dynamics on <em>structured</em> populations which arise in ecology.
Roughly, this setting extends the evolutionary models discussed thus far by introducing an
additional input parameter, a graph on $N$ vertices.</p>
<blockquote>
<p>The graph
provides structure to the population by locating each individual at a
vertex, and, at time $t+1$, a vertex determines its type by sampling with replacement from among its neighbors in the graph at time $t$; see <a href="http://www.nature.com/nature/journal/v433/n7023/full/nature03204.html">this</a> paper for more details.</p>
</blockquote>
<p>The model we discussed so far can be seen as a special case when the underlying graph is the complete graph on $N$ vertices.
The difficulty is twofold: it is no longer sufficient to keep track of the number of individuals of each type, and the variance of the noise is no longer $1/N$ - it could be large if a vertex has small degree.</p>
<h2 id="the-expected-motion-revisited">The Expected Motion Revisited</h2>
<p>Now we formally define the expected motion of any Markov chain with respect to a function $\phi$ from its state space $\Omega$ to $\mathbb{R}^n$.
If $X^{(t)}=x$ is the state of the Markov chain at time $t$ and $X^{(t+1)}$ its state at time $t+1,$ then the expected motion of $\phi$ for the chain at $x$ is</p>
<script type="math/tex; mode=display">\mathbb{E}\left[\phi(X^{(t+1)}) \;| \;X^{(t)} =x \right]</script>
<p>where the expectation is taken over one step of the chain.
Often, and in the application we presented in this post, the state space $\Omega$ already has a geometric structure and is a subset of $\mathbb{R}^n$.
In this case, there is a canonical expected motion which corresponds to $\phi$ being just the identity map.</p>
<blockquote>
<p>What can the expected motion of a Markov chain tell us about the Markov chain itself?</p>
</blockquote>
<p>Of course, without imposing additional structure on the Markov chain or $\phi$, the answer is unlikely to be very interesting. However, the results in this post suggest that thinking of a Markov chain in this way can be quite useful.</p>
<h2 id="to-conclude-">To Conclude …</h2>
<p>In this post, hopefully, you got a flavor of how techniques from dynamical systems can be used to derive interesting properties of Markov chains and stochastic processes.
We also saw that nature’s methods, in the context of evolution, seem quite close to the methods of choice of humans - <em>is this a coincidence</em>?
In a future post, we will show another <a href="http://arxiv.org/abs/1601.02712">example</a> of this phenomenon - the famous iteratively reweighted least squares (IRLS) in sparse recovery turns out to be identical to the dynamics of an organism found in nature - the slime mold.</p>
Mon, 04 Apr 2016 14:00:00 -0700
http://offconvex.github.io/2016/04/04/markov-chains-dynamical-systems/
http://offconvex.github.io/2016/04/04/markov-chains-dynamical-systems/Saddles Again<p>Thanks to Rong for the <a href="http://www.offconvex.org/2016/03/22/saddlepoints/">very nice blog post</a> describing critical points of nonconvex functions and how to avoid them. I’d like to follow up on his post to highlight a fact that is not widely appreciated in nonlinear optimization. Though we often teach the contrary in our intro courses, it is in fact super hard to converge to a saddle point. (Just look at those pictures in Rong’s post! If you move ever so slightly you fall off the saddle). Even simple algorithms like gradient descent with constant step sizes can’t converge to saddle points unless you try really hard.</p>
<h2 id="its-hard-to-converge-to-a-saddle">It’s hard to converge to a saddle.</h2>
<p>To illustrate why gradient descent would not converge to non-minimizing saddle points, consider the case of a non-convex quadratic, $f(x)=\frac{1}{2} \sum_{i=1}^d a_i x_i^2$. Assume that $a_i$ is positive for the first $k$ values and is strictly negative for the last $d-k$ values. The unique stationary point of this problem is $x=0$. The Hessian at $0$ is simply the diagonal matrix with $H_{ii} = a_i$ for $i=1,\ldots,d$.</p>
<p>Now what happens when we run gradient descent on this function from some initial point $x^{(0)}?$ The gradient method has iterates of the form</p>
<script type="math/tex; mode=display">x^{(k+1)} = x^{(k)} - t \nabla f(x^{(k)})\,.</script>
<p>For our function, this takes the form</p>
<script type="math/tex; mode=display">x^{(k+1)}_i = (1- t a_i) x_i^{(k)}</script>
<p>If one unrolls this recursive formula down to zero, we see that the $i$th coordinate of the $k$th iterate is given by the formula</p>
<script type="math/tex; mode=display">x_{i}^{(k)} = (1-t a_i)^{k} x_i^{(0)}\,.</script>
<p>One can immediately see from this expression that if the step size $t$ is chosen such
that $t |a_i| < 1 $
for all $i$, then when all of the $a_i$
are nonnegative, the algorithm converges to a point where the gradient is equal to zero from any starting point. But if there is <em>a single negative $a_i$</em>, the function value diverges to negative infinity exponentially quickly from any randomly chosen starting point.</p>
<p>The random initialization is key here. If we initialized the problem such that $x^{(0)}_i=0$ whenever $a_i<0$, then the algorithm would actually converge. However, under the smallest perturbation away from this initial condition, gradient descent diverges to negative infinity.</p>
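<p>The fragility described above shows up immediately in a small experiment. The sketch below uses hypothetical helper names and made-up coefficients; it applies the coordinatewise recursion $x_i \leftarrow (1 - t a_i) x_i$ to a quadratic with one negative $a_i$, once starting exactly on the set where the negative coordinate is zero and once with a tiny perturbation.</p>

```python
def gd_quadratic(a, x0, t, steps):
    """Gradient descent on f(x) = 0.5 * sum(a_i * x_i^2): x_i <- (1 - t * a_i) * x_i."""
    x = list(x0)
    for _ in range(steps):
        x = [(1.0 - t * ai) * xi for ai, xi in zip(a, x)]
    return x


def f(a, x):
    return 0.5 * sum(ai * xi * xi for ai, xi in zip(a, x))


if __name__ == "__main__":
    a = [1.0, 0.5, -0.5]   # one negative curvature direction
    t = 0.5                # satisfies t * |a_i| < 1 for all i
    on_manifold = gd_quadratic(a, [1.0, 1.0, 0.0], t, 100)
    perturbed = gd_quadratic(a, [1.0, 1.0, 1e-8], t, 100)
    print(f(a, on_manifold))  # essentially zero: converged to the saddle
    print(f(a, perturbed))    # large and negative: escaped along the third coordinate
```

<p>The negative coordinate is multiplied by $1 - t a_i > 1$ at every step, so even a perturbation of size $10^{-8}$ takes over after a modest number of iterations.</p>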
<p>Most of the examples showing that algorithms converge to stationary points are fragile in a similar way. You have to try very hard to make an algorithm converge to a saddle point. As an example of this phenomenon for a non-quadratic function, consider the following example from Nesterov’s revered <a href="http://www.springer.com/us/book/9781402075537">Introductory Lectures on Convex Optimization</a>. Let $f(x,y) = \frac12 x^2 +\frac14 y^4-\frac12 y^2$. The critical points of this function are $z^{(1)}= (0,0)$, $z^{(2)} = (0,-1)$ and $z^{(3)} = (0,1)$. The points $z^{(2)}$ and $z^{(3)}$ are local minima, and $z^{(1)}$ is a saddle point. Now observe that gradient descent initialized from any point of the form $(x,0)$ converges to the saddle point $z^{(1)}$. <em>From any other initial point,</em> gradient descent converges to a local minimum. If one chooses an initial point at random, then gradient descent does not converge to a saddle point <em>with probability one.</em></p>
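<p>Nesterov’s example can be checked the same way. The following sketch (the step size $0.1$ and the iteration count are arbitrary choices made for illustration) runs gradient descent on $f(x,y) = \frac12 x^2 +\frac14 y^4-\frac12 y^2$, whose gradient is $(x,\; y^3 - y)$, from a point on the line $y=0$ and from two points just off it.</p>

```python
def grad(x, y):
    """Gradient of f(x, y) = x^2/2 + y^4/4 - y^2/2."""
    return x, y ** 3 - y


def gradient_descent(x, y, t=0.1, steps=2000):
    """Plain gradient descent with constant step size t."""
    for _ in range(steps):
        gx, gy = grad(x, y)
        x, y = x - t * gx, y - t * gy
    return x, y


if __name__ == "__main__":
    print(gradient_descent(1.0, 0.0))    # stays on y = 0 and approaches the saddle (0, 0)
    print(gradient_descent(1.0, 1e-6))   # approaches the local minimum (0, 1)
    print(gradient_descent(1.0, -1e-6))  # approaches the local minimum (0, -1)
```

<p>Starting exactly on $y=0$ keeps the $y$-gradient identically zero, which is the only way the iterates can reach the saddle here.</p>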
<h3 id="the-stable-manifold-theorem-and-random-initialization">The Stable Manifold Theorem and random initialization</h3>
<p>In recent work with <a href="http://arxiv.org/abs/1602.04915">Jason Lee, Max Simchowitz, and Mike Jordan</a>, we made this result precise using the Stable Manifold Theorem from dynamical systems. The Stable Manifold Theorem is concerned with fixed point iterations of the form $x^{(k+1)} = \Psi(x^{(k)})$. It states that the set of points that locally converge to a fixed point $x^{\star}$ of such an iteration has measure zero whenever the Jacobian of $\Psi$ at $x^{\star}$ has an eigenvalue of magnitude bigger than 1.</p>
<p>With a fairly straightforward argument, we were able to show that the gradient descent algorithm satisfied the assumptions of the Stable Manifold Theorem, and, moreover, that the set of points that converge to strict saddles <em>always</em> has measure zero. This formalizes the above argument. If you pick a point at random and run gradient descent, you will never converge to a saddle point. While this doesn’t give a precise rate on the number of iterations, we show that if all of the local minima satisfy the <a href="https://regularize.wordpress.com/2013/09/25/the-kurdyka-lojasiewicz-inequality-and-gradient-descent-methods/">Kurdyka-Lojasiewicz inequality</a>, then one can derive quantitative convegence rates.</p>
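<p>A short calculation suggests why the eigenvalue condition holds at a strict saddle. This is a sketch under the assumption that gradient descent uses a fixed step size $\eta > 0$, small enough that the iteration map is invertible; the symbols $\eta$ and $\lambda_i$ are introduced here for illustration. The iteration map is $\Psi(x) = x - \eta \nabla f(x)$, so its Jacobian at a critical point $x^{\star}$ is</p>
<script type="math/tex; mode=display">D\Psi(x^{\star}) = I - \eta \nabla^2 f(x^{\star}).</script>
<p>If $\lambda_1, \ldots, \lambda_n$ are the eigenvalues of the Hessian $\nabla^2 f(x^{\star})$, then the eigenvalues of $D\Psi(x^{\star})$ are $1 - \eta \lambda_i$. At a strict saddle some $\lambda_i$ is negative, and for that index $1 - \eta\lambda_i > 1$, which is exactly the kind of expanding direction the theorem asks for.</p>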
<p>In some sense, optimizers would not be particularly surprised by this theorem. We are sure that some version of our result is already known for gradient descent, but we couldn’t find it in the literature. If you can find an earlier reference proving this theorem we would be delighted if you’d let us know.</p>
<h3 id="adding-noise">Adding noise</h3>
<p>As Rong <a href="http://www.offconvex.org/2016/03/22/saddlepoints/">discussed</a>, his paper with Huang, Jin, and Yuan shows that adding Gaussian noise to the gradient helps to avoid saddle points. In particular, they introduce the notion of <em>strict saddle</em> functions: those where every critical point is either a local minimum or has a Hessian with a negative eigenvalue bounded away from 0. As we saw above, if the Hessian at a saddle point has a negative eigenvalue, the set of initial conditions that converge to that point has measure zero. But when we add noise to the gradient, there are <em>no initial conditions</em> that converge to saddles. The noise immediately pushes you off this low-dimensional manifold.</p>
<p>Interestingly, a similar result also follows from the Stable Manifold Theorem. Indeed, Robin Pemantle <a href="https://www.math.upenn.edu/~pemantle/papers/nonconvergence.pdf">developed a more general result</a> for stochastic processes. Pemantle uses the Stable Manifold Theorem to show that general vector flows perturbed by noise cannot converge to unstable fixed points. As a special case, he proves that stochastic gradient descent cannot converge to a saddle point provided the gradient noise is sufficiently diverse. In particular, this implies that additive Gaussian noise is sufficient to prevent convergence to saddles.</p>
<p>Pemantle does not have to assume the strict saddle point condition to prove his theorem. However, additional work would be required to extract from Pemantle’s argument the sort of quantitative convergence bounds that Rong and his coauthors derive.</p>
<h2 id="what-makes-nonconvex-optimization-difficult">What makes nonconvex optimization difficult?</h2>
<p>If saddle points are easy to avoid, then the question remains: what exactly makes nonconvex optimization difficult? In my next post, I’ll explore why this question is so challenging, describing some apparently innocuous problems in optimization that are deviously difficult.</p>
Thu, 24 Mar 2016 02:00:00 -0700
http://offconvex.github.io/2016/03/24/saddles-again/
Escaping from Saddle Points<p>Convex functions are simple: they usually have only one local minimum. Non-convex functions can be much more complicated. In this post we will discuss various types of <em>critical points</em> that you might encounter when you go <em>off the convex path</em>. In particular, we will see that in many cases simple heuristics based on gradient descent can lead you to a <em>local minimum</em> in polynomial time.</p>
<h2 id="various-types-of-critical-points">Various Types of Critical Points</h2>
<p><img src="/assets/saddle/minmaxsaddle.png" alt="Local Minimum, Local Maximum and Saddle Point" /></p>
<p>To minimize the function $f:\mathbb{R}^n\to \mathbb{R}$, the most popular approach is to follow the opposite direction of the gradient $\nabla f(x)$ (for simplicity, all functions we talk about are infinitely differentiable), that is,</p>
<script type="math/tex; mode=display">y = x - \eta \nabla f(x),</script>
<p>Here $\eta$ is a small step size. This is the <em>gradient descent</em> algorithm.</p>
<p>Whenever the gradient $\nabla f(x)$ is nonzero, as long as we choose a small enough $\eta$, the algorithm is guaranteed to make <em>local</em> progress. When the gradient $\nabla f(x)$ is equal to $\vec{0}$, the point is called a <strong>critical point</strong>, and the gradient descent algorithm gets stuck. For (strongly) convex functions, there is a unique <em>critical point</em> that is also the <em>global minimum</em>.</p>
<p>However, for non-convex functions, a zero gradient alone is not enough. A simple example is the function</p>
<script type="math/tex; mode=display">f(x) = x_1^2 - x_2^2.</script>
<p>At $x = (0,0)$, the gradient is $\vec{0}$, but the point is clearly not a local minimum, as the nearby point $(0, \epsilon)$ has a smaller function value. The point $(0,0)$ is called a <em>saddle point</em> of this function.</p>
<p>To distinguish these cases we need to consider the second order derivative $\nabla^2 f(x)$, an $n\times n$ matrix (usually known as the <em>Hessian</em>) whose $i,j$-th entry is equal to $\frac{\partial^2}{\partial x_i \partial x_j} f(x)$. When the Hessian at a critical point $x$ is positive definite (which means $u^\top\nabla^2 f(x) u > 0$ for any $u\ne 0$), the first order term vanishes and the second order Taylor expansion gives, for any direction $u$ and small $\eta$,
<script type="math/tex">f(x + \eta u) \approx f(x) + \frac{\eta^2}{2} u^\top\nabla^2 f(x) u > f(x),</script>
therefore $x$ must be a local minimum. Similarly, when the Hessian is negative definite, the point is a local maximum; when the Hessian has both positive and negative eigenvalues, the point is a <em>saddle point</em>.</p>
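<p>The second order test above can be sketched in a few lines. The example below is restricted to two variables so that the Hessian eigenvalues have a closed form; the function names and the tolerance <code>tol</code> are hypothetical choices made for this illustration.</p>

```python
import math


def eigenvalues_2x2(h):
    """Eigenvalues of a symmetric 2x2 matrix [[a, b], [b, c]], smallest first."""
    (a, b), (_, c) = h
    mean = (a + c) / 2.0
    radius = math.hypot((a - c) / 2.0, b)
    return mean - radius, mean + radius


def classify_critical_point(h, tol=1e-12):
    """Classify a critical point from its Hessian using the eigenvalue signs."""
    lo, hi = eigenvalues_2x2(h)
    if lo > tol:
        return "local minimum"      # positive definite
    if hi < -tol:
        return "local maximum"      # negative definite
    if lo < -tol and hi > tol:
        return "saddle point"       # both signs present
    return "degenerate (test inconclusive)"


# The Hessian of x1^2 - x2^2 at (0, 0) is diag(2, -2).
print(classify_critical_point([[2.0, 0.0], [0.0, -2.0]]))  # saddle point
```

<p>The last branch covers Hessians with a zero eigenvalue, where the second order information alone does not decide the type of the critical point.</p>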
<p>It is believed that for many problems including <a href="http://arxiv.org/abs/1412.0233">learning deep nets</a>, almost all local minima have function values very similar to the global optimum, and hence finding a local minimum is good enough. However, in general it is NP-hard to even find a local minimum (see the discussion in <a href="http://arxiv.org/abs/1602.05908">Anandkumar, Ge 2016</a>). Many popular optimization techniques in practice are <em>first order</em> optimization algorithms: they only look at the gradient information, and never explicitly compute the Hessian. Such algorithms may get <em>stuck</em> at saddle points.</p>
<p>In the rest of the post, we will first see that getting stuck at saddle points is a very realistic possibility since most natural objective functions have <em>exponentially</em> many saddle points. We will then discuss how optimization algorithms can try to escape from saddle points.</p>
<h2 id="symmetry-and-saddle-points">Symmetry and Saddle Points</h2>
<p>Many learning problems can be abstracted as searching for $k$ distinct <em>components</em> (sometimes called <em>features</em>, <em>centers</em>,…). For example, in the <a href="https://en.wikipedia.org/wiki/Cluster_analysis">clustering</a> problem, there are $n$ points, and we are searching for $k$ components that minimize the sum of distances of points to their nearest center. In a two-layer <a href="https://en.wikipedia.org/wiki/Artificial_neural_network">neural network</a>, we try to find a network with $k$ distinct <em>neurons</em> at the middle layer. In my <a href="http://www.offconvex.org/2015/12/17/tensor-decompositions/">previous post</a> I talked about <em>tensor decomposition</em>, which also looks for $k$ distinct <em>rank-1 components</em>.</p>
<p>A popular way to solve these problems is to design an objective function: let $x_1, x_2, \ldots, x_k \in \mathbb{R}^n$ denote the desired centers and let the objective function $f(x_1,\ldots,x_k)$ measure the quality of the solution. The function is minimized when the vectors $x_1,x_2,\ldots,x_k$ are the $k$ components that we are looking for.</p>
<p>A natural reason why any such problem is inherently non-convex is <em>permutation symmetry</em>. For instance, if we swap the order of the first and second components, the solutions are equivalent. Namely,
<script type="math/tex">f(x_1,x_2,\ldots,x_k) = f(x_2, x_1,\ldots,x_k).</script></p>
<p>However, if we take the average of these two solutions, we will end up with the solution $\frac{x_1+x_2}{2}, \frac{x_1+x_2}{2}, x_3,\ldots,x_k$, which is <em>not equivalent</em>! If the original solution is optimal, this average is likely to be suboptimal. Therefore the objective function cannot be convex, because for convex functions the average of optimal solutions is still optimal.</p>
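<p>A tiny clustering instance makes the averaging argument concrete. This is only a sketch: the one-dimensional data points and the two centers are hypothetical, and the objective is the sum of distances to the nearest center described earlier.</p>

```python
def clustering_cost(points, centers):
    """Sum of distances from each point to its nearest center (1-D)."""
    return sum(min(abs(p - c) for c in centers) for p in points)


points = [-1.1, -0.9, 0.9, 1.1]   # hypothetical data: two well-separated groups

solution = (-1.0, 1.0)            # an optimal pair of centers
swapped = (1.0, -1.0)             # the same solution with components permuted
average = tuple((u + v) / 2 for u, v in zip(solution, swapped))

cost_solution = clustering_cost(points, solution)
cost_swapped = clustering_cost(points, swapped)
cost_average = clustering_cost(points, average)

print(cost_solution, cost_swapped, cost_average)
```

<p>The two permuted solutions have the same cost, but their average places both centers at $0$ and is far more expensive. For a convex objective the cost at the average could not exceed the common cost of the two solutions.</p>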
<p><img src="/assets/saddle/equivalent.png" alt="Symmetry" /></p>
<p>There are <em>exponentially</em> many globally optimal solutions that are all permutations of the same solution. Saddle points arise naturally on the paths that connect these <em>isolated</em> local minima. The figure below shows the function $f(x) = x_1^4-2x_1^2 + x_2^2$: between the two symmetric local minima $(-1,0)$ and $(1,0)$, the point $(0,0)$ is a saddle point.</p>
<p style="text-align:center;">
<img src="/assets/saddle/symmetrysmall.png" alt="Symmetry and Saddle Points" />
</p>
<h2 id="escaping-from-saddle-points">Escaping from Saddle Points</h2>
<p>In order to optimize these non-convex functions with many saddle points, optimization algorithms need to make progress even at (or near) saddle points. The simplest way to do this is by using the second order Taylor’s expansion:</p>
<script type="math/tex; mode=display">% <![CDATA[
f(y) \approx f(x) + \left<\nabla f(x), y-x\right>+\frac{1}{2} (y-x)^\top \nabla^2 f(x) (y-x). %]]></script>
<p>If the gradient $\nabla f(x)$ is $\vec{0}$, we can still hope to find a vector $u$ where $u^\top \nabla^2 f(x) u < 0$. This way if we let $y = x+\eta u$, the function value of $f(y)$ is likely to be smaller. Many optimization algorithms such as <a href="http://link.springer.com/article/10.1007%2Fs10107-015-0893-2">trust region algorithms</a> and <a href="http://link.springer.com/article/10.1007%2Fs10107-006-0706-8">cubic regularization</a> use this idea, and they can escape from saddle points in polynomial time for nice functions.</p>
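<p>The idea can be checked on the function $x_1^4-2x_1^2 + x_2^2$ from the previous section, whose saddle point is $(0,0)$. The sketch below takes a single step along a direction of negative curvature; it is not an implementation of trust region or cubic regularization methods, and the step length <code>eta</code> is a hypothetical choice.</p>

```python
def f(x1, x2):
    return x1**4 - 2 * x1**2 + x2**2


def grad(x1, x2):
    return (4 * x1**3 - 4 * x1, 2 * x2)


def hessian(x1, x2):
    return ((12 * x1**2 - 4, 0.0), (0.0, 2.0))


def quad_form(h, u):
    """Compute u^T H u for a 2x2 matrix H."""
    return sum(u[i] * h[i][j] * u[j] for i in range(2) for j in range(2))


x = (0.0, 0.0)   # saddle point: the gradient vanishes here
u = (1.0, 0.0)   # direction with u^T H u < 0
eta = 0.1        # hypothetical step length

y = (x[0] + eta * u[0], x[1] + eta * u[1])

print(grad(*x), quad_form(hessian(*x), u), f(*x), f(*y))
```

<p>The gradient at $x$ gives no descent direction, yet moving along $u$ lowers the function value, as the second order term predicts.</p>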
<h3 id="strict-saddle-functions">Strict Saddle Functions</h3>
<p>As we discussed, in general it is NP-hard to find a local minimum and many algorithms may get stuck at a saddle point. How many steps do we need to escape from a saddle point? This is related to how <em>well-behaved</em> the saddle points are. Intuitively, a saddle point $x$ is well-behaved if there is a direction $u$ such that the second order term $u^\top \nabla^2 f(x) u$ is significantly smaller than 0; geometrically, this means there is a steep direction along which the function value decreases. To quantify this, <a href="http://arxiv.org/abs/1503.02101">my paper with Furong Huang, Chi Jin and Yang Yuan</a> introduced the notion of <em>strict saddle</em> functions (also known as “ridable” functions in <a href="http://arxiv.org/abs/1510.06096">Sun et al. 2015</a>).</p>
<blockquote>
<p>A function $f(x)$ is <em>strict saddle</em> if all points $x$ satisfy at least one of the following<br /></p>
<ol>
<li>Gradient $\nabla f(x)$ is large. <br /></li>
<li>Hessian $\nabla^2 f(x)$ has a negative eigenvalue that is bounded away from 0.<br /></li>
<li>Point $x$ is near a local minimum.</li>
</ol>
</blockquote>
<p>Essentially, the local region of every point $x$ looks like one of the following pictures:</p>
<p><img src="/assets/saddle/strictsaddle.png" alt="Symmetry" /></p>
<p>For such functions, <a href="http://link.springer.com/article/10.1007%2Fs10107-015-0893-2">trust region algorithms</a> and <a href="http://link.springer.com/article/10.1007%2Fs10107-006-0706-8">cubic regularization</a> can find a local minimum efficiently.</p>
<blockquote>
<p><strong>Theorem (Informal)</strong> There are polynomial time algorithms that can find a local minimum of strict saddle functions.</p>
</blockquote>
<p>What functions are strict saddle? <a href="http://arxiv.org/abs/1503.02101">Ge et al. 2015</a> showed that a <a href="http://www.offconvex.org/2015/12/17/tensor-decompositions/">tensor decomposition</a> problem is strict saddle. <a href="http://arxiv.org/abs/1510.06096">Sun et al. 2015</a> observed that problems like complete <a href="https://en.wikipedia.org/wiki/Machine_learning#Sparse_dictionary_learning">dictionary learning</a> and <a href="https://en.wikipedia.org/wiki/Phase_retrieval">phase retrieval</a> are also strict saddle.</p>
<h3 id="first-order-method-to-escape-from-saddle-points">First Order Method to Escape from Saddle Points</h3>
<p>Trust region algorithms are very powerful. However they need to compute the second order derivative of the objective function, which is often too expensive in practice. If the algorithm can only access the gradient of the function, is it still possible to escape from saddle points?</p>
<p>This might seem hard, as the gradient at a saddle point is $\vec{0}$ and does not give us any information. However, the key observation here is that saddle points are very <em>unstable</em>: if we put a ball on a saddle point and then slightly perturb it, the ball is likely to fall! Of course we need to make this intuition formal in higher dimensions, since naively finding the direction in which to fall seems to require computing the eigenvector of the Hessian matrix with the smallest eigenvalue.</p>
<p>To formalize this intuition we will try to use <em>noisy gradient descent</em>:</p>
<blockquote>
<p>$y = x - \eta \nabla f(x) + \epsilon.$</p>
</blockquote>
<p>Here $\epsilon$ is a noise vector that has mean $0$. This additional noise is going to deliver the initial <em>nudge</em> that makes the ball fall along the slope.</p>
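<p>The update above can be tried on the function $x_1^4-2x_1^2 + x_2^2$ from earlier in the post. This is a minimal sketch, not the exact algorithm analyzed in the paper: the Gaussian noise model, the noise scale, the step size, and the seed are all assumptions made for illustration.</p>

```python
import random


def grad(x1, x2):
    """Gradient of f(x1, x2) = x1**4 - 2*x1**2 + x2**2."""
    return 4 * x1**3 - 4 * x1, 2 * x2


def noisy_gradient_descent(start, step=0.05, noise=1e-3, iters=500, seed=0):
    """Gradient descent with zero-mean Gaussian noise added to every update."""
    rng = random.Random(seed)
    x1, x2 = start
    for _ in range(iters):
        g1, g2 = grad(x1, x2)
        # The two rng.gauss draws play the role of the noise vector epsilon.
        x1 = x1 - step * g1 + rng.gauss(0.0, noise)
        x2 = x2 - step * g2 + rng.gauss(0.0, noise)
    return x1, x2


plain = noisy_gradient_descent((0.0, 0.0), noise=0.0)   # no noise: stays put
noisy = noisy_gradient_descent((0.0, 0.0))              # noise: escapes

print(plain)
print(noisy)
```

<p>Started exactly at the saddle point $(0,0)$, the noiseless run never moves. With noise, the first coordinate is pushed off zero, grows along the negative-curvature direction, and settles near one of the two local minima $(\pm 1, 0)$.</p>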
<p>In fact, it is often much cheaper to compute a noisy gradient than the true gradient. This is the key idea in <a href="https://en.wikipedia.org/wiki/Stochastic_gradient_descent">stochastic gradient</a>, and a large body of work shows that the noise does not interfere with convergence for convex optimization. For non-convex optimization, people intuitively believed that the inherent noise <em>helps</em> convergence because it pushes the current point away from <em>saddle points</em>. It’s not a bug, it’s a feature!</p>
<p style="text-align:center;">
<img src="/assets/saddle/escapesmall.png" alt="Escaping from saddle points" />
</p>
<p>Previously, there was no good upper bound known on the number of iterations needed to escape saddle points and arrive at a local minimum. In <a href="http://arxiv.org/abs/1503.02101">Ge et al. 2015</a>, we show:</p>
<blockquote>
<p><strong>Theorem (Informal)</strong> Noisy gradient descent can find a local minimum of strict saddle functions in polynomial time.</p>
</blockquote>
<p>The polynomial dependencies on the dimension $n$ and on the smallest eigenvalue of the Hessian are of fairly high degree, so the bound is not very practical. It is an open problem to find the optimal convergence rate for strict saddle problems.</p>
<p>A recent subsequent paper by <a href="http://arxiv.org/abs/1602.04915">Lee et al.</a> showed that even without adding noise, gradient descent will not converge to any strict saddle point if the initial point is chosen randomly. However, their result relies on the <a href="https://en.wikipedia.org/wiki/Stable_manifold_theorem">Stable Manifold Theorem</a> from dynamical systems theory, which inherently does not provide any upper bound on the number of steps.</p>
<h2 id="beyond-simple-saddle-points">Beyond Simple Saddle Points</h2>
<p>We have seen algorithms that can handle (simple) saddle points. However, non-convex problems can have much more complicated landscapes that involve <em>degenerate</em> saddle points, that is, points whose Hessian is positive semidefinite and has 0 eigenvalues. Such degenerate structure often indicates a complicated saddle point (such as a <a href="https://en.wikipedia.org/wiki/Monkey_saddle">monkey saddle</a>, Figure (a)) or a set of connected saddle points (Figures (b)(c)). In <a href="http://arxiv.org/abs/1602.05908">Anandkumar, Ge 2016</a> we gave an algorithm that can deal with some of these <em>degenerate</em> saddle points.</p>
<p><img src="/assets/saddle/highorder.png" alt="Higher order saddle points" /></p>
<p>The landscapes of non-convex functions can be very complicated, and there are still many open problems. What other functions are <em>strict saddle</em>? How do we make optimization algorithms that work even when there are degenerate saddle points or even <em>spurious</em> local minima? We hope more researchers will be interested in these problems!</p>
Tue, 22 Mar 2016 02:00:00 -0700
http://offconvex.github.io/2016/03/22/saddlepoints/