Off the convex pathAlgorithms off the convex path.
http://offconvex.github.io/
Generalization Theory and Deep Nets, An introduction<p>Deep learning holds many mysteries for theory, as we have discussed on this blog. Lately many ML theorists have become interested in the generalization mystery: why do trained deep nets perform well on previously unseen data, even though they have way more free parameters than the number of datapoints (the classic “overfitting” regime)? Zhang et al.’s paper <a href="https://arxiv.org/abs/1611.03530">Understanding Deep Learning requires Rethinking Generalization</a> played some role in bringing attention to this challenge. Their main experimental finding is that if you take a classic convnet architecture, say <a href="https://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf">Alexnet</a>, and train it on images with random labels, then you can still achieve very high accuracy on the training data. (Furthermore, usual regularization strategies, which are believed to promote better generalization, do not help much.) Needless to say, the trained net is subsequently unable to predict the (random) labels of still-unseen images, which means it doesn’t generalize. The paper notes that the ability to fit a classifier to data with random labels is also a traditional measure in machine learning called Rademacher complexity (which we will discuss shortly) and thus Rademacher complexity gives no meaningful bounds on sample complexity. I found this paper entertainingly written and recommend reading it, despite having given away the punchline. Congratulations to the authors for winning best paper at ICLR 2017.</p>
<p>But I would be remiss if I didn’t report that at the <a href="https://simons.berkeley.edu/programs/machinelearning2017">Simons Institute Semester on theoretical ML in spring 2017</a> generalization theory experts expressed unhappiness about this paper, and especially its title. They felt that similar issues had been extensively studied in the context of simpler models such as kernel SVMs (which, to be fair, is clearly mentioned in the paper). It is trivial to design SVM architectures with high Rademacher complexity which nevertheless train and generalize well on real-life data. Furthermore, theory was developed to explain this generalization behavior (and also for related models like boosting). On a related note, several earlier papers of Behnam Neyshabur and coauthors (see <a href="https://arxiv.org/abs/1605.07154">this paper</a> and for a full account, <a href="https://arxiv.org/abs/1703.11008">Behnam’s thesis</a>)
had made points fairly similar to Zhang et al. pertaining to deep nets.</p>
<p>But regardless of such complaints, we should be happy about the attention brought by Zhang et al.’s paper to a core theory challenge. Indeed, the passionate discussants at the Simons semester themselves banded together in subgroups to address this challenge: these efforts resulted in papers by <a href="https://arxiv.org/abs/1703.11008">Dziugaite and Roy</a>, then <a href="https://arxiv.org/abs/1706.08498">Bartlett, Foster, and Telgarsky</a> and finally <a href="https://arxiv.org/abs/1707.09564">Neyshabur, Bhojanapalli, McAllester, Srebro</a>. (The latter two were presented at NIPS’17 this week.)</p>
<p>Before surveying these results let me start by suggesting that some of the controversy over the title of Zhang et al.’s paper stems from some basic confusion about whether or not current generalization theory is prescriptive or merely descriptive. These confusions arise from the standard treatment of generalization theory in courses and textbooks, as I discovered while teaching the recent developments in <a href="http://www.cs.princeton.edu/courses/archive/fall17/cos597A/">my graduate seminar</a>.</p>
<h3 id="prescriptive-versus-descriptive-theory">Prescriptive versus descriptive theory</h3>
<p>To illustrate the difference, consider a patient who says to his doctor: “Doctor, I wake up often at night and am tired all day.”</p>
<blockquote>
<p>Doctor 1 (without any physical examination): “Oh, you have sleep disorder.”</p>
</blockquote>
<p>I call such a diagnosis <em>descriptive</em>, since it only attaches a label to the patient’s problem, without giving any insight into how to solve the problem. Contrast with:</p>
<blockquote>
<p>Doctor 2 (after careful physical examination): “A growth in your sinus is causing sleep apnea. Removing it will resolve your problems.”</p>
</blockquote>
<p>Such a diagnosis is <em>prescriptive.</em></p>
<h2 id="generalization-theory-descriptive-or-prescriptive">Generalization theory: descriptive or prescriptive?</h2>
<p>Generalization theory notions such as VC dimension, Rademacher complexity, and PAC-Bayes bounds consist of attaching a <em>descriptive label</em> to the basic phenomenon of lack of generalization. They are hard to compute for today’s complicated ML models, let alone to use as a guide in designing learning systems.</p>
<p>Recall what it means for a hypothesis/classifier $h$ to not generalize. Assume the training data consists of a sample $S = \{(x_1, y_1), (x_2, y_2),\ldots, (x_m, y_m)\}$ of $m$ examples from some distribution ${\mathcal D}$. A <em>loss function</em> $\ell$ describes how well hypothesis $h$ classifies a datapoint: the loss $\ell(h, (x, y))$ is high if the hypothesis didn’t come close to producing the label $y$ on $x$ and low if it came close. (To give an example, the <em>regression</em> loss is $(h(x) -y)^2$.) Now let us denote by $\Delta_S(h)$ the average loss on samplepoints in $S$, and by $\Delta_{\mathcal D}(h)$ the expected loss on samples from distribution ${\mathcal D}$.
Training <em>generalizes</em> if the hypothesis $h$ that minimizes $\Delta_S(h)$ for a random sample $S$ also achieves very similarly low loss $\Delta_{\mathcal D}(h)$ on the full distribution. When this fails to happen, we have:</p>
<blockquote>
<p><strong>Lack of generalization:</strong> $\Delta_S(h) \ll \Delta_{\mathcal D}(h) \qquad (1). $</p>
</blockquote>
<p>In practice, lack of generalization is detected by taking a second sample
(“held out set”) $S_2$ of size $m$ from ${\mathcal D}$. By concentration bounds, the average loss of $h$ on this second sample closely approximates $\Delta_{\mathcal D}(h)$, allowing us to conclude</p>
<script type="math/tex; mode=display">\Delta_S(h) - \Delta_{S_2}(h) \ll 0 \qquad (2).</script>
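<p>To make the held-out check concrete, here is a minimal Python sketch. Everything in it is an illustrative assumption rather than part of the discussion above: the labels are pure coin flips, the hypothesis <code>h</code> is a lookup table that memorizes the training sample, and the helper names <code>average_loss</code> and <code>draw_sample</code> are made up for this example.</p>

```python
import random

def average_loss(h, sample):
    """Average 0/1 loss of hypothesis h on a list of (x, y) pairs."""
    return sum(1 for x, y in sample if h(x) != y) / len(sample)

def draw_sample(rng, m):
    """Draw m points whose labels are coin flips, so there is nothing to learn."""
    return [(rng.random(), rng.randint(0, 1)) for _ in range(m)]

rng = random.Random(0)
m = 1000
S = draw_sample(rng, m)    # training sample
S2 = draw_sample(rng, m)   # held-out sample from the same distribution

# A hypothesis that memorizes S and answers 0 on any point it has not seen.
table = dict(S)

def h(x):
    return table.get(x, 0)

train_loss = average_loss(h, S)      # plays the role of Delta_S(h)
heldout_loss = average_loss(h, S2)   # plays the role of Delta_{S_2}(h)
gap = train_loss - heldout_loss      # strongly negative when h fails to generalize
print(train_loss, heldout_loss, gap)
```

<p>The memorizing hypothesis fits the training sample perfectly, while on the held-out sample it can do no better than guessing, so the difference in (2) is far below zero.</p>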
<h3 id="generalization-theory-descriptive-parts">Generalization Theory: Descriptive Parts</h3>
<p>Let’s discuss <strong>Rademacher complexity,</strong> which I will simplify a bit for this discussion. (See also <a href="http://www.cs.princeton.edu/courses/archive/fall17/cos597A/lecnotes/generalize.pdf">scribe notes of my lecture</a>.) For convenience assume in this discussion that labels and loss are $0,1$, and
assume that the badly generalizing $h$ predicts perfectly on the training sample $S$ and is completely wrong on the heldout set $S_2$, meaning</p>
<p>$\Delta_S(h) - \Delta_{S_2}(h) \approx - 1 \qquad (3)$</p>
<p>Rademacher complexity concerns the following thought experiment. Take a single sample of size $2m$ from $\mathcal{D}$, split it into two and call the first half $S$ and the second $S_2$. <em>Flip</em> the labels of points in $S_2$. Now try to find a classifier $h$ that best describes this new sample, meaning one that minimizes $\Delta_S(h) + 1- \Delta_{S_2}(h)$. This expression follows since flipping the label of a point turns good classification into bad and vice versa, and thus the loss function for $S_2$ is $1$ minus the old loss. We say the class of classifiers has high Rademacher complexity if with high probability this quantity is small, say close to $0$.</p>
<p>But a glance at (3) shows that it implies high Rademacher complexity: $S, S_2$ were random samples of size $m$ from $\mathcal{D}$, so their combined size is $2m$, and when generalization failed we succeeded in finding a hypothesis $h$ for which $\Delta_S(h) + 1- \Delta_{S_2}(h)$ is very small.</p>
<p>In other words, returning to our medical analogy, the doctor only had to hear “Generalization didn’t happen” to pipe up with: “Rademacher complexity is high.” This is why I call this result descriptive.</p>
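<p>The label-flipping argument can be checked mechanically. The sketch below is a toy under stated assumptions: the data are integers labeled by parity, $S$ and $S_2$ are two consecutive blocks rather than random draws, and <code>h</code> is hand-built to be right on $S$ and wrong elsewhere, which is the extreme case (3). The function names are hypothetical.</p>

```python
def average_loss(h, sample):
    """Average 0/1 loss of hypothesis h on a list of (x, y) pairs."""
    return sum(1 for x, y in sample if h(x) != y) / len(sample)

def flip_labels(sample):
    """Replace every 0/1 label y by 1 - y."""
    return [(x, 1 - y) for x, y in sample]

def flipped_objective(h, S, S2):
    """Delta_S(h) + 1 - Delta_{S_2}(h), the quantity minimized in the thought experiment."""
    return average_loss(h, S) + 1 - average_loss(h, S2)

def true_label(x):
    return x % 2

m = 50
S = [(x, true_label(x)) for x in range(m)]
S2 = [(x, true_label(x)) for x in range(m, 2 * m)]

def h(x):
    # Correct on the points of S, wrong on every other point.
    return true_label(x) if x < m else 1 - true_label(x)

# Flipping labels turns a loss of L on S2 into a loss of 1 - L.
loss_on_flipped = average_loss(h, flip_labels(S2))
gap = average_loss(h, S) - average_loss(h, S2)
objective = flipped_objective(h, S, S2)
print(loss_on_flipped, gap, objective)
```

<p>Here the generalization gap equals $-1$ and the flipped-label objective equals $0$: the same hypothesis that witnesses the failure of generalization also witnesses high Rademacher complexity.</p>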
<p>The <strong>VC dimension</strong> bound is similarly descriptive. VC dimension is defined to be at least $k$ if there exists a set of size $k$ such that the following is true. If we look at all possible classifiers in the class, and the sequence of labels each gives to the $k$ datapoints in the sample, then we can find all possible $2^{k}$ sequences of $0$’s and $1$’s.</p>
<p>If generalization does not happen as in (2) or (3) then this turns out to imply that VC dimension is at least around $\epsilon m$ for some $\epsilon >0$. The reason is that the $2m$ data points were split randomly into $S, S_2$, and there are $2^{2m}$ such splittings. When the generalization error is $\Omega(1)$ this can be shown to imply that we can achieve $2^{\Omega(m)}$ labelings of the $2m$ datapoints using all possible classifiers. Now the classic Sauer’s lemma (see any lecture notes on this topic, such as <a href="https://www.cs.princeton.edu/courses/archive/spring14/cos511/scribe_notes/0220.pdf">Schapire’s</a>) can be used to show that
VC dimension is at least $\epsilon m/\log m$ for some constant $\epsilon>0$.</p>
<p>Thus again, the doctor only has to hear “Generalization didn’t happen with sample size $m$” to pipe up with: “VC dimension is $\Omega(m/\log m)$.”</p>
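<p>For tiny classes, the “all $2^k$ label sequences” condition in the definition of VC dimension can be checked by brute force. The following Python sketch is only an illustration: the two classes (thresholds and intervals on the line), the finite grid of parameters, and the names <code>labelings</code> and <code>is_shattered</code> are assumptions made for the example. A finite grid can certify that a set <em>is</em> shattered; a negative answer only speaks for the classifiers on the grid.</p>

```python
def labelings(points, hypotheses):
    """Set of distinct label sequences the hypotheses produce on the points."""
    return {tuple(h(x) for x in points) for h in hypotheses}

def is_shattered(points, hypotheses):
    """True if all 2^k label sequences appear on the k given points."""
    return len(labelings(points, hypotheses)) == 2 ** len(points)

def threshold(t):
    """Classifier that outputs 1 exactly when x >= t."""
    return lambda x: 1 if x >= t else 0

def interval(a, b):
    """Classifier that outputs 1 exactly when a <= x <= b."""
    return lambda x: 1 if a <= x <= b else 0

grid = [i / 2 for i in range(-2, 10)]   # -1.0, -0.5, ..., 4.5
thresholds = [threshold(t) for t in grid]
intervals = [interval(a, b) for a in grid for b in grid]

print(is_shattered([1], thresholds))        # one point, thresholds
print(is_shattered([1, 2], thresholds))     # two points, thresholds
print(is_shattered([1, 2], intervals))      # two points, intervals
print(is_shattered([1, 2, 3], intervals))   # three points, intervals
```

<p>Thresholds realize both labels of a single point but never the sequence $(1, 0)$ on two ordered points, and intervals realize every labeling of two points but never $(1, 0, 1)$ on three.</p>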
<p>One can similarly show that PAC-Bayes bounds are also descriptive, as you can see in <a href="http://www.cs.princeton.edu/courses/archive/fall17/cos597A/lecnotes/generalize.pdf">scribe notes from my lecture</a>.</p>
<blockquote>
<p>Why do students get confused and think that such tools of generalization theory give some powerful technique to guide design of machine learning algorithms?</p>
</blockquote>
<p>Answer: Probably because standard presentation in lecture notes and textbooks seems to pretend that we are computationally-omnipotent beings who can <em>compute</em> VC dimension and Rademacher complexity and thus arrive at meaningful bounds on sample sizes needed for training to generalize. While this may have been possible in the old days with simple classifiers, today we have
complicated classifiers with millions of variables, which furthermore are products of nonconvex optimization techniques like backpropagation.
The only way to actually lowerbound Rademacher complexity of such complicated learning architectures is to try training a classifier, and detect lack of generalization via a held-out set. Every practitioner in the world already does this (without realizing it), and kudos to Zhang et al. for highlighting that theory currently offers nothing better.</p>
<h2 id="toward-a-prescriptive-generalization-theory-the-new-papers">Toward a prescriptive generalization theory: the new papers</h2>
<p>In our medical analogy we saw that the doctor needs to at least do a physical examination to have a prescriptive diagnosis. The authors of the new papers intuitively grasp this point, and try to identify properties of real-life deep nets that may lead to better generalization. Such an analysis (related to “margin”) was done for simple 2-layer networks a couple of decades ago, and the challenge is to find analogs for multilayer networks. Both Bartlett et al. and Neyshabur et al. home in on <a href="https://nickhar.wordpress.com/2012/02/29/lecture-15-low-rank-approximation-of-matrices/"><em>stable rank</em></a> of the weight matrices of the layers of the deep net. These can be seen as an instance of a “flat minimum” which has been discussed in <a href="http://www.bioinf.jku.at/publications/older/3304.pdf">neural nets literature</a> for many years. I will present my take on these results as well as some improvements in a future post. Note that these methods do not as yet give any nontrivial bounds on the number of datapoints needed for training the nets in question.</p>
<p><a href="https://arxiv.org/abs/1703.11008">Dziugaite and Roy</a> take a slightly different tack. They start with McAllester’s 1999 PAC-Bayes bound, which says that if the algorithm’s prior distribution on the hypotheses is $P$ then for every posterior distribution $Q$ (which could depend on the data) on the hypotheses the generalization error of the average classifier picked according to $Q$ is upper bounded as follows where $D()$ denotes <a href="https://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence">KL divergence</a>:</p>
<div style="text-align:center;">
<img style="width:600px;" src="http://www.cs.princeton.edu/courses/archive/fall17/cos597A/lecnotes/pacbayes.png" />
</div>
<p>This allows upperbounds on generalization error (specifically, upperbounds on number of samples that guarantee such an upperbound) by proceeding as in <a href="http://www.cs.cmu.edu/~jcl/papers/nn_bound/not_bound.pdf">Langford and Caruana’s old paper</a> where $P$ is a uniform gaussian, and $Q$ is a noised version of the trained deep net (whose generalization we are trying to explain). Specifically, if $w_{ij}$ is the weight of edge $\{i, j\}$ in the trained net, then $Q$ consists of adding a gaussian noise $\eta_{ij}$ to weight $w_{ij}$. Thus a random classifier according to $Q$ is nothing but a noised version of the trained net. Now we arrive at the crucial idea: Use nonconvex optimization to find a choice for the variance of $\eta_{ij}$ that balances two competing criteria: (a) the average classifier drawn from $Q$ has training error not much more than the original trained net (again, this is a quantification of the “flatness” of the minimum found by the optimization) and (b) the right hand side of the above expression is as small as possible. Assuming (a) and (b) can be suitably bounded, it follows that the average classifier from $Q$ works reasonably well on unseen data. (Note that this method only proves generalization of a noised version of the trained classifier.)</p>
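<p>The trade-off between (a) and (b) is easier to see with numbers. The sketch below makes several simplifying assumptions that are not taken from the papers: the prior is $P = N(0, \sigma_P^2 I)$, the posterior is $Q = N(w, \sigma_Q^2 I)$ with a single shared variance, and the bound is written in one common textbook form of McAllester’s inequality, $\text{train error} + \sqrt{(D(Q\|P) + \ln(m/\delta))/(2(m-1))}$, whose constants may differ from the expression displayed above. The function names and the toy weight vector are hypothetical.</p>

```python
import math

def kl_isotropic_gaussians(w, sigma_q, sigma_p):
    """KL( N(w, sigma_q^2 I) || N(0, sigma_p^2 I) ) in d = len(w) dimensions."""
    d = len(w)
    sq_norm = sum(wi * wi for wi in w)
    ratio = (sigma_q / sigma_p) ** 2
    return 0.5 * (d * ratio + sq_norm / sigma_p ** 2 - d - d * math.log(ratio))

def pac_bayes_bound(train_loss_q, kl, m, delta=0.05):
    """One textbook form of the bound (an assumption, see the text above)."""
    return train_loss_q + math.sqrt((kl + math.log(m / delta)) / (2 * (m - 1)))

# Toy trained weights; a real net would have millions of them.
w = [0.3, -0.2, 0.5, 0.1]
m = 50000

for sigma_q in (0.01, 0.1, 0.5):
    kl = kl_isotropic_gaussians(w, sigma_q, sigma_p=1.0)
    # The training error of the noised net is left as a placeholder of 0.02;
    # in practice it grows with sigma_q and has to be measured.
    print(sigma_q, kl, pac_bayes_bound(0.02, kl, m))
```

<p>Shrinking the noise makes $Q$ hug the trained net but inflates the KL term; growing it shrinks the KL term but degrades the training error of the noised net. Finding the variance that balances the two is the optimization problem described above.</p>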
<p>Applying this method on simple fully-connected neural nets trained on the MNIST dataset, they can prove that the method achieves at most $17$ percent error on MNIST (whereas the <em>actual</em> error is much lower at 2-3 percent). Hence the title of their paper, which promises <em>nonvacuous generalization bounds.</em> What I find most interesting about this result is that it uses the power of nonconvex optimization (harnessed above to find a suitable noised distribution $Q$) to cast light on one of the metaquestions about nonconvex optimization, namely, why does deep learning not overfit!</p>
Fri, 08 Dec 2017 18:00:00 +0000
http://offconvex.github.io/2017/12/08/generalization1/
http://offconvex.github.io/2017/12/08/generalization1/How to Escape Saddle Points Efficiently<p>A core, emerging problem in nonconvex optimization involves the escape of saddle points. While recent research has shown that gradient descent (GD) generically escapes saddle points asymptotically (see <a href="http://www.offconvex.org/2016/03/22/saddlepoints/">Rong Ge’s</a> and <a href="http://www.offconvex.org/2016/03/24/saddles-again/">Ben Recht’s</a> blog posts), the critical open problem is one of <strong>efficiency</strong> — is GD able to move past saddle points quickly, or can it be slowed down significantly? How does the rate of escape scale with the ambient dimensionality? In this post, we describe <a href="https://arxiv.org/abs/1703.00887">our recent work with Rong Ge, Praneeth Netrapalli and Sham Kakade</a>, that provides the first provable <em>positive</em> answer to the efficiency question, showing that, rather surprisingly, GD augmented with suitable perturbations escapes saddle points efficiently; indeed, in terms of rate and dimension dependence it is almost as if the saddle points aren’t there!</p>
<h2 id="perturbing-gradient-descent">Perturbing Gradient Descent</h2>
<p>We are in the realm of classical gradient descent (GD) — given a function $f:\mathbb{R}^d \to \mathbb{R}$ we aim to minimize the function by moving in the direction of the negative gradient:</p>
<script type="math/tex; mode=display">x_{t+1} = x_t - \eta \nabla f(x_t),</script>
<p>where $x_t$ are the iterates and $\eta$ is the step size. GD is well understood theoretically in the case of convex optimization, but the general case of nonconvex optimization has been far less studied. We know that GD converges quickly to the neighborhood of stationary points (points where $\nabla f(x) = 0$) in the nonconvex setting, but these stationary points may be local minima or, unhelpfully, local maxima or saddle points.</p>
<p>Clearly GD will never move away from a stationary point if started there (even a local maximum); thus, to provide general guarantees, it is necessary to modify GD slightly to incorporate some degree of randomness. Two simple methods have been studied in the literature:</p>
<ol>
<li>
<p><strong>Intermittent Perturbations</strong>: <a href="http://arxiv.org/abs/1503.02101">Ge, Huang, Jin and Yuan 2015</a> considered adding occasional random perturbations to GD, and were able to provide the first <em>polynomial time</em> guarantee for GD to escape saddle points. (See also <a href="http://www.offconvex.org/2016/03/22/saddlepoints/">Rong Ge’s post</a> )</p>
</li>
<li>
<p><strong>Random Initialization</strong>: <a href="http://arxiv.org/abs/1602.04915">Lee et al. 2016</a> showed that with only random initialization, GD provably avoids saddle points asymptotically (i.e., as the number of steps goes to infinity). (see also <a href="http://www.offconvex.org/2016/03/24/saddles-again/">Ben Recht’s post</a>)</p>
</li>
</ol>
<p>Asymptotic (and even polynomial-time) results are important for the general theory, but they stop short of explaining the success of gradient-based algorithms in practical nonconvex problems. And they fail to provide reassurance that runs of GD can be trusted, that is, that we won’t find ourselves in a situation in which the learning curve flattens out for an indefinite amount of time, with the user having no way of knowing that the asymptotics have not yet kicked in. Lastly, they fail to provide reassurance that GD has the kind of favorable properties in high dimensions that it is known to have for convex problems.</p>
<p>One reasonable approach to this issue is to consider second-order (Hessian-based) algorithms. Although these algorithms are generally (far) more expensive per iteration than GD, and can be more complicated to implement, they do provide the kind of geometric information around saddle points that allows for efficient escape. Accordingly, a reasonable understanding of Hessian-based algorithms has emerged in the literature, and positive efficiency results have been obtained.</p>
<p><strong><em>Is GD also efficient? Or is the Hessian necessary for fast escape of saddle points?</em></strong></p>
<p>A negative result emerges to this first question if one considers the random initialization strategy discussed. Indeed, this approach is provably <em>inefficient</em> in general, taking exponential time to escape saddle points in the worst case (see “On the Necessity of Adding Perturbations” section).</p>
<p>Somewhat surprisingly, it turns out that we obtain a rather different — and <em>positive</em> — result if we consider the perturbation strategy. To be able to state this result, let us be clear on the algorithm that we analyze:</p>
<blockquote>
<p><strong>Perturbed gradient descent (PGD)</strong></p>
<ol>
<li><strong>for</strong> $~t = 1, 2, \ldots ~$ <strong>do</strong></li>
<li>$\quad\quad x_{t} \leftarrow x_{t-1} - \eta \nabla f (x_{t-1})$</li>
<li>$\quad\quad$ <strong>if</strong> $~$<em>perturbation condition holds</em>$~$ <strong>then</strong></li>
<li>$\quad\quad\quad\quad x_t \leftarrow x_t + \xi_t$</li>
</ol>
</blockquote>
<p>Here the perturbation $\xi_t$ is sampled uniformly from a ball centered at zero with a suitably small radius, and is added to the iterate when the gradient is suitably small. These particular choices are made for analytic convenience; we do not believe that uniform noise is necessary, nor do we believe it essential that noise be added only when the gradient is small.</p>
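<p>A minimal Python sketch of PGD may help fix ideas. It is not the exact algorithm analyzed in the paper: the toy objective $f(x) = x_1^4/4 - x_1^2/2 + \sum_{i \ge 2} x_i^2/2$ (strict saddle at the origin, minima at $x_1 = \pm 1$), the gradient threshold <code>g_thresh</code>, the ball radius, and the <code>cooldown</code> between perturbations are all assumptions chosen for illustration.</p>

```python
import math
import random

def grad_f(x):
    """Gradient of the toy objective x1^4/4 - x1^2/2 + sum_{i>=2} xi^2/2."""
    return [x[0] ** 3 - x[0]] + list(x[1:])

def norm(v):
    return math.sqrt(sum(vi * vi for vi in v))

def sample_ball(rng, d, radius):
    """Uniform sample from the d-dimensional ball of the given radius."""
    g = [rng.gauss(0.0, 1.0) for _ in range(d)]
    scale = radius * rng.random() ** (1.0 / d) / norm(g)
    return [scale * gi for gi in g]

def perturbed_gd(x0, eta, steps, rng=None, radius=1e-2, g_thresh=1e-3, cooldown=20):
    """Gradient descent plus a small perturbation whenever the gradient is small.

    Passing rng=None switches the perturbations off, which gives plain GD.
    """
    x = list(x0)
    last_perturbed = -cooldown
    for t in range(steps):
        g = grad_f(x)
        x = [xi - eta * gi for xi, gi in zip(x, g)]
        small_gradient = norm(g) <= g_thresh
        if rng is not None and small_gradient and t - last_perturbed >= cooldown:
            noise = sample_ball(rng, len(x), radius)
            x = [xi + ni for xi, ni in zip(x, noise)]
            last_perturbed = t
    return x

d = 5
start = [0.0] * d   # exactly at the saddle point
x_gd = perturbed_gd(start, eta=0.1, steps=500)
x_pgd = perturbed_gd(start, eta=0.1, steps=500, rng=random.Random(0))
print(x_gd)
print(x_pgd)
```

<p>Started exactly at the saddle, plain GD never moves, while the perturbed variant drifts along the negative-curvature direction and settles near one of the two minima.</p>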
<h2 id="strict-saddle-and-second-order-stationary-points">Strict-Saddle and Second-order Stationary Points</h2>
<p>We define <em>saddle points</em> in this post to include both classical saddle points as well as local maxima. They are stationary points which are locally maximized along <em>at least one direction</em>. Saddle points and local minima can be categorized according to the minimum eigenvalue of Hessian:</p>
<script type="math/tex; mode=display">% <![CDATA[
\lambda_{\min}(\nabla^2 f(x)) \begin{cases}
> 0 \quad\quad \text{local minimum} \\
= 0 \quad\quad \text{local minimum or saddle point} \\
< 0 \quad\quad \text{saddle point}
\end{cases} %]]></script>
<p>We further call the saddle points in the last category, where $\lambda_{\min}(\nabla^2 f(x)) < 0$, <strong>strict saddle points</strong>.</p>
<p style="text-align:center;">
<img src="/assets/saddle_eff/strictsaddle.png" width="85%" alt="Strict and Non-strict Saddle Point" />
</p>
<p>While non-strict saddle points can be flat in the valley, strict saddle points require that there is <em>at least one direction</em> along which the curvature is strictly negative. The presence of such a direction gives a gradient-based algorithm the possibility of escaping the saddle point. In general, distinguishing local minima and non-strict saddle points is <em>NP-hard</em>; therefore, we — and previous authors — focus on escaping <em>strict</em> saddle points.</p>
<p>Formally, we make the following two standard assumptions regarding smoothness.</p>
<blockquote>
<p><strong>Assumption 1</strong>: $f$ is $\ell$-gradient-Lipschitz, i.e. <br />
$\quad\quad\quad\quad \forall x_1, x_2, |\nabla f(x_1) - \nabla f(x_2)| \le \ell |x_1 - x_2|$. <br />
$~$<br />
<strong>Assumption 2</strong>: $f$ is $\rho$-Hessian-Lipschitz, i.e. <br />
$\quad\quad\quad\quad \forall x_1, x_2$, $|\nabla^2 f(x_1) - \nabla^2 f(x_2)| \le \rho |x_1 - x_2|$.</p>
</blockquote>
<p>Similarly to classical theory, which studies convergence to a first-order stationary point, $\nabla f(x) = 0$, by bounding the number of iterations to find an <strong>$\epsilon$-first-order stationary point</strong>, $|\nabla f(x)| \le \epsilon$, we formulate the speed of escape of strict saddle points and the ensuing convergence to a second-order stationary point, $\nabla f(x) = 0, \lambda_{\min}(\nabla^2 f(x)) \ge 0$, with an $\epsilon$-version of the definition:</p>
<blockquote>
<p><strong>Definition</strong>: A point $x$ is an <strong>$\epsilon$-second-order stationary point</strong> if:<br />
$\quad\quad\quad\quad |\nabla f(x)|\le \epsilon$, and $\lambda_{\min}(\nabla^2 f(x)) \ge -\sqrt{\rho \epsilon}$.</p>
</blockquote>
<p>In this definition, $\rho$ is the Hessian Lipschitz constant introduced above. This scaling follows the convention of <a href="http://rd.springer.com/article/10.1007%2Fs10107-006-0706-8">Nesterov and Polyak 2006</a>.</p>
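<p>The definition translates directly into a check. The sketch below restricts itself to two dimensions so that the smallest Hessian eigenvalue has a closed form; the toy function $f(x, y) = x^4/4 - x^2/2 + y^2/2$, the value $\rho = 6$ (a Hessian-Lipschitz constant for this toy function on $|x| \le 1$), and all function names are assumptions made for illustration.</p>

```python
import math

def min_eig_sym2(a, b, c):
    """Smallest eigenvalue of the symmetric 2x2 matrix [[a, b], [b, c]]."""
    return 0.5 * (a + c) - math.sqrt((0.5 * (a - c)) ** 2 + b * b)

def is_eps_second_order_stationary(grad, hess, eps, rho):
    """grad is (g1, g2); hess is (a, b, c), standing for [[a, b], [b, c]]."""
    grad_norm = math.hypot(grad[0], grad[1])
    return grad_norm <= eps and min_eig_sym2(*hess) >= -math.sqrt(rho * eps)

def grad_f(x, y):
    """Gradient of the toy function x^4/4 - x^2/2 + y^2/2."""
    return (x ** 3 - x, y)

def hess_f(x, y):
    """Hessian of the toy function, as (a, b, c)."""
    return (3 * x * x - 1, 0.0, 1.0)

eps, rho = 1e-2, 6.0
for point in [(0.0, 0.0), (1.0, 0.0), (0.5, 0.0)]:
    ok = is_eps_second_order_stationary(grad_f(*point), hess_f(*point), eps, rho)
    print(point, ok)
```

<p>The origin has zero gradient but a Hessian eigenvalue of $-1$, so it is a first-order stationary point that fails the second-order test; the minimum at $(1, 0)$ passes both conditions.</p>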
<h3 id="applications">Applications</h3>
<p>In a wide range of practical nonconvex problems it has been proved that <strong>all saddle points are strict</strong>; such problems include, but are not limited to, principal components analysis, canonical correlation analysis,
<a href="http://arxiv.org/abs/1503.02101">orthogonal tensor decomposition</a>,
<a href="http://arxiv.org/abs/1602.06664">phase retrieval</a>,
<a href="http://arxiv.org/abs/1504.06785">dictionary learning</a>,
<!-- matrix factorization, -->
<a href="http://arxiv.org/abs/1605.07221">matrix sensing</a>,
<a href="http://arxiv.org/abs/1605.07272">matrix completion</a>,
and <a href="http://arxiv.org/abs/1704.00708">other nonconvex low-rank problems</a>.</p>
<p>Furthermore, in all of these nonconvex problems, it also turns out that <strong>all local minima are global minima</strong>. Thus, in these cases, any general efficient algorithm for finding $\epsilon$-second-order stationary points immediately becomes an efficient algorithm for solving those nonconvex problems with global guarantees.</p>
<h2 id="escaping-saddle-point-with-negligible-overhead">Escaping Saddle Point with Negligible Overhead</h2>
<p>In the classical case of first-order stationary points, GD is known to have very favorable theoretical properties:</p>
<blockquote>
<p><strong>Theorem (<a href="http://rd.springer.com/book/10.1007%2F978-1-4419-8853-9">Nesterov 1998</a>)</strong>: If Assumption 1 holds, then GD, with $\eta = 1/\ell$, finds an $\epsilon$-<strong>first</strong>-order stationary point in $2\ell (f(x_0) - f^\star)/\epsilon^2$ iterations.</p>
</blockquote>
<p>In this theorem, $x_0$ is the initial point and $f^\star$ is the function value of the global minimum. The theorem says that for any gradient-Lipschitz function, a stationary point can be found by GD in $O(1/\epsilon^2)$ steps, with no explicit dependence on $d$. This is called “dimension-free optimization” in the literature; of course the cost of a gradient computation is $O(d)$, and thus the overall runtime of GD scales as $O(d)$. The linear scaling in $d$ is especially important for modern high-dimensional nonconvex problems such as deep learning.</p>
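<p>As a sanity check on the iteration bound, the following Python sketch runs GD on a one-dimensional nonconvex function and compares the number of steps with $2\ell (f(x_0) - f^\star)/\epsilon^2$. The choice $f(x) = -\cos x$ (so that $\ell = 1$ and $f^\star = -1$), the starting point, and the function name <code>gd_until_stationary</code> are assumptions made for the example.</p>

```python
import math

def gd_until_stationary(grad, x0, eta, eps, max_iters=10**6):
    """Run GD until |grad(x)| <= eps; return the final point and the step count."""
    x, steps = x0, 0
    while abs(grad(x)) > eps and steps < max_iters:
        x -= eta * grad(x)
        steps += 1
    return x, steps

def f(x):
    return -math.cos(x)   # nonconvex, with a 1-Lipschitz gradient

grad = math.sin           # derivative of -cos
ell, f_star = 1.0, -1.0
x0, eps = 3.0, 1e-3       # start close to the local maximum at pi

x_final, steps = gd_until_stationary(grad, x0, eta=1.0 / ell, eps=eps)
bound = 2 * ell * (f(x0) - f_star) / eps ** 2
print(steps, bound)
```

<p>The bound is a worst-case guarantee, so on a benign function like this one the observed step count is far smaller than the bound.</p>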
<p>We now wish to address the corresponding problem for second-order stationary points.
What is the best we can hope for? Can we also achieve</p>
<ol>
<li>A dimension-free number of iterations;</li>
<li>An $O(1/\epsilon^2)$ convergence rate;</li>
<li>The same dependence on $\ell$ and $(f(x_0) - f^\star)$ as in (Nesterov 1998)?</li>
</ol>
<p>Rather surprisingly, the answer is <em>Yes</em> to all three questions (up to small log factors).</p>
<blockquote>
<p><strong>Main Theorem</strong>: If Assumptions 1 and 2 hold, then PGD, with $\eta = O(1/\ell)$, finds an $\epsilon$-<strong>second</strong>-order stationary point in $\tilde{O}(\ell (f(x_0) - f^\star)/\epsilon^2)$ iterations with high probability.</p>
</blockquote>
<p>Here $\tilde{O}(\cdot)$ hides only logarithmic factors; indeed, the dimension dependence in our result is only $\log^4(d)$. The theorem thus asserts that a perturbed form of GD, under an additional Hessian-Lipschitz condition, <strong><em>converges to a second-order-stationary point in almost the same time required for GD to converge to a first-order-stationary point.</em></strong> In this sense, we claim that PGD can escape strict saddle points almost for free.</p>
<p>We turn to a discussion of some of the intuitions underlying these results.</p>
<h3 id="why-do-polylogd-iterations-suffice">Why do polylog(d) iterations suffice?</h3>
<p>Our strict-saddle assumption means that there is only, in the worst case, one direction in $d$ dimensions along which we can escape. A naive search for the descent direction intuitively should take at least $\text{poly}(d)$ iterations, so why should only $\text{polylog}(d)$ suffice?</p>
<p>Consider a simple case in which we assume that the function is quadratic in the neighborhood of the saddle point. That is, let the objective function be $f(x) = \frac{1}{2} x^\top H x$, with a saddle point at zero and constant Hessian $H = \text{diag}(-1, 1, \cdots, 1)$, so that $\nabla f(x) = Hx$. In this case, only the first direction is an escape direction (with negative eigenvalue $-1$).</p>
<p>It is straightforward to work out the general form of the iterates in this case:</p>
<script type="math/tex; mode=display">x_t = x_{t-1} - \eta \nabla f(x_{t-1}) = (I - \eta H)x_{t-1} = (I - \eta H)^t x_0.</script>
<p>Assume that we start at the saddle point at zero, then add a perturbation so that $x_0$ is sampled uniformly from a ball $\mathcal{B}_0(1)$ centered at zero with radius one.
The change in the function value can be expressed as:</p>
<script type="math/tex; mode=display">f(x_t) - f(0) = \frac{1}{2} x_t^\top H x_t = \frac{1}{2} x_0^\top (I - \eta H)^t H (I - \eta H)^t x_0.</script>
<p>Set the step size to be $1/2$, let $\lambda_i$ denote the $i$-th eigenvalue of the Hessian $H$ and let $\alpha_i = e_i^\top x_0$ denote the component in the $i$th direction of the initial point $x_0$. We have $\sum_{i=1}^d \alpha_i^2 = \|x_0\|^2 \le 1$, thus:</p>
<script type="math/tex; mode=display">f(x_t) - f(0) = \frac{1}{2}\sum_{i=1}^d \lambda_i (1-\eta\lambda_i)^{2t} \alpha_i^2 \le \frac{1}{2}\left(-1.5^{2t} \alpha_1^2 + 0.5^{2t}\right).</script>
<p>A simple probability argument shows that sampling uniformly in $\mathcal{B}_0(1)$ will result in at least a $\Omega(1/d)$ component in the first direction with high probability. That is, $\alpha^2_1 = \Omega(1/d)$. Substituting $\alpha_1$ in the above equation, we see that it takes at most $O(\log d)$ steps for the function value to decrease by a constant amount.</p>
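<p>The logarithmic dependence on $d$ is easy to observe numerically. The sketch below assumes the gradient at $x$ is $Hx$ with $H = \text{diag}(-1, 1, \ldots, 1)$ (that is, $f(x) = \frac{1}{2} x^\top H x$) and step size $1/2$, draws $x_0$ uniformly from the unit ball, and counts GD steps until the function value has dropped by a constant. The target decrease of $1$ and the function name <code>steps_to_escape</code> are assumptions made for illustration.</p>

```python
import math
import random

def steps_to_escape(d, rng, eta=0.5, target_decrease=1.0, max_steps=1000):
    """GD steps until f(x_t) - f(0) <= -target_decrease, for f(x) = x^T H x / 2.

    H = diag(-1, 1, ..., 1), so one GD step multiplies the first coordinate
    by (1 + eta) and every other coordinate by (1 - eta).
    Returns max_steps if the target is not reached.
    """
    # x_0 uniform in the unit ball: Gaussian direction, radius U^(1/d).
    g = [rng.gauss(0.0, 1.0) for _ in range(d)]
    norm_g = math.sqrt(sum(gi * gi for gi in g))
    r = rng.random() ** (1.0 / d)
    x = [r * gi / norm_g for gi in g]
    for t in range(max_steps):
        value = 0.5 * (-x[0] ** 2 + sum(xi * xi for xi in x[1:]))
        if value <= -target_decrease:
            return t
        x = [(1 + eta) * x[0]] + [(1 - eta) * xi for xi in x[1:]]
    return max_steps

rng = random.Random(0)
results = {d: steps_to_escape(d, rng) for d in (10, 100, 1000, 10000)}
print(results)
```

<p>Each tenfold increase in $d$ shrinks the typical $\alpha_1^2$ by a factor of ten, which costs only a few extra multiplications by $1.5^2$, so the step counts grow roughly like $\log d$ rather than like a power of $d$.</p>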
<h3 id="pancake-shape-stuck-region-for-general-hessian">Pancake-shape stuck region for general Hessian</h3>
<p>We can conclude that for the case of a constant Hessian, only when the perturbation $x_0$ lands in the set $\{x | ~ |e_1^\top x|^2 \le O(1/d)\}$ $\cap \mathcal{B}_0 (1)$, can we take a very long time to escape the saddle point. We call this set the <strong>stuck region</strong>; in this case it is a flat disk. In general, when the Hessian is no longer constant, the stuck region becomes a non-flat pancake, depicted as a green object in the left graph. In general this region will not have an analytic expression.</p>
<p>Earlier attempts to analyze the dynamics around saddle points tried to approximate the stuck region by a flat set. This results in a requirement of an extremely small step size and a correspondingly very large runtime complexity. Our sharp rate depends on a key observation: <em>although we don’t know the shape of the stuck region, we know it is very thin</em>.</p>
<p style="text-align:center;">
<img src="/assets/saddle_eff/flow.png" width="85%" alt="Pancake" />
</p>
<p>In order to characterize the “thinness” of this pancake, we studied pairs of hypothetical perturbation points $w, u$ separated by $O(1/\sqrt{d})$ along an escaping direction. We claim that if we run GD starting at $w$ and $u$, at least one of the resulting trajectories will escape the saddle point very quickly. This implies that the thickness of the stuck region can be at most $O(1/\sqrt{d})$, so a random perturbation has very little chance to land in the stuck region.</p>
<h2 id="on-the-necessity-of-adding-perturbations">On the Necessity of Adding Perturbations</h2>
<p>We have discussed two possible ways to modify the standard gradient descent algorithm, the first by adding intermittent perturbations, and the second by relying on random initialization. Although the latter exhibits asymptotic convergence, it does not yield efficient convergence in general; in recent <a href="http://arxiv.org/abs/1705.10412">joint work with Simon Du, Jason Lee, Barnabas Poczos, and Aarti Singh</a>, we have shown that even with fairly natural random initialization schemes and non-pathological functions, <strong>GD with only random initialization can be significantly slowed by saddle points, taking exponential time to escape. The behavior of PGD is strikingly different: it can generically escape saddle points in polynomial time.</strong></p>
<p>To establish this result, we considered random initializations from a very general class including Gaussians and uniform distributions over the hypercube, and we constructed a smooth objective function that satisfies both Assumptions 1 and 2. This function is constructed such that, even with random initialization, with high probability both GD and PGD have to travel sequentially in the vicinity of $d$ strict saddle points before reaching a local minimum. All strict saddle points have only one direction of escape. (See the left graph for the case of $d=2$).</p>
<p><img src="/assets/saddle_eff/necesperturbation.png" alt="NecessityPerturbation" /></p>
<p>When GD travels in the vicinity of a sequence of saddle points, it can get closer and closer to the later saddle points, and thereby take longer and longer to escape. Indeed, the time to escape the $i$th saddle point scales as $e^{i}$. On the other hand, PGD is always able to escape any saddle point in a small number of steps independent of the history. This phenomenon is confirmed by our experiments; see, for example, an experiment with $d=10$ in the right graph.</p>
<h2 id="conclusion">Conclusion</h2>
<p>In this post, we have shown that a perturbed form of gradient descent can converge to a second-order-stationary point at almost the same rate as standard gradient descent converges to a first-order-stationary point. This implies that Hessian information is not necessary to escape saddle points efficiently, and helps to explain why basic gradient-based algorithms such as GD (and SGD) work surprisingly well in the nonconvex setting. This new line of sharp convergence results can be directly applied to nonconvex problems such as matrix sensing/completion to establish efficient global convergence rates.</p>
<p>There are of course still many open problems in general nonconvex optimization. To name a few: will adding momentum improve the convergence rate to a second-order stationary point? What type of local minima are tractable and are there useful structural assumptions that we can impose on local minima so as to avoid local minima efficiently? We are making slow but steady progress on nonconvex optimization, and there is the hope that at some point we will transition from “black art” to “science”.</p>
Wed, 19 Jul 2017 10:00:00 +0000
http://offconvex.github.io/2017/07/19/saddle-efficiency/
http://offconvex.github.io/2017/07/19/saddle-efficiency/Do GANs actually do distribution learning?<p>This post is about our new paper, which presents empirical evidence that current GANs (Generative Adversarial Nets) are quite far from learning the target distribution. Previous posts had <a href="http://www.offconvex.org/2017/03/15/GANs/">introduced GANs</a> and described <a href="http://www.offconvex.org/2017/03/30/GANs2/">new theoretical analysis of GANs</a> from <a href="https://arxiv.org/abs/1703.00573">our ICML17 paper</a>. One notable implication of our theoretical analysis was that when the discriminator size is bounded, then GANs training could appear to succeed (i.e., training objective reaches its optimum value) even if the generated distribution is discrete and has very low support; in other words, the training objective is unable to prevent even extreme <em>mode collapse</em>.</p>
<p>That paper led us (especially Sanjeev) into spirited discussions with colleagues, who wondered if this is <em>just</em> a theoretical result about potential misbehavior rather than a prediction about real-life training. After all, we’ve all seen the great pictures that GANs produce in real life, right? (Note that the theoretical result only describes a possible near-equilibrium that can arise with a certain mix of hyperparameters, and conceivably real-life training avoids that by suitable hyperparameter tuning.)</p>
<p>Our new empirical paper <a href="https://arxiv.org/abs/1706.08224v1">Do GANs actually learn the distribution? An empirical study</a> puts the issue to the test. We present empirical evidence that well-known GANs approaches do end up learning distributions of fairly low support, and thus presumably are not learning the target distribution.</p>
<p>Let’s start by imagining how large the support must be for the target distribution. For example, if the distribution is the set of all possible images of human faces (real or imagined), then these must involve all combinations of hair color/style, facial features, complexion, expression, pose, lighting, race, etc., and thus the possible set of images of faces that <em>humans will consider to be distinct</em> approaches infinity. (After all, there are billions of distinct people living on earth right now.)
GANs are trying to learn this full distribution using a finite sample of images, say <a href="http://mmlab.ie.cuhk.edu.hk/projects/CelebA.html">CelebA</a> which has $200,000$ images of celebrity faces.</p>
<p>Thus a simple sanity check for whether a GAN has truly come close to learning this distribution is to estimate how many “distinct” images it can produce. At first glance, such an estimation seems very difficult. After all, automated/heuristic measures of image similarity can be easily fooled, and we humans surely don’t have enough time to go through millions or billions of images, right?</p>
<p>Luckily, a crude estimate is possible using the simple birthday paradox, a staple of undergrad discrete math.</p>
<h2 id="birthday-paradox-test-for-size-of-the-support">Birthday paradox test for size of the support</h2>
<p>Imagine for argument’s sake that the human race were limited to a genetic diversity of a million —nature’s laws only allow this many distinct humans. How would this hard limit manifest itself in our day to day life? The birthday paradox says that if we take a random sample of a thousand people —note that most of us get to know this many people easily in our lifetimes—we’d see many <a href="https://en.wikipedia.org/wiki/Doppelg%C3%A4nger_">doppelgangers</a>. Of course, in practice the only doppelgangers we encounter happen to be identical twins.</p>
<p>Formally, the birthday paradox says that if a discrete distribution has support $N$, then a random sample of size about
$\sqrt{N}$ would be quite likely to contain a duplicate. (The name comes from its implication that if you put $23 \approx \sqrt{365}$ random people in a room, the chance that two of them have the same birthday is about $1/2$.)</p>
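<p>For a uniform distribution the collision probability has a closed form, $1 - \prod_{i=0}^{s-1}(1 - i/N)$, which is easy to evaluate. The helper below is a small sketch (the function name is ours, and uniformity over the support is an assumption) that can be used to see how the probability of a duplicate grows with the sample size $s$ for a given support size $N$.</p>

```python
def collision_probability(support_size, sample_size):
    """P(at least one duplicate) among i.i.d. uniform draws from a finite support."""
    p_all_distinct = 1.0
    for i in range(sample_size):
        p_all_distinct *= 1.0 - i / support_size
    return 1.0 - p_all_distinct

# Classic birthday setting: 23 people, 365 equally likely birthdays.
p_birthday = collision_probability(365, 23)
print(round(p_birthday, 3))  # about 0.507

# The same helper can be applied to any (support, sample) pair of interest.
for n, s in [(10**6, 1000), (10**6, 2000)]:
    print(n, s, collision_probability(n, s))
```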
<p>In the GAN setting, the distribution is continuous, not discrete. Thus our proposed birthday paradox test for GANs is as follows.</p>
<p>(a) Pick a sample of size $s$ from the generated distribution. (b) Use an automated measure of image similarity to flag the $20$ (say) most similar pairs in the sample. (c) Visually inspect the flagged pairs and check for images that a human would consider near-duplicates. (d) Repeat.</p>
<p>If this test reveals that samples of size $s$ have duplicate images with good probability, then suspect that the distribution has support size about $s^2$.</p>
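<p>Step (b) of the test is the only part that is fully automatic, and it can be sketched in a few lines. The code below is a minimal illustration, not the code used for the experiments: it assumes the samples (or their embeddings) are already available as a NumPy array, uses plain Euclidean distance, and the function name and the synthetic batch with one planted near-duplicate are made up for the example. Step (c), the visual check, still has to be done by a human.</p>

```python
import numpy as np

def most_similar_pairs(samples, k=20):
    """Return the k closest pairs as (i, j, distance) under Euclidean distance."""
    x = np.asarray(samples, dtype=float).reshape(len(samples), -1)
    sq = np.sum(x * x, axis=1)
    # Squared distances via ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a.b
    d2 = sq[:, None] + sq[None, :] - 2.0 * (x @ x.T)
    d2 = np.maximum(d2, 0.0)  # clip tiny negatives caused by round-off
    i, j = np.triu_indices(len(x), k=1)  # each unordered pair once, i < j
    order = np.argsort(d2[i, j])[:k]
    return [(int(i[t]), int(j[t]), float(np.sqrt(d2[i[t], j[t]]))) for t in order]

# Synthetic stand-in for a batch of s = 400 generated samples.
rng = np.random.default_rng(0)
batch = rng.standard_normal((400, 64))
batch[7] = batch[123] + 1e-3 * rng.standard_normal(64)  # planted near-duplicate

flagged = most_similar_pairs(batch, k=20)
print(flagged[0])  # the planted pair (7, 123) comes out first
```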
<p>Note that the test is not definitive, because the distribution could assign say a probability $10\%$ to a single image, and be uniform on a huge number of other images. Then the test would be quite likely to find a duplicate even with $20$ samples, even though the true support size is huge. But such nonuniformity (a lot of probability being assigned to a few images) is the only failure mode of the birthday paradox test calculation, and such nonuniformity would itself be considered a failure mode of GANs training. The CIFAR-10 samples below show that such nonuniformity can be severe in practice: the generator is very likely to produce one fixed image of an automobile. On CIFAR-10, this failure mode is also observed in the frog and cat classes.</p>
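<p>The $10\%$ example can be checked with a short calculation. The sketch below (function name ours) computes the probability that the single heavy image shows up at least twice among $s$ independent samples, which is a lower bound on the probability that the batch contains a duplicate.</p>

```python
def heavy_image_repeat_probability(p=0.10, s=20):
    """P(an image of probability p appears at least twice in s i.i.d. samples)."""
    never = (1.0 - p) ** s
    exactly_once = s * p * (1.0 - p) ** (s - 1)
    return 1.0 - never - exactly_once

p_dup = heavy_image_repeat_probability(0.10, 20)
print(round(p_dup, 3))  # about 0.608
```

<p>So with $20$ samples a duplicate appears more often than not, no matter how large the rest of the support is.</p>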
<h2 id="experimental-results">Experimental results.</h2>
<p>Our test was done using two datasets, CelebA (faces) and CIFAR-10.</p>
<p>For faces, we found Euclidean distance in pixel space works well as a heuristic similarity measure, probably because the samples are centered and aligned. For CIFAR-10, we pre-train a discriminative Convolutional Neural Net for the full classification problem, and use the top layer representation as an embedding of the image. Heuristic similarity is then measured as the Euclidean distance in the embedding space. Possibly these similarity measures are crude, but note that improving them can only <em>lower</em> our estimate of the support size of the distribution, since a better similarity measure can only increase the number of duplicates found. Thus our estimates below should be considered as upper bounds on the support size of the distribution.</p>
<h2 id="results--on-celeba-dataset">Results on CelebA dataset</h2>
<p>We tested the following methods, doing the birthday paradox test with Euclidean distance in pixel space as the heuristic similarity measure.</p>
<ul>
<li>DCGAN —unconditional as described in <a href="https://arxiv.org/abs/1406.2661">Goodfellow et al. 2014</a> and <a href="https://arxiv.org/abs/1511.06434">Radford et al. 2015</a></li>
<li>MIX+GAN protocol introduced in <a href="https://arxiv.org/abs/1703.00573">Arora et al.</a>, specifically, MIX+DCGAN with $3$ mixture components.</li>
<li><a href="https://arxiv.org/abs/1606.00704">Adversarially Learned Inference (ALI)</a> (or equivalently <a href="https://arxiv.org/abs/1605.09782">BiGANs</a>). (ALI is a probabilistic version of BiGANs, but their architectures are equivalent, so we only tested ALI in our experiments.)</li>
</ul>
<p>We find that with probability $\geq50\%$, a batch of about $400$ samples contains at least one pair of duplicates for both DCGAN and MIX+DCGAN. The figure below gives examples of duplicates and their nearest neighbors (that we could find) in the training set. These results suggest that the support size of the distribution is less than $400^2=160000$, which is actually lower than the diversity of the training set, but this distribution is not just memorizing the training set.</p>
<p>ALI (or BiGANs) appear to be somewhat more diverse, in that collisions appear with $50\%$ probability only with a batch size of $1000$, implying a support size of a million. This is $5$x the training set, but still much smaller than the diversity one would expect among human faces. (After all, doppelgangers don’t appear in samples of a few thousand people in real life.) (For a fair comparison, we set the discriminator of ALI (or BiGANs) to be roughly the same in size as that of the DCGAN model, since the results below suggest that the discriminator size has a strong effect on the diversity of the learnt distribution.) Nevertheless, these tests do support the suggestion that the bidirectional structure prevents some of the mode collapses observed in usual GANs.</p>
<p><img src="https://www.dropbox.com/s/7v2qbs4i82cczsy/similar_face_pairs.png?dl=1" alt="similar_face_pairs" /></p>
<h2 id="diversity-vs-discriminator-size">Diversity vs Discriminator Size</h2>
<p>The analysis of <a href="https://arxiv.org/abs/1703.00573">Arora et al.</a> suggested that the support size could be as low as near-linear in the capacity of the discriminator; in other words, there is a near-equilibrium in which a distribution of such a small support could suffice to fool the best discriminator. So it is worth investigating whether training in real life allows generator nets to exploit this “loophole” in the training that we now know is in principle available to them.</p>
<p>We built DCGANs with increasingly larger discriminators while fixing the other hyper-parameters. The discriminator used here is a 5-layer Convolutional Neural Network such that the number of output channels of each layer is $1\times,2\times,4\times,8\times\textit{dim}$ where $dim$ is chosen to be $16,32,48,64,80,96,112,128$. Thus the discriminator size should be proportional to $dim^2$. The figure below suggests that in this simple setup the diversity of the learnt distribution does indeed grow near-linearly with the discriminator size. (Note the diversity is seen to plateau, possibly because one needs to change other parameters like depth to meaningfully add more capacity to the discriminator.)</p>
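<p>The claim that discriminator size grows like $dim^2$ can be sanity-checked by counting convolution weights. The sketch below rests on assumptions that are not stated in the post: a $5\times5$ kernel, $3$ input channels, and only the four layers whose widths are listed (the final layer and all biases are left out because their shapes are not given). The function name is ours.</p>

```python
def conv_weight_count(dim, kernel=5, in_channels=3):
    """Weights in four stacked conv layers with widths dim, 2*dim, 4*dim, 8*dim."""
    # kernel size and input channel count are illustrative assumptions.
    widths = [in_channels, dim, 2 * dim, 4 * dim, 8 * dim]
    return sum(kernel * kernel * a * b for a, b in zip(widths[:-1], widths[1:]))

dims = [16, 32, 48, 64, 80, 96, 112, 128]
counts = {d: conv_weight_count(d) for d in dims}
for d in dims:
    print(d, counts[d], counts[d] / d**2)  # last column is nearly constant

ratio = counts[128] / counts[64]  # doubling dim roughly quadruples the count
```

<p>Under these assumptions the count is $k^2(3\,dim + 42\,dim^2)$, so the quadratic term dominates for every value of $dim$ in the list.</p>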
<p><img src="https://www.dropbox.com/s/zmhprwu2w2rddep/diversity_vs_size.png?dl=1" alt="diversity_vs_size" /></p>
<h2 id="results-for-cifar-10">Results for CIFAR-10</h2>
<p>On CIFAR-10, as mentioned earlier, we use a heuristic image similarity computed with a convolutional neural net with 3 convolutional layers, 2 fully-connected layers and a 10-class soft-max output, pretrained with a multi-class classification objective. Specifically, the top layer features are viewed as embeddings for the similarity test using Euclidean distance.
We found that this heuristic similarity test quickly becomes useless if the samples display noise artifacts, and thus was effective only on the very best GANs that generate the most real-looking images. For CIFAR-10
this led us to <a href="https://arxiv.org/abs/1612.04357">Stacked GAN</a>, currently believed to be the best generative model on CIFAR-10 (Inception Score $8.59$). Since this model is trained by conditioning on class label, we measure its diversity within each class separately.</p>
<p>The training set for each class has $10k$ images, but since the generator is allowed to learn from all classes, presumably it can mix and match (especially background, lighting, landscape etc.) between classes and learn a fairly rich set of images.</p>
<p>Now we list the batch sizes needed for duplicates to appear.</p>
<p><img src="https://www.dropbox.com/s/bumdhzlcrk1z97b/cifar_diversity_table.png?dl=1" alt="cifar_diversity_table" /></p>
<p>As before, we show duplicate samples as well as the nearest neighbor to the samples in training set (identified by using heuristic similarity measure to flag possibilities and confirming visually).</p>
<p><img src="https://www.dropbox.com/s/8itrpjngrc13eam/selected_similar_cifar_samples.png?dl=1" alt="similar_cifar_samples" /></p>
<p>We find that the closest image is quite different from the duplicate detected, which suggests the issue with GANs is indeed lack of diversity (low support size) rather than memorization of the training set. (See <a href="https://arxiv.org/abs/1706.08224v1">the paper</a> for more examples.)</p>
<p>Note that by and large the diversity of the learnt distribution is higher than that of the training set, but still not as high as one would expect in terms of all possible combinations.</p>
<h2 id="birthday-paradox-test-for-vaes">Birthday paradox test for VAEs</h2>
<p><img src="https://www.dropbox.com/s/p1qlgr66rnufnal/vae_collisions.png?dl=1" alt="vae_collisions" /></p>
<p>Given these findings, it is natural to wonder about the diversity of distributions learned using earlier methods such as <a href="https://arxiv.org/abs/1312.6114">Variational Auto-Encoders</a> (VAEs). Instead of using feedback from the discriminator, these methods train the generator net using feedback from an approximate perplexity calculation. Thus the analysis of <a href="https://arxiv.org/abs/1703.00573">Arora et al.</a> does not apply as is to such methods and it is conceivable they exhibit higher diversity. However, we found the birthday paradox test difficult to run since samples from a VAE trained on CelebA were not realistic or sharp enough for a human to definitively conclude whether or not two images were almost the same. The figure above shows examples of collision candidates found in batches of 400 samples; clearly some indicative parts (hair, eyes, mouth, etc.) are quite blurry in VAE samples.</p>
<h2 id="conclusions">Conclusions</h2>
<p>Our new birthday paradox test seems to suggest that some well-regarded GANs are currently learning distributions with rather low support (i.e., they suffer mode collapse). The possibility of such a scenario was anticipated in the theoretical analysis of (<a href="https://arxiv.org/abs/1703.00573">Arora et al.</a>) reported in an earlier post.</p>
<p>This combination of theory and empirics raises the open problem of how to change the GANs training to avoid such mode collapse. Possibly ALI/BiGANs point to the right direction, since they exhibit somewhat better diversity in our experiments. One should also try tuning of hyperparameter/architecture in current methods now that the birthday paradox test gives a concrete way to quantify mode collapse.</p>
<p>Finally, we should consider the possibility that the best use of GANs and related techniques could be feature learning or some other goal, as opposed to distribution learning. This needs further theoretical and empirical exploration.</p>
Fri, 07 Jul 2017 06:00:00 +0000
http://offconvex.github.io/2017/07/07/GANs3/
http://offconvex.github.io/2017/07/07/GANs3/Unsupervised learning, one notion or many?<p><em>Unsupervised learning</em>, as the name suggests, is the science of learning from unlabeled data. A look at the <a href="https://en.wikipedia.org/wiki/Unsupervised_learning">wikipedia page</a> shows that this term has many interpretations:</p>
<p><strong>(Task A)</strong> <em>Learning a distribution from samples.</em> (Examples: gaussian mixtures, topic models, variational autoencoders,..)</p>
<p><strong>(Task B)</strong> <em>Understanding latent structure in the data.</em> This is not the same as Task A; for example principal component analysis, clustering, manifold learning etc. identify latent structure but don’t learn a distribution per se.</p>
<p><strong>(Task C)</strong> <em>Feature Learning.</em> Learn a mapping from <em>datapoint</em> $\rightarrow$ <em>feature vector</em> such that classification tasks are easier to carry out on feature vectors rather than datapoints. For example, unsupervised feature learning could help lower the amount of <em>labeled</em> samples needed for learning a classifier, or be useful for <a href="https://en.wikipedia.org/wiki/Domain_adaptation"><em>domain adaptation</em></a>.</p>
<p>Task B is often a subcase of Task C, as the intended users of “structure found in data” are humans (scientists) who pore over the representation of data to gain some intuition about its properties, and these “properties” can often be phrased as a classification task.</p>
<p>This post explains the relationship between Tasks A and C, and why they get mixed up in students’ minds. We hope there is also some food for thought here for experts, namely, our discussion about the fragility of the usual “perplexity” definition of unsupervised learning. It explains why Task A doesn’t in practice lead to a good enough solution for Task C. For example, it has been believed for many years that for deep learning, unsupervised pretraining should help supervised training, but this has been hard to show in practice.</p>
<h2 id="the-common-theme-high-level-representations">The common theme: high level representations.</h2>
<p>If $x$ is a datapoint, each of these methods seeks to map it to a new “high level” representation $h$ that captures its “essence.”
This is why it helps to have access to $h$ when performing machine learning tasks on $x$ (e.g. classification).
The difficulty of course is that “high-level representation” is not uniquely defined. For example, $x$ may be an image, and $h$ may contain the information that it contains a person and a dog. But another $h$ may say that it shows a poodle and a person wearing pyjamas standing on the beach. This nonuniqueness seems inherent.</p>
<p>Unsupervised learning tries to learn high-level representation using unlabeled data.
Each method makes an implicit assumption about how the hidden $h$ relates to the visible $x$. For example, in k-means clustering the hidden $h$ consists of labeling the datapoint with the index of the cluster it belongs to.
Clearly, such a simple clustering-based representation has rather limited expressive power since it groups datapoints into disjoint classes: this limits its application for complicated settings. For example, if one clusters images according to the labels “human”, “animal” “plant” etc., then which cluster should contain an image showing a man and a dog standing in front of a tree?</p>
<p>The search for a descriptive language for talking about the possible relationships of representations and data leads us naturally to Bayesian models. (Note that these are viewed with some skepticism in machine learning theory – compared to assumptionless models like PAC learning, online learning, etc. – but we do not know of another suitable vocabulary in this setting.)</p>
<h2 id="a-bayesian-view">A Bayesian view</h2>
<p>Bayesian approaches capture the relationship between the “high level” representation $h$ and the datapoint $x$ by postulating a <em>joint distribution</em> $p_{\theta}(x, h)$ of the data $x$ and representation $h$, such that the prior $p_{\theta}(h)$ and the conditional $p_{\theta}(x \mid h)$ have a simple form as a function of the parameters $\theta$. These are also called <em>latent variable</em> probabilistic models, since $h$ is a latent (hidden) variable.</p>
<p>The standard goal in distribution learning is to find the $\theta$ that “best explains” the data (what we called Task A above). This is formalized using maximum-likelihood estimation going back to Fisher (~1910-1920): find the $\theta$ that maximizes the <em>log probability</em> of the training data. Mathematically, indexing the samples with $t$, we can write this as</p>
<script type="math/tex; mode=display">\max_{\theta} \sum_{t} \log p_{\theta}(x_t) \qquad (1)</script>
<p>where
<script type="math/tex">p_{\theta}(x_t) = \sum_{h_t}p_{\theta}(x_t, h_t).</script></p>
<p>(Note that $-\frac{1}{T}\sum_{t} \log p_{\theta}(x_t)$, where $T$ is the number of samples, is the empirical estimate of the <em>cross-entropy</em>
$-E_{x}[\log p_{\theta}(x)]$ of the distribution $p_{\theta}$, where $x$ is distributed according to $p^*$, the true distribution of the data. Thus the above method looks for the distribution with the best cross-entropy on the empirical data, which is also the log of the <a href="https://en.wikipedia.org/wiki/Perplexity"><em>perplexity</em></a> of $p_{\theta}$.)</p>
<p>As the number of samples tends to infinity, this estimator is <em>consistent</em> (converges in probability to the ground-truth value) and <em>efficient</em> (has lowest asymptotic mean-square-error among all consistent estimators). See the <a href="https://en.wikipedia.org/wiki/Maximum_likelihood_estimation">Wikipedia page</a>. (Aside: maximum likelihood estimation is often NP-hard, which is one of the reasons for the renaissance of the method-of-moments and tensor decomposition algorithms in learning latent variable models, which <a href="http://www.offconvex.org/2015/12/17/tensor-decompositions/">Rong wrote about some time ago</a>.)</p>
<h3 id="toward-task-c-representations-arise-from-the-posterior-distribution">Toward task C: Representations arise from the posterior distribution</h3>
<p>Simply learning the distribution $p_{\theta}(x, h)$ does not yield a representation <em>per se.</em> To get a representation of $x$, we need access to the posterior $p_{\theta}(h \mid x)$: then a sample from this posterior can be used as a “representation” of a data-point $x$. (Aside: Sometimes, in settings when $p_{\theta}(h \mid x)$ has a simple description, this description can be viewed as the representation of $x$.)</p>
<p>Thus solving Task C requires learning distribution parameters $\theta$ <em>and</em> figuring out how to efficiently sample from the posterior distribution.</p>
<p>Note that the sampling problem for the posterior can be #P-hard even for very simple families. The reason is that by Bayes’ law, $p_{\theta}(h \mid x) = \frac{p_{\theta}(h) p_{\theta}(x \mid h)}{p_{\theta}(x)}$. Even if the numerator is easy to calculate, as is the case for simple families, the denominator $p_{\theta}(x)$ involves a big summation (or integral) and is often hard to calculate.</p>
<p>Note that the max-likelihood parameter estimation (Task A) and approximating the posterior distributions $p(h \mid x)$ (Task C) can have radically different complexities: Sometimes A is easy but C is NP-hard (example: topic modeling with “nice” topic-word matrices, but short documents, see also <a href="https://arxiv.org/abs/1411.6156">Bresler 2015</a>); or vice versa (example: topic modeling with long documents, but worst-case chosen topic matrices, see <a href="https://arxiv.org/abs/1111.0952">Arora et al. 2011</a>).</p>
<p>Of course, one may hope (as usual) that computational complexity is a worst-case notion and may not apply in practice. But there is a bigger issue with this setup, having to do with accuracy.</p>
<h2 id="why-the-above-reasoning-is-fragile-need-for-high-accuracy">Why the above reasoning is fragile: Need for high accuracy</h2>
<p>The above description assumes that the parametric model $p_{\theta}(x, h)$ for the data was <em>exact</em> whereas one imagines it is only <em>approximate</em> (i.e., suffers from modeling error). Furthermore, computational difficulties may restrict us to approximately correct inference even if the model were exact. So in practice, we may only have an <em>approximation</em> $q(h \mid x)$ to
the posterior distribution $p_{\theta}(h \mid x)$. (Below we describe a popular method to compute such approximations.)</p>
<blockquote>
<p><em>How good of an approximation</em> to the true posterior do we need?</p>
</blockquote>
<p>Recall, we are trying to answer this question through the lens of Task C, solving some classification task. We take the following point of view:</p>
<blockquote>
<p>For $t=1, 2,\ldots,$ nature picked some $(h_t, x_t)$ from the joint distribution and presented us $x_t$. The true label $y_t$ of $x_t$ is $\mathcal{C}(h_t)$ where $\mathcal{C}$ is an unknown classifier. Our goal is to classify according to these labels.</p>
</blockquote>
<p>To simplify notation, assume the output of $\mathcal{C}$ is binary. If we wish to use
$q(h \mid x)$ as a surrogate for the true posterior $p_{\theta}(h \mid x)$, we need
$\Pr_{x_t, h_t \sim q(\cdot \mid x_t)} [\mathcal{C}(h_t) \neq y_t]$ to be small as well.</p>
<p>How close must $q(h \mid x)$ and $p(h \mid x)$ be to let us conclude this? We will use KL divergence as “distance” between the distributions, for reasons that will become apparent in the following section. We claim the following:</p>
<blockquote>
<p>CLAIM: The probability of obtaining different answers on classification tasks done using the ground truth $h$ versus the representations obtained using $q(h_t \mid x_t)$ is less than $\epsilon$ if $KL(q(h_t \mid x_t) \parallel p(h_t \mid x_t)) \leq 2\epsilon^2.$</p>
</blockquote>
<p>Here’s a proof sketch. The natural distance between these two distributions $q(h \mid x)$ and $p(h \mid x)$ with respect to accuracy of classification tasks is <em>total variation (TV)</em> distance. Indeed, if the TV distance between $q(h\mid x)$ and $p(h \mid x)$ is bounded by $\epsilon$, this implies that for any event $\Omega$,</p>
<script type="math/tex; mode=display">\left|\Pr_{h_t \sim p(\cdot \mid x_t)}[\Omega] - \Pr_{h_t \sim q(\cdot \mid x_t)}[\Omega]\right| \leq \epsilon .</script>
<p>The CLAIM now follows by instantiating this with the event $\Omega = $ “Classifier $\mathcal{C}$ outputs a different answer from $y_t$ given representation $h_t$ for input $x_t$”, and relating TV distance to KL divergence using <a href="https://en.wikipedia.org/wiki/Pinsker%27s_inequality">Pinsker’s inequality</a>, which gives</p>
<script type="math/tex; mode=display">\mbox{TV}(q(h_t \mid x_t),p(h_t \mid x_t)) \leq \sqrt{\frac{1}{2} KL(q(h_t \mid x_t) \parallel p(h_t \mid x_t))} \leq \epsilon</script>
<p>as we needed. This observation explains why solving Task A in practice does not automatically lead to very useful representations for classification tasks (Task C): the posterior distribution has to be learnt extremely accurately, which probably didn’t happen (either due to model mismatch or computational complexity).</p>
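<p>Pinsker’s inequality and the event bound used in the proof sketch are easy to check numerically for discrete distributions. The snippet below is only a sanity check on randomly drawn distributions over $10$ outcomes (the distributions, the event and the helper names are made up for illustration); it is not part of the argument.</p>

```python
import numpy as np

def total_variation(q, p):
    return 0.5 * float(np.abs(q - p).sum())

def kl_divergence(q, p):
    # KL(q || p) for strictly positive discrete distributions.
    return float(np.sum(q * np.log(q / p)))

rng = np.random.default_rng(0)
p = rng.dirichlet(np.ones(10))  # stand-in for the true posterior p(h | x)
q = rng.dirichlet(np.ones(10))  # stand-in for the approximation q(h | x)

tv = total_variation(q, p)
pinsker_bound = float(np.sqrt(0.5 * kl_divergence(q, p)))

# Any event Omega (here: a random subset of outcomes) has probabilities
# under p and q that differ by at most the TV distance.
omega = rng.random(10) < 0.5
event_gap = abs(float(p[omega].sum() - q[omega].sum()))

print(event_gap, "<=", tv, "<=", pinsker_bound)
```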
<h2 id="the--link-between-tasks-a-and-c-variational-methods">The link between Tasks A and C: variational methods</h2>
<p>As noted, distribution learning (Task A) via cross-entropy/maximum-likelihood fitting, and representation learning (Task C) via sampling the posterior are fairly distinct. Why do students often conflate the two? Because in practice the most frequent way to solve Task A does implicitly compute posteriors and thus also solves Task C.</p>
<p>The generic way to learn latent variable models involves variational methods, which can be viewed as a generalization of the famous EM algorithm (<a href="http://web.mit.edu/6.435/www/Dempster77.pdf">Dempster et al. 1977</a>).</p>
<p>Variational methods maintain at all times a <em>proposed distribution</em> $q(h | x)$ (called <em>variational distribution</em>). The methods rely on the observation that for every such $q(h \mid x)$ the following lower bound holds
\begin{equation} \log p(x) \geq E_{q(h \mid x)} \log p(x,h) + H(q(h\mid x)) \qquad (2). \end{equation}
where $H$ denotes Shannon entropy (or differential entropy, depending on whether $h$ is discrete or continuous). The RHS above is often called the <em>ELBO</em> (evidence lower bound). This inequality follows from a bit of algebra using non-negativity of KL divergence, applied to distributions $q(h \mid x)$ and $p(h\mid x)$. More concretely, the chain of equivalences is as follows,</p>
<script type="math/tex; mode=display">KL(q(h\mid x) \parallel p(h \mid x)) \geq 0 \Leftrightarrow E_{q(h|x)} \log \frac{q(h|x)}{p(h|x)} \geq 0</script>
<script type="math/tex; mode=display">\Leftrightarrow E_{q(h|x)} \log \frac{q(h|x)}{p(x,h)} + \log p(x) \geq 0</script>
<script type="math/tex; mode=display">\Leftrightarrow \log p(x) \geq E_{q(h|x)} \log p(x,h) + H(q(h\mid x))</script>
<p>Furthermore, <em>equality</em> is achieved if $q(h\mid x) = p(h\mid x)$. (This can be viewed as some kind of “duality” theorem for distributions, and dates all the way back to Gibbs.)</p>
<p>Algorithmically, observation (2) is used by forgoing the maximum-likelihood optimization (1), and solving instead</p>
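<p>Inequality (2) and its equality case can be verified directly when $h$ takes finitely many values. In the sketch below the joint $p(x,h)$ for one fixed $x$ is just a made-up positive vector whose entries sum to $p(x)$; the numbers and names are illustrative assumptions.</p>

```python
import numpy as np

def elbo(q, joint):
    """E_q[log p(x, h)] + H(q) for a discrete latent h and one fixed x."""
    return float(np.sum(q * (np.log(joint) - np.log(q))))

rng = np.random.default_rng(1)
joint = 0.3 * rng.dirichlet(np.ones(5))  # p(x, h) over 5 values of h; p(x) = 0.3
log_px = float(np.log(joint.sum()))
posterior = joint / joint.sum()          # p(h | x) by Bayes' law

q_random = rng.dirichlet(np.ones(5))     # an arbitrary variational distribution
gap_random = log_px - elbo(q_random, joint)      # equals KL(q || p(h|x)) >= 0
gap_posterior = log_px - elbo(posterior, joint)  # zero up to round-off

print(gap_random, gap_posterior)
```

<p>The gap between $\log p(x)$ and the ELBO is exactly $KL(q(h\mid x) \parallel p(h\mid x))$, which is why it vanishes at the posterior.</p>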
<script type="math/tex; mode=display">\max_{\theta, q(h_t|x_t)} \sum_{t} E_{q(h_t\mid x_t)} \log p(x_t,h_t) + H(q(h_t\mid x_t))</script>
<p>Since the variables are naturally divided into two blocks: the model parameters $\theta$, and the variational distributions $q(h_t\mid x_t)$, a natural way to optimize the above is to <em>alternate</em> optimizing over each group, while keeping the other fixed. (This meta-algorithm is often called variational EM for obvious reasons.)</p>
<p>Of course, optimizing over all possible distributions $q$ is an ill-defined problem, so $q$ is constrained to lie in some parametric family (e.g., “standard Gaussian transformed by depth $4$ neural nets of certain size and architecture”) such that the above objective can at least be easily evaluated (typically it has a closed-form expression).</p>
<p>Clearly if the parametric family of distributions is expressive enough, and the (non-convex) optimization problem doesn’t get stuck in bad local minima, then the variational EM algorithm will give us not only values of the parameters $\theta$ which are close to the ground-truth ones, but also variational distributions $q(h\mid x)$ which accurately track $p(h\mid x)$. But as we saw above, this accuracy would need to be very high to get meaningful representations.</p>
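<p>The alternating scheme is easiest to see in the classical EM special case, where $q(h_t \mid x_t)$ is unrestricted and the optimal choice in the first block is the exact posterior. The sketch below is a toy instance chosen for illustration (it is not from the post): an equal-weight mixture of two unit-variance Gaussians in one dimension, with only the two means as parameters $\theta$. The data, the initialization and the function name are assumptions.</p>

```python
import numpy as np

def em_two_gaussians(x, mu_init, iters=50):
    """EM for an equal-weight mixture of two unit-variance 1-D Gaussians."""
    mu = np.array(mu_init, dtype=float)
    log_likelihoods = []
    for _ in range(iters):
        # log p(x_t, h) for h in {0, 1}
        log_joint = (np.log(0.5) - 0.5 * np.log(2 * np.pi)
                     - 0.5 * (x[:, None] - mu[None, :]) ** 2)
        log_px = np.logaddexp(log_joint[:, 0], log_joint[:, 1])
        log_likelihoods.append(float(log_px.sum()))
        # Block 1 (E-step): q(h_t | x_t) = p(h_t | x_t), the ELBO maximizer.
        q = np.exp(log_joint - log_px[:, None])
        # Block 2 (M-step): maximize the ELBO over the means with q fixed.
        mu = (q * x[:, None]).sum(axis=0) / q.sum(axis=0)
    return mu, log_likelihoods

rng = np.random.default_rng(0)
h = rng.integers(0, 2, size=1000)                    # hidden labels
x = rng.standard_normal(1000) + np.where(h == 0, -2.0, 2.0)

mu_hat, history = em_two_gaussians(x, mu_init=[-1.0, 1.0])
print(mu_hat)  # close to the true means (-2, 2)
```

<p>Because each block maximizes the same lower bound, the log-likelihood recorded in <code>history</code> never decreases from one iteration to the next. With a restricted family for $q$ the first block is only approximate, which is where the accuracy concern raised above comes in.</p>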
<h2 id="next-post">Next Post</h2>
<p>In the next post, we will describe our recent work further clarifying this issue of representation learning via a Bayesian viewpoint.</p>
Tue, 27 Jun 2017 04:00:00 +0000
http://offconvex.github.io/2017/06/27/unsupervised1/
http://offconvex.github.io/2017/06/27/unsupervised1/Generalization and Equilibrium in Generative Adversarial Networks (GANs)<p>The <a href="http://www.offconvex.org/2017/03/15/GANs/">previous post</a> described Generative Adversarial Networks (GANs), a technique for training generative models for image distributions (and other complicated distributions) via a 2-party game between a generator deep net and a discriminator deep net. This post describes <a href="https://arxiv.org/abs/1703.00573">my new paper</a> with Rong Ge, Yingyu Liang, Tengyu Ma, and Yi Zhang. We address some fundamental issues about generalization in GANs that have been debated since the beginning; e.g., in what sense is the learnt distribution close to the target distribution, and what kind of equilibrium exists between generator and discriminator.</p>
<p>The usual analysis of GANs, sketched in the previous post, assumes “sufficiently large number of samples and sufficiently large discriminator nets” to conclude that at the end of training the learnt distribution should be close to the target distribution. Our new analysis, which accounts for the finite capacity of the discriminator net, calls this into question.</p>
<p>Readers looking for new implementation ideas can skip ahead to the section below on our <em>Mix + GAN</em> protocol. It takes other GANs codes as black box and (by adding extra capacity and corresponding training time) often improves the learnt distribution in qualitative and quantitative measures. Our testing suggests that it works well out of the box.</p>
<p><strong>Notation</strong> Assume images are represented as vectors in $\Re^d$. Typically $d$ would be $1000$ or much higher. The <em>capacity</em> of the discriminator, namely, the number of trainable parameters, is denoted $n$. The distribution on all real-life images is denoted $P_{real}$. We assume that the number of distinct images in
$P_{real}$ —regardless of how one defines “distinct”—is enormous compared to all these parameters.</p>
<p>Recall that the discriminator $D$ is trained to distinguish between samples from $P_{real}$ and samples from the generator’s distribution
$P_{synth}$. This can be formalized using different measures (leading to different GANs objectives) and for simplicity our exposition here uses the <em>distinguishing probability</em>, which is used in the <a href="https://arxiv.org/abs/1701.07875">Wasserstein GAN</a> objective:</p>
<script type="math/tex; mode=display">|~E_{x \in P_{real}}[D(x)] - E_{x \in P_{synth}}[D(x)]|\qquad (1).</script>
<p>(Readers with a background in theoretical CS and cryptography will be reminded of similar definitions in theory of pseudorandom generators.)</p>
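<p>Expression (1) can be estimated from samples. The snippet below is a minimal sketch under toy assumptions that are not in the paper: one-dimensional "images", two Gaussian distributions standing in for $P_{real}$ and $P_{synth}$, and a hand-written threshold discriminator.</p>

```python
import random


def distinguishing_probability(D, real_samples, synth_samples):
    """Monte Carlo estimate of |E_{x~P_real}[D(x)] - E_{x~P_synth}[D(x)]|."""
    mean_real = sum(D(x) for x in real_samples) / len(real_samples)
    mean_synth = sum(D(x) for x in synth_samples) / len(synth_samples)
    return abs(mean_real - mean_synth)


# Hypothetical toy data: 1-D "images" from two unit-variance Gaussians.
rng = random.Random(0)
real = [rng.gauss(0.0, 1.0) for _ in range(5000)]
synth = [rng.gauss(2.0, 1.0) for _ in range(5000)]


def threshold_discriminator(x):
    """Outputs 1 when x lies left of the midpoint between the two means, else 0."""
    return 1.0 if x < 1.0 else 0.0


gap = distinguishing_probability(threshold_discriminator, real, synth)
same = distinguishing_probability(threshold_discriminator, real, real)
print(f"different distributions: {gap:.3f}, identical samples: {same:.3f}")
```

<p>A large value means this particular discriminator tells the two distributions apart; a small value only says that <em>this</em> discriminator cannot, which is the distinction the rest of the post turns on.</p>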
<h2 id="finite-discriminators-have-limited-power">Finite Discriminators have limited power</h2>
<p>The following simple fact shows why ignoring the discriminator’s capacity constraint can lead to grossly incorrect intuition.
(The constant $C$ is fairly small and explained in the paper.)</p>
<blockquote>
<p><strong>Theorem 1</strong> Suppose the discriminator has capacity $n$. Then expression (1) is less than $\epsilon$ when $P_{synth}$ is the following: the uniform distribution on a random sample of $C n\log n/\epsilon^2$ images from $P_{real}$.</p>
</blockquote>
<p>Note that this theorem is not a formalization of the usual failure mode discussed in GANs literature, whereby the generator simply memorizes the training images. The theorem still applies if we allow the discriminator to use a large set of held out images from $P_{real}$, which are completely different from the images in $P_{synth}$, or if the training set of images is much larger than $C n\log n/\epsilon^2$ images. Furthermore, common measures of diversity/novelty used in research papers (e.g., pick a random image from $P_{synth}$ and check the “distance” to the nearest neighbor among the training set) are not guaranteed to get around the problem raised by Theorem 1.</p>
<p>Since $C n\log n/\epsilon^2$ is rather small, this theorem says that a finite-capacity discriminator is unable to even enforce that $P_{synth}$ has large <em>diversity</em>, let alone enforce that $P_{synth}\approx P_{real}$. The theorem does not imply that existing GANs do not <em>in practice</em> generate diverse distributions; merely that the current analyses give no reason to believe that they do so.</p>
<p>The proof of the Theorem is a standard sampling argument from learning theory: take an $\epsilon$-net in the continuum of all deep nets of capacity $n$ and a fixed architecture, and do a union bound. Please see the paper for details. (Aside: the “$\epsilon^2$” term in Theorem 1 arises from this argument, and is ubiquitous in ML theory.)</p>
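<p>The flavor of Theorem 1 can be seen in a toy simulation. The sketch below makes several assumptions that are not in the paper: $P_{real}$ is uniform on $[0,1]$, the "discriminators" are a small finite family of threshold functions rather than deep nets, the constant <code>C=1.0</code> is a placeholder, and the subsample size of 2000 is chosen by hand rather than from the formula.</p>

```python
import math
import random


def subsample_size(n, eps, C=1.0):
    """Size C * n * log(n) / eps^2 from Theorem 1; C=1.0 is only a placeholder."""
    return math.ceil(C * n * math.log(n) / eps ** 2)


def max_gap(thresholds, sample):
    """Largest |E_real[D_t] - E_synth[D_t]| over D_t(x) = 1[x > t].

    P_real is assumed uniform on [0, 1], so E_real[D_t] = 1 - t exactly.
    P_synth is the uniform distribution on the finite list `sample`.
    """
    m = len(sample)
    worst = 0.0
    for t in thresholds:
        e_synth = sum(1 for x in sample if x > t) / m
        worst = max(worst, abs((1.0 - t) - e_synth))
    return worst


rng = random.Random(0)
thresholds = [i / 50 for i in range(51)]       # a small discriminator family
sample = [rng.random() for _ in range(2000)]   # support of P_synth
gap = max_gap(thresholds, sample)
print(f"best distinguishing probability over the family: {gap:.4f}")
print("Theorem 1 size for n=1000, eps=0.1:", subsample_size(1000, 0.1))
```

<p>Here $P_{synth}$ is supported on only 2000 points while $P_{real}$ has infinite support, yet no discriminator in the family separates them by more than a few percent.</p>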
<p>Motivated by this theorem, we argue in the paper that the correct way to think about generalization for GANs is not the usual distance functions between distributions such as Jensen-Shannon or Wasserstein, but a new distance we define called <em>Neural net distance</em>. The neural net distance measures the ability of finite-capacity deep nets to distinguish the distributions. It can be small even when the other distances are large (as illustrated in the above theorem).</p>
<h3 id="corollary-larger-training-sets-have-limited-utility">Corollary: Larger training sets have limited utility</h3>
<p>In fact Theorem 1 has the following rephrasing. Suppose we have a very large training set of images. If the discriminator has capacity $n$, then it suffices to take a subsample of size $C n\log n/\epsilon^2$ from this training set, and we are guaranteed that GANs training using this subsample is capable of achieving a training objective that is within $\epsilon$ of the best achieved with the full training set. (Any more samples can only improve the training objective by at most $\epsilon$.)</p>
<h2 id="existence-of-equilibrium">Existence of Equilibrium</h2>
<p>Let’s recall the objective used in GAN (for simplicity, we again stick with the Wasserstein GAN):</p>
<p><script type="math/tex">\min_{G} \max_{D}~~E_{x\sim P_{real}}[D(x)] - E_{h}[D(G(h))] \qquad (2)</script>
where $G$ is the generator net, and $P_{synth}$ is the distribution of $G(h)$ where $h$ is a random seed. Researchers have noted that this is implicitly a $2$-person game and it may not have an equilibrium; e.g., see the discussion around Figure 22 in <a href="https://arxiv.org/pdf/1701.00160.pdf">Goodfellow’s survey</a>. An <em>equilibrium</em> corresponds to a $G$ and a $D$ such that the pair are still a solution if we switch the order of min and max in (2). (That is, $G$ doesn’t have incentive to switch in response to $D$, and vice versa.) Lack of equilibrium may be a cause of oscillatory behavior observed in training.</p>
<p>But ideally we wish to show something stronger than mere <em>existence</em> of equilibrium: we wish to exhibit an equilibrium where the generator <em>wins</em>, with the objective above at zero or close to zero (in other words, discriminator is unable to distinguish between the two distributions).</p>
<p>We will prove existence of an $\epsilon$-<em>approximate equilibrium</em>, whereby switching the order of $G, D$ affects the expression by at most $\epsilon$. (That is, $G$ has only limited incentive to switch in response to $D$ and vice versa.) Naively one would imagine that proving such a result involves some insight into the distribution $P_{real}$ but surprisingly none is needed.</p>
<blockquote>
<p><strong>Theorem 2</strong> If a generator net of capacity $T$ is able to generate a Gaussian distribution in $\Re^d$, then there exists an $\epsilon$-approximate equilibrium in the game where the generator has capacity $O(n T\log n/\epsilon^2 )$.</p>
</blockquote>
<p><em>Proof sketch:</em> A classical result in nonparametric statistics states that $P_{real}$ can be well-approximated by an <em>infinite</em> mixture of standard Gaussians. Now take a sample of size $O(n\log n/\epsilon^2)$ from this infinite mixture, and let $G$ be a uniform mixture on this finite sample of Gaussians. By an argument similar to Theorem 1, the distribution output by $G$ will be indistinguishable from $P_{real}$ by every deep net of capacity $n$. Finally, fold in this mixture of $O(n\log n/\epsilon^2)$ Gaussians into a single generator by using a small “selector” circuit that selects between these with the correct probability.</p>
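<p>The construction in the proof sketch, a uniform mixture of Gaussians behind a "selector", can be written down directly. The following is only an illustrative sketch: the dimension, the number of components, and the randomly drawn centers are hypothetical stand-ins, and nothing here is a deep net.</p>

```python
import random


def make_mixture_generator(centers, sigma=1.0):
    """Generator for a uniform mixture of Gaussians N(center, sigma^2 I)."""
    def G(rng):
        center = rng.choice(centers)  # "selector": pick a component uniformly
        return [c + sigma * rng.gauss(0.0, 1.0) for c in center]
    return G


rng = random.Random(0)
d, m = 3, 8  # hypothetical dimension and number of mixture components
centers = [[rng.uniform(-1.0, 1.0) for _ in range(d)] for _ in range(m)]
G = make_mixture_generator(centers, sigma=0.1)
samples = [G(rng) for _ in range(5)]
```

<p>In the theorem the number of components is $O(n\log n/\epsilon^2)$ and each component is itself produced by a net of capacity $T$, which is where the overall capacity bound comes from.</p>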
<p>This theorem only shows <em>existence</em> of a particular equilibrium. What a GAN may actually find in practice using backpropagation is not addressed.</p>
<p>Finally, if we are interested in objectives other than Wasserstein GAN, then a similar proof can show the existence of an
$\epsilon$-approximate <em>mixed</em> equilibrium, namely, where the discriminator and generator are themselves small mixtures of
deep nets.</p>
<p><em>Aside:</em> The sampling idea in this proof goes back to <a href="http://dl.acm.org/citation.cfm?id=195447">Lipton and Young 1994</a>.
Similar ideas have also appeared in study of pseudorandomness (see <a href="http://ieeexplore.ieee.org/document/5231258/citations">Trevisan et al 2009</a>) and model criticism (see <a href="http://www.jmlr.org/papers/volume13/gretton12a/gretton12a.pdf">Gretton et al 2012</a>.)</p>
<h2 id="mix--gan-protocol">MIX + GAN protocol</h2>
<p>Our theory shows that using a mixture of (not too many) generators and discriminators guarantees existence of approximate equilibrium. This suggests that GANs training may be better and more stable if we replace the simple generator and discriminator with mixtures of generators and discriminators.</p>
<p>Of course, it is impractical to use very large mixtures, so we propose <strong>MIX + GAN</strong>: use a mixture of $k$ components, where $k$ is as large as allowed by size of GPU memory. Namely, train a mixture of $k$ generators $\{G_{u_i}, i\in [k]\}$ and $k$ discriminators $\{D_{v_i}, i\in [k]\}$. All components of the mixture share the same network architecture but have their own trainable parameters. Maintaining a mixture means of course maintaining a weight $w_{u_i}$ for the generator $G_{u_i}$ which corresponds to the probability of selecting the output of $G_{u_i}$. These weights are also updated via backpropagation. This heuristic can be combined with existing
methods like DC-GAN, W-GAN etc., giving us new training methods MIX + DC-GAN, MIX + W-GAN etc.</p>
<p>Some other experimental details: store the logarithms of the mixture probabilities and use <a href="https://users.soe.ucsc.edu/~manfred/pubs/J36.pdf">exponentiated gradient</a> to update them. Use an entropy regularizer to prevent collapse of the mixture to a single component. All of these are theoretically justified if $k$ were very large, and are only heuristic when $k$ is as small as $5$.</p>
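<p>The bookkeeping for the mixture weights might look roughly as follows. This is a generic sketch of a log-space exponentiated-gradient update with an entropy term, not our actual training code; the function names, the learning rate, the regularization coefficient, and the example gradient are all made up for illustration.</p>

```python
import math


def normalize_log_weights(log_w):
    """Subtract log-sum-exp so that the weights exp(log_w) sum to 1."""
    m = max(log_w)
    lse = m + math.log(sum(math.exp(v - m) for v in log_w))
    return [v - lse for v in log_w]


def entropy(w):
    """Shannon entropy (in nats) of a probability vector."""
    return -sum(p * math.log(p) for p in w if p > 0.0)


def eg_step(log_w, grad, lr, entropy_coef=0.0):
    """One exponentiated-gradient step on the loss L(w) - entropy_coef * H(w).

    grad[i] is dL/dw_i; the regularizer adds entropy_coef * (log w_i + 1).
    The update w_i <- w_i * exp(-lr * g_i) / Z is additive in log space.
    """
    total = [g + entropy_coef * (lw + 1.0) for g, lw in zip(grad, log_w)]
    return normalize_log_weights([lw - lr * g for lw, g in zip(log_w, total)])


k = 5
log_w = normalize_log_weights([0.0] * k)   # start from the uniform mixture
grad = [1.0, 0.0, 0.0, 0.0, 0.0]           # hypothetical: component 0 raises the loss most
log_w = eg_step(log_w, grad, lr=0.5, entropy_coef=0.1)
weights = [math.exp(v) for v in log_w]
print([round(w, 3) for w in weights], "entropy:", round(entropy(weights), 3))
```

<p>Working in log space keeps the weights positive without any projection step, and the entropy term pushes the mixture back toward uniform whenever one component starts to dominate.</p>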
<p>We show that MIX+GAN can improve performance qualitatively (i.e., the images look better) and also quantitatively using popular measures such as Inception score.</p>
<div style="text-align:center;">
<img style="width:375px;" src="http://www.cs.princeton.edu/~arora/img/celeA_dcgan.png" /> $\quad$ <img style="width:375px;" src="http://www.cs.princeton.edu/~arora/img/celeA_mixgan.png" />
</div>
<p>Note that using a mixture increases the capacity of the model by a factor $k$, so it may not be entirely fair to compare the performance of MIX + X with X. On the other hand, in general it is not easy to get substantial performance benefit from increasing deep net capacity (in fact obvious ways of adding capacity that we tried actually reduced performance) whereas here the benefit happens out of the box.</p>
<p>Note that a mixture of generators or discriminators has been used in several recent works (cited in our paper), but we are not aware of any attempts to use a trainable mixture as above.</p>
<h2 id="take-away-lessons">Take-Away Lessons</h2>
<p>Complete understanding of GANs is challenging since we cannot even fully analyse simple backpropagation, let alone backpropagation combined with game-theoretic complications.</p>
<p>We therefore set aside issues of algorithmic convergence and focused on generalization and equilibrium, which concern the maximum value of the objective. Our analysis suggests the following:</p>
<p>(a) Current GANs training uses a finite capacity deep net to distinguish between synthetic and real distributions. This training criterion by itself seems insufficient to ensure even good <em>diversity</em> in the synthetic distribution, let alone that it is actually very close to $P_{real}$. (Theorem 1) A first step to fix this would be to focus on ways to ensure higher diversity, which is a necessary step towards ensuring $P_{synth} \approx P_{real}$.</p>
<p>(b) Our results seem to pose a conundrum about the GANs idea which I personally have not been able to resolve. Usually, we believe that adding capacity to the generator allows it to gain representation power to model more fine-grained facts about the world and thus produce more realistic and diverse distributions. The downside to adding capacity is <em>overfitting</em>, which can be mitigated using more training images. Thus one imagines that the ideal configuration is:</p>
<blockquote>
<p>Number of training images > Generator capacity > Discriminator capacity.</p>
</blockquote>
<p>Theorem 1 suggests that if discriminator has capacity $n$ then it seems to derive very little benefit (at least in terms of the training objective) from a training set of more than $C (n\log n)/\epsilon^2$ images. Furthermore, there exist equilibria where the generator’s distribution is not too diverse.</p>
<p>So how can we change GANs training so that it ensures $P_{synth}$ has high diversity? Some possibilities are:
(a) Cap the generator capacity to be much below discriminator capacity. This might work but I don’t see a mathematical reason why. It certainly flies against the usual intuition that, so long as the training dataset is large enough, more capacity allows generators to produce more realistic images. (b) High diversity results from some as-yet unknown property of the backpropagation algorithm. (c) Change the GANs setup in some other way.</p>
<p>At the very least our paper suggests that an explanation for good performance in GANs must draw upon some delicate interplay of the power of generator vs discriminator and the backpropagation algorithm. This fact was overlooked in previous analyses which assumed discriminators of infinite capacity.</p>
<p>(<em>I thank Moritz Hardt, Kunal Talwar, and Luca Trevisan for their comments and help with references.</em>)</p>
Thu, 30 Mar 2017 18:00:00 +0000
http://offconvex.github.io/2017/03/30/GANs2/
http://offconvex.github.io/2017/03/30/GANs2/Generative Adversarial Networks (GANs), Some Open Questions<p>Since the ability to generate “realistic-looking” data may be a step towards understanding its structure and exploiting it, generative models are an important component of unsupervised learning, which has been a frequent theme on this blog. Today’s post is about Generative Adversarial Networks (GANs), introduced in 2014 by <a href="https://arxiv.org/abs/1406.2661">Goodfellow et al.</a>, which have quickly become a very popular way to train generative models for complicated real-life data. It involves a game-theoretic tussle between a generator player and a discriminator player, which is very attractive and may be useful in other settings.</p>
<p>This post describes GANs and raises some open questions about them. The next post will describe <a href="https://arxiv.org/abs/1703.00573">our recent paper</a> addressing these questions.</p>
<p>A generative model $G$ can be seen as taking a random seed $h$ (say, a sample from a multivariate Normal distribution) and converting it into an output string $G(h)$ that “looks” like a real datapoint. Such models are popular in classical statistics but the simpler ones like Gaussian Mixtures or Dirichlet Processes seem insufficient for modeling complicated distributions on natural images or natural language. Generative models are also popular in statistical physics, e.g., Ising models and their cousins. These physics models migrated into machine learning and neuroscience in the 1980s and 1990s, which led to a new generative view of neural nets (e.g., Hinton’s <a href="https://en.wikipedia.org/wiki/Restricted_Boltzmann_machine">Restricted Boltzmann Machines</a>) which in turn led to multilayer generative models such as stacked denoising autoencoders and variational autoencoders. At their heart, these are nothing but multilayer neural nets that transform the random seed into an output that looks like a realistic image. The primary differences in the model concern details of training. Here is the obligatory set of generated images (source: <a href="https://openai.com/blog/generative-models/">OpenAI blog</a>)</p>
<div style="text-align:center;">
<img style="height:350px;" src="https://openai.com/assets/research/generative-models/gans-2-6345b04cb02f720a95ea4cb9483e2fd5a5f6e46ec6ea5bbefadf002a010cda82.jpg" />
</div>
<h2 id="gans-the-basic-framework">GANs: The basic framework</h2>
<p>GANs also train a deep net $G$ to produce realistic images, but the new and beautiful twist lies in a novel training procedure.</p>
<p>To understand the new twist let’s first discuss what it could mean for the output to “look” realistic. A classic evaluation for generative models is <a href="https://en.wikipedia.org/wiki/Perplexity"><em>perplexity</em></a>: a measure of the amount of probability it gives to actual images. This requires that the generative model must be accompanied by an algorithm that computes the probability density function for the generated distribution (i.e., given any image, it must output an estimate of the probability that the model outputs this image.) I might do a future blog post discussing pros and cons of the perplexity measure, but today let’s instead dive straight to GANs, which sidestep the need for perplexity computations.</p>
<blockquote>
<p><strong>Idea 1:</strong> Since deep nets are good at recognizing images —e.g., distinguishing pictures of people from pictures of cats—why not let a deep net be the judge of the outputs of a generative model?</p>
</blockquote>
<p>More concretely, let $P_{real}$ be the distribution over real images, and $P_{synth}$ the one output by the model (i.e., the distribution of $G(h)$ when $h$ is a random seed). We could try to train a discriminator deep net $D$ that maps images to numbers in $[0,1]$ and tries to discriminate between these distributions in the following sense: its
expected output $E_{x}[D(x)]$ should be as high as possible when $x$ is drawn from $P_{real}$ and
as low as possible when $x$ is drawn from $P_{synth}$. This training can be done with the <a href="http://www.offconvex.org/2016/12/20/backprop/">usual backpropagation</a>. If the two distributions are identical then of course no such deep net can exist, and so the training will end in failure. If on the other hand we are able to train a good discriminator deep net (one whose average output is noticeably different between real and synthetic samples) then this is proof positive that the two distributions are different. (There is an in-between case, whereby the distributions are different but the discriminator net doesn’t detect a difference. This is going to be important in the story in the next post.) A natural next question is whether the ability to train such a discriminator deep net can help us improve the generative model.</p>
<blockquote>
<p><strong>Idea 2:</strong> If a good discriminator net has been trained, use it to provide “gradient feedback” that improves the generative model.</p>
</blockquote>
<p>Let $G$ denote the Generator net, which means that samples in $P_{synth}$ are obtained by sampling a random Gaussian seed $h$ and computing $G(h)$. The natural goal for the generator is to make $E_{h}[D(G(h))]$ as high as possible, because that means it is doing better at fooling the discriminator $D$. So if we fix $D$ the natural way to improve $G$ is to pick a few random seeds $h$, and slightly adjust the trainable parameters of $G$ to increase this objective. Note that this gradient computation involves backpropagation through the composed net $D(G(\cdot))$.</p>
<p>Of course, if we let the generator improve itself, it also makes sense to then let the discriminator improve itself too, which leads to:</p>
<blockquote>
<p><strong>Idea 3:</strong> Turn the training of the generative model into a game of many moves or alternations.</p>
</blockquote>
<p>Each move for the discriminator consists of taking a few samples from $P_{real}$ and $P_{synth}$ and improving its ability to discriminate between them. Each move for the generator consists of producing a few samples from $P_{synth}$ and updating its parameters so that $E_{h}[D(G(h))]$ goes up a bit.</p>
<p>Notice, the discriminator always uses the generator as a black box —i.e., never examines its internal parameters —whereas the generator needs the discriminator’s parameters to compute its gradient direction. Also, the generator does not ever use real images from $P_{real}$ for its computation. (Though of course it does rely on the real images indirectly since the discriminator is trained using them.)</p>
<h2 id="gans-more-details">GANs: More details</h2>
<p>One can fill in the above framework in multiple ways. The most obvious is that the generator could try to maximize $E_{h}[f(D(G(h)))]$ where $f$ is some increasing function. (We call this the <em>measuring function.</em>) This has the effect of giving different importance to different samples. Goodfellow et al. originally used $f(x)=\log (x)$, which, since the derivative of $\log x$ is $1/x$, implicitly gives much more importance to synthetic data $G(h)$ where the discriminator outputs very low values $D(G(h))$. In other words, using $f(x) =\log x$ makes the training more sensitive to instances which the discriminator finds terrible than to instances which the discriminator finds so-so. By contrast, the above sketch implicitly used $f(x) =x$, which gives the same importance to all samples and appears in the recent <a href="https://arxiv.org/abs/1701.07875">Wasserstein GAN</a>.</p>
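<p>To make the "different importance" point concrete, the tiny sketch below tabulates the derivative $f'(D)$, which is the factor by which a sample's gradient is scaled, for the two measuring functions. The discriminator outputs listed are hypothetical.</p>

```python
def weight_linear(d):
    """Derivative of f(x) = x at x = d: every sample counts equally."""
    return 1.0


def weight_log(d):
    """Derivative of f(x) = log x at x = d, which is 1 / d."""
    return 1.0 / d


discriminator_outputs = [0.01, 0.1, 0.5, 0.9]  # hypothetical values of D(G(h))
for d in discriminator_outputs:
    print(f"D={d:<5} linear weight={weight_linear(d):.1f} log weight={weight_log(d):.1f}")
```

<p>With $f(x)=\log x$, a sample the discriminator scores at $0.01$ is weighted fifty times more heavily than one it scores at $0.5$.</p>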
<p>The discussion thus leads to the following mathematical formulation, where $D, G$ are deep nets with specified architecture and whose number of parameters is fixed in advance by the algorithm designer.</p>
<script type="math/tex; mode=display">\min_{G} \max_{D}~~E_{x\sim P_{real}}[f(D(x))] + E_{h}[f(1-D(G(h)))]. \qquad (1)</script>
<p>There is now a big industry of improving this basic framework using various architectures and training variations, e.g. (a random sample; possibly missing some important ones): <a href="https://arxiv.org/abs/1511.06434v2">DC-GAN</a>, <a href="https://arxiv.org/abs/1612.04357">S-GAN</a>, <a href="https://arxiv.org/abs/1609.04802">SR-GAN</a>, <a href="https://arxiv.org/abs/1606.03657">INFO-GAN</a>, etc.</p>
<p>Usually, the training is continued until the generator wins, meaning the discriminator’s expected output on samples from $P_{real}$ and $P_{synth}$ becomes the same. But a serious practical difficulty is that training in practice is oscillatory, and the above objective is observed to go up and down. This is unlike usual deep net training, where training (at least in cases where it works) steadily improves the objective.</p>
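<p>Oscillation of this kind already shows up in a deliberately oversimplified one-dimensional game. The sketch below is not the GAN objective above: it assumes $P_{real}$ has mean <code>mu</code>, a generator that outputs $h+\theta$ for a zero-mean seed $h$, and a linear discriminator $D(x)=wx$ with $f(x)=x$, so that the objective collapses to $w(\mu-\theta)$. All names and step sizes are hypothetical.</p>

```python
def alternating_play(mu=1.0, theta=0.0, w=0.0, lr=0.1, steps=200):
    """Alternating gradient moves on the toy objective V(w, theta) = w * (mu - theta).

    Assumed setup (not from the post): E[D(real)] - E[D(G(h))] = w * (mu - theta)
    for a linear discriminator D(x) = w * x and generator G(h) = h + theta.
    """
    history = []
    for _ in range(steps):
        w += lr * (mu - theta)   # discriminator move: ascend V in w
        theta += lr * w          # generator move: descend V in theta (dV/dtheta = -w)
        history.append(w * (mu - theta))
    return history


objective = alternating_play()
sign_changes = sum(1 for a, b in zip(objective, objective[1:]) if a * b < 0)
print("objective changed sign", sign_changes, "times in", len(objective), "moves")
```

<p>The two players circle the equilibrium $(w,\theta)=(0,\mu)$ instead of settling on it, so the objective keeps going up and down even though each individual move is a sensible gradient step.</p>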
<h2 id="gans-some-open-questions">GANs: Some open questions</h2>
<p>(a) <em>Does an equilibrium exist?</em></p>
<p>Since GAN is a 2-person game, the oscillatory behavior mentioned above is not unexpected. Just as a necessary condition for gradient descent to come to a stop is that the current point is a stationary point (i.e., the gradient is zero), the corresponding situation in a 2-person game is an <em>equilibrium</em>: each player’s move happens to be its optimal response to the other’s move. In other words, switching the order of $\min$ and $\max$ in expression (1) doesn’t change the objective. The GAN formulation above needs a so-called pure equilibrium, which may not exist in general. A simple example is the classic rock/paper/scissors game. Regardless of whether one player plays rock, paper or scissors as a move, the other can counter with a move that beats it. Thus no pure equilibrium exists.</p>
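<p>The rock/paper/scissors example can be checked mechanically: with pure strategies, the value depends on who commits first, so swapping $\min$ and $\max$ changes the answer. A small sketch, using the standard $+1/0/-1$ payoffs:</p>

```python
# Payoff to the row player: rows and columns are (rock, paper, scissors).
PAYOFF = [
    [0, -1, 1],   # rock: ties rock, loses to paper, beats scissors
    [1, 0, -1],   # paper
    [-1, 1, 0],   # scissors
]

# Row player commits to a pure move first; column player best-responds.
max_min = max(min(row) for row in PAYOFF)
# Column player commits first; row player best-responds.
min_max = min(max(PAYOFF[i][j] for i in range(3)) for j in range(3))

# Against the uniform mixed strategy every pure reply earns 0 on average.
uniform_value = [sum(PAYOFF[i][j] for i in range(3)) / 3 for j in range(3)]

print("max-min:", max_min, "min-max:", min_max, "vs uniform mix:", uniform_value)
```

<p>The gap between $-1$ and $+1$ is exactly the absence of a pure equilibrium; the uniform <em>mixed</em> strategy closes it, which is why mixtures appear in the next post.</p>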
<p>(b) <em>Does an equilibrium exist where the generator wins, i.e. discriminator ends up unable to distinguish the two distributions on finite samples?</em></p>
<p>(c) <em>Suppose the generator wins. What does this say about whether or not</em> $P_{real}$ <em>is close to</em> $P_{synth}$ ?</p>
<p>Question (c) has dogged GANs research from the start. Has the generative model actually learned something meaningful about real life images, or is it somehow memorizing existing images and presenting trivial modifications? (Recall that $G$ is never exposed directly to real images, so any “memorizing” has to happen via the gradient propagated through the discriminator.)</p>
<p>If generator’s win does indeed say that $P_{real}$ and $P_{synth}$ are close then we think of the GANs training as <em>generalizing.</em> (This by analogy to the usual notion of generalization for supervised learning.)</p>
<p>In fact, the next post will show that this issue is indeed more subtle than hitherto recognized. But to complete the backstory I will summarize how this issue has been studied so far.</p>
<h2 id="past-efforts-at-understanding-generalization">Past efforts at understanding generalization</h2>
<p>The original paper of Goodfellow et al. introduced an analysis of generalization (adopted since by other researchers) that works when deep nets are trained with “sufficiently high capacity, samples and training time” (to use their phrasing).</p>
<p>For the original objective function with $f(x) =\log x$ if the optimal discriminator is allowed to be any function at all (i.e., not just one computable by a finite capacity neural net) it can be checked that the optimal choice is $D(x) = P_{real}(x)/(P_{real}(x)+P_{synth}(x))$.
Substituting this in the GANs objective, up to linear transformation the maximum value achieved by discriminator turns out to be
equivalent to the <a href="https://en.wikipedia.org/wiki/Jensen%E2%80%93Shannon_divergence">Jensen-Shannon (JS) divergence</a> between $P_{real}$ and $P_{synth}$.
Hence if a generator wins the game against this <em>ideal</em> discriminator on a <em>very large</em> number of samples, then $P_{real}$ and $P_{synth}$ are close in JS divergence, and thus the model has learnt the true distribution.</p>
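<p>This calculation is easy to verify numerically on a finite domain. The sketch below uses two made-up distributions on three "images" and checks that plugging the optimal discriminator into $E_{x\sim P_{real}}[\log D(x)] + E_{x\sim P_{synth}}[\log(1-D(x))]$ gives $2\,\mathrm{JS}(P_{real},P_{synth}) - \log 4$.</p>

```python
import math


def gan_value(D, p_real, p_synth):
    """E_{x~P_real}[log D(x)] + E_{x~P_synth}[log(1 - D(x))] on a finite domain."""
    return sum(p * math.log(d) + q * math.log(1.0 - d)
               for d, p, q in zip(D, p_real, p_synth))


def kl(r, s):
    """Kullback-Leibler divergence in nats; assumes strictly positive entries."""
    return sum(a * math.log(a / b) for a, b in zip(r, s))


def js_divergence(p, q):
    """Jensen-Shannon divergence in nats."""
    mix = [(a + b) / 2.0 for a, b in zip(p, q)]
    return 0.5 * kl(p, mix) + 0.5 * kl(q, mix)


p_real = [0.5, 0.3, 0.2]    # hypothetical distributions on three "images"
p_synth = [0.2, 0.3, 0.5]
d_star = [p / (p + q) for p, q in zip(p_real, p_synth)]

best = gan_value(d_star, p_real, p_synth)
predicted = 2.0 * js_divergence(p_real, p_synth) - math.log(4.0)
blind = gan_value([0.5, 0.5, 0.5], p_real, p_synth)  # a discriminator that cannot tell
print(f"value at D*: {best:.6f}, 2*JS - log 4: {predicted:.6f}, blind D: {blind:.6f}")
```

<p>The "linear transformation" mentioned above is the factor of $2$ and the shift by $\log 4$; a discriminator that always outputs $1/2$ achieves exactly $-\log 4$, the value when the JS divergence is zero.</p>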
<p>A similar analysis for <a href="https://arxiv.org/abs/1701.07875">Wasserstein GANs</a> shows that if the generator wins using the Wasserstein objective (i.e., $f(x) =x$) then the two distributions are close in <a href="https://en.wikipedia.org/wiki/Wasserstein_metric">Wasserstein or earth-mover distance</a>.</p>
<p>But we will see in the next post that these analyses can be misleading because in practice, deep nets have (very) finite capacity and
sample size. Thus even if training produces the optimal discriminator, the above analyses can be very off.</p>
<h2 id="further-resources">Further resources</h2>
<p>OpenAI has a <a href="https://openai.com/blog/generative-models/">brief survey of recent approaches</a> to generative models. The <a href="http://www.inference.vc/">inFERENCe blog</a> has many articles on GANs.</p>
<p><a href="https://arxiv.org/pdf/1701.00160.pdf">Goodfellow’s survey</a> is the most authoritative account of this burgeoning field, and gives tons of insight. The text around Figure 22 discusses oscillation and lack of equilibria.
He also discusses how GANs trained on a broad spectrum of images seem to get confused and output images that are realistic at the micro level but nonsensical overall; e.g., an animal with a leg coming out of its head. Clearly this field, despite its promise, has many open questions!</p>
Wed, 15 Mar 2017 13:00:00 +0000
http://offconvex.github.io/2017/03/15/GANs/
http://offconvex.github.io/2017/03/15/GANs/Back-propagation, an introduction<p>Given the sheer number of backpropagation tutorials on the internet, is there really a need for another? One of us (Sanjeev) recently taught backpropagation in <a href="https://www.cs.princeton.edu/courses/archive/fall16/cos402/">undergrad AI</a> and couldn’t find any account he was happy with. So here’s our exposition, together with some history and context, as well as a few advanced notions at the end. This article assumes the reader knows the definitions of gradients and neural networks.</p>
<h2 id="what-is-backpropagation">What is backpropagation?</h2>
<p>It is the basic algorithm in training neural nets, apparently independently rediscovered several times in the 1970-80’s (e.g., see Werbos’ <a href="https://www.researchgate.net/publication/35657389_Beyond_regression_new_tools_for_prediction_and_analysis_in_the_behavioral_sciences">Ph.D. thesis</a> and <a href="http://www.wiley.com/WileyCDA/WileyTitle/productCd-0471598976.html">book</a>, and <a href="http://www.nature.com/nature/journal/v323/n6088/abs/323533a0.html">Rumelhart et al.</a>). Some related ideas existed in control theory in the 1960s. (One reader points out another independent rediscovery, the Baur-Strassen lemma from 1983.)</p>
<p>Backpropagation gives a fast way to compute the sensitivity of the output of a neural network to all of its parameters while keeping the inputs of the network fixed: specifically it computes all partial derivatives ${\partial f}/{\partial w_i}$ where $f$ is the output and $w_i$ is the $i$th parameter. (Here <em>parameters</em> can be edge weights or biases associated with nodes or edges of the network, and the precise details of the node computations —e.g., the precise form of nonlinearity like Sigmoid or RELU— are unimportant.) Doing so gives the <em>gradient</em> $\nabla f$ of $f$ with respect to its network parameters, which allows a <em>gradient descent</em> step in the training: change all parameters simultaneously to move the vector of parameters a small amount in the direction $-\nabla f$.</p>
<p>Note that backpropagation computes the gradient exactly, but properly training neural nets needs many more tricks than just backpropagation. Understanding backpropagation is useful for appreciating some advanced tricks.</p>
<p>The importance of backpropagation derives from its efficiency. Assuming node operations take unit time, the running time is <em>linear</em>, specifically, $O(\text{Network Size}) = O(V + E)$, where $V$ is the number of nodes in the network and $E$ is the number of edges. The only technical ingredient is the chain rule from calculus, but applying it naively would have resulted in quadratic running time, which would be hugely inefficient for networks with millions or even thousands of parameters.</p>
<p>Backpropagation can be efficiently implemented using highly parallel vector operations available in today’s <a href="https://en.wikipedia.org/wiki/General-purpose_computing_on_graphics_processing_units">GPUs (Graphics Processing Units)</a>, which play an important role in the recent neural nets revolution.</p>
<p><strong>Side Note:</strong> Expert readers will recognize that in the standard accounts of neural net training,
the actual quantity of interest is the gradient of the <em>training loss</em>, which happens to be a simple function of the network output. But the above phrasing is fully general since one can simply add a new output node to the network that computes the training loss from the old output. Then the quantity of interest is indeed the gradient of this new output with respect to network parameters.</p>
<h2 id="problem-setup">Problem Setup</h2>
<p>Backpropagation applies only to acyclic networks with directed edges. (Later we briefly sketch its use on networks with cycles.)</p>
<p>Without loss of generality, acyclic networks can be visualized as being structured in numbered layers, with nodes in the $t+1$th layer getting all their inputs from the outputs of nodes in layers $t$ and earlier. We use $f \in \mathbb{R}$ to denote the output of the network. In all our figures, the input of the network is at the bottom and the output on the top.</p>
<p>We start with a simple claim that reduces the problem of computing the gradient to the problem of computing partial derivatives with respect to the nodes:</p>
<blockquote>
<p><strong>Claim 1:</strong> To compute the desired gradient with respect to the parameters, it suffices to compute $\partial f/\partial u$ for every node $u$.</p>
</blockquote>
<p>Let’s be clear what $\partial f/\partial u$ means. Suppose we cut off all the incoming edges of the node $u$, and fix/clamp the current values of all network parameters. Now imagine changing $u$ from its current value. This change may affect values of nodes at higher levels that are connected to $u$, and the final output $f$ is one such node. Then $\partial f/\partial u$ denotes the rate at which $f$ will change as we vary $u$. (Aside: Readers familiar with the usual exposition of back-propagation should note that there $f$ is the training error and this $\partial f/\partial u$ turns out to be exactly the “error” propagated back to the node $u$.)</p>
<p>Claim 1 is a direct application of the chain rule, and let’s illustrate it for simple neural nets (we address more general networks later). Suppose node $u$ is a weighted sum of the nodes $z_1,\dots, z_n$ (which will be passed through a non-linear activation $\sigma$ afterwards). That is, we have $u = w_1z_1+\dots+w_nz_n$. By the chain rule, we have</p>
<script type="math/tex; mode=display">\frac{\partial f}{\partial w_1} = \frac{\partial f}{\partial u}\cdot \frac{\partial{u}}{\partial w_1} = \frac{\partial f}{\partial u}\cdot z_1.</script>
<p>Hence, we see that having computed $\partial f/\partial u$ we can compute $\partial f/\partial w_1$, and moreover this can be done locally by the endpoints of the edge where $w_1$ resides.</p>
<div style="text-align:center;">
<img style="width:500px;" src="http://www.cs.princeton.edu/~tengyu/forblog/weight5.jpg" />
</div>
<h3 id="multivariate-chain-rule">Multivariate Chain Rule</h3>
<p>Towards computing the derivatives with respect to the nodes, we first recall the multivariate Chain rule, which handily describes the relationships between these partial derivatives (depending on the graph structure).</p>
<p>Suppose a variable $f$ is a function of variables $u_1,\dots, u_n$, which in turn depend on the variable $z$. Then, multivariate Chain rule says that</p>
<script type="math/tex; mode=display">\frac{\partial f}{\partial z} = \sum_{j=1}^n \frac{\partial f}{\partial u_j}\cdot \frac{\partial u_j}{\partial z}~.</script>
<p>This is a direct generalization of eqn. (2) and a sub-case of eqn. (11) in this <a href="http://mathworld.wolfram.com/ChainRule.html">description of chain rule</a>.</p>
<p>This formula is perfectly suitable for our cases. Below is the same example as we used before but with a different focus and numbering of the nodes.</p>
<div style="text-align:center;">
<img style="width:500px;" src="http://www.cs.princeton.edu/~tengyu/forblog/chain_rule_5.jpg" />
</div>
<p>We see that once we’ve computed the derivatives with respect to all the nodes that are above the node $z$, we can compute the derivative with respect to the node $z$ via a weighted sum, where the weights involve the local derivative ${\partial u_j}/{\partial z}$ that is often easy to compute. This brings us to the question of how we measure running time. For book-keeping, we assume that</p>
<blockquote>
<p><strong>Basic assumption:</strong> If $u$ is a node at level $t+1$ and $z$ is any node at level $\leq t$ whose output is an input to $u$, then computing $\frac{\partial u}{\partial z}$ takes unit time on our computer.</p>
</blockquote>
<h3 id="naive-feedforward-algorithm-not-efficient">Naive feedforward algorithm (not efficient!)</h3>
<p>It is useful to first point out the naive quadratic time algorithm implied by the chain rule. Most authors skip this trivial version, which we think is analogous to teaching sorting using only quicksort, and skipping over the less efficient bubblesort.</p>
<p>The naive algorithm is to compute $\partial u_i/\partial u_j$ for every pair of nodes where $u_i$ is at a higher level than $u_j$. Of course, among these $V^2$ values (where $V$ is the number of nodes) are also the desired ${\partial f}/{\partial u_i}$ for all $i$ since $f$ is itself the value of the output node.</p>
<p>This computation can be done in feedforward fashion. Fix a node $u_j$. If $\partial u_i/\partial u_j$ has been obtained for every $u_i$ at levels up to and including level $t$, then one can express (by inspecting the multivariate chain rule) the value $\partial u_{\ell}/\partial u_j$ for some $u_{\ell}$ at level $t+1$ as a weighted combination of the values $\partial u_{i}/\partial u_j$ for each $u_i$ that is a direct input to $u_{\ell}$. This description shows that the amount of computation for a fixed $j$ is proportional to the number of edges $E$. This amount of work happens for all $V$ values of $j$, letting us conclude that the total work in the algorithm is $O(VE)$.</p>
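<p>As a rough illustration of the naive algorithm, here is a minimal pure-Python sketch. The five-node network, the <code>NODES</code> table and the function names are hypothetical and chosen only for this example; the point is the doubly nested loop, which performs one full forward sweep for every source node $u_j$.</p>

```python
import math

# Hypothetical toy network, listed in topological order.
# Each entry: (parent indices, forward function, local partial derivatives).
NODES = [
    ((), None, None),                                    # u0: input
    ((), None, None),                                    # u1: input
    ((0, 1), lambda a, b: a * b, lambda a, b: (b, a)),   # u2 = u0 * u1
    ((0, 2), lambda a, b: math.tanh(a + b),
     lambda a, b: (1 - math.tanh(a + b) ** 2,) * 2),     # u3 = tanh(u0 + u2)
    ((2, 3, 1), lambda a, b, c: a * b + c,
     lambda a, b, c: (b, a, 1.0)),                       # u4 = u2 * u3 + u1 = f
]


def forward(inputs):
    """Evaluate every node, given values for the input nodes."""
    values = list(inputs)
    for parents, fn, _ in NODES[len(inputs):]:
        values.append(fn(*(values[p] for p in parents)))
    return values


def naive_all_pairs(values):
    """Return D with D[l][j] = d u_l / d u_j for every pair of nodes."""
    n = len(NODES)
    D = [[0.0] * n for _ in range(n)]
    for j in range(n):                  # one forward sweep per source node
        D[j][j] = 1.0
        for l in range(j + 1, n):
            parents, _, partials = NODES[l]
            if not parents:             # input nodes depend on nothing
                continue
            local = partials(*(values[p] for p in parents))
            # Multivariate chain rule over the direct inputs of u_l.
            D[l][j] = sum(d * D[p][j] for p, d in zip(parents, local))
    return D


values = forward([0.5, -1.2])
D = naive_all_pairs(values)
print("df/du_j for every node j:", D[-1])
```

<p>The last row of <code>D</code> holds the desired $\partial f/\partial u_j$; all the other rows are wasted work, which is what backpropagation avoids.</p>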
<h2 id="backpropagation-linear-time">Backpropagation (Linear Time)</h2>
<p>The more efficient backpropagation, as the name suggests, computes the partial derivatives in the reverse direction. Messages are passed in one wave backwards from higher number layers to lower number layers. (Some presentations of the algorithm describe it as dynamic programming.)</p>
<blockquote>
<p><strong>Messaging protocol:</strong>
The node $u$ receives a message along each outgoing edge from the node at the other end of that edge. It sums these messages to get a number $S$ (if $u$ is the output of the entire net, then define $S=1$) and then it sends the following message to any node $z$ adjacent to it at a lower level:
<script type="math/tex">S \cdot \frac{\partial u}{\partial z}</script></p>
</blockquote>
<p>Clearly, the amount of work done by each node is proportional to its degree, and thus overall work is the sum of the node degrees. Summing all node degrees counts each edge twice, and thus the overall work is
$O(\text{Network Size})$.</p>
<p>To prove correctness, we prove the following:</p>
<blockquote>
<p><strong>Main Claim</strong>: At each node $z$, the value $S$ is exactly ${\partial f}/{\partial z}$.</p>
</blockquote>
<p><em>Base Case</em>: At the output layer this is true, since ${\partial f}/{\partial f} =1$.</p>
<p><em>Inductive case</em>: Suppose the claim is true for layers $t+1$ and higher, and $z$ is at layer $t$ with outgoing edges going to some nodes $u_1, u_2, \ldots, u_m$ at levels $t+1$ or higher. By the inductive hypothesis, node $z$ indeed receives $ \frac{\partial f}{\partial u_i}\times \frac{\partial u_i}{\partial z}$ from each $u_i$. Thus by the chain rule,
<script type="math/tex">S= \sum_{i =1}^m \frac{\partial f}{\partial u_i}\frac{\partial u_i}{\partial z}=\frac{\partial f}{\partial z}.</script>
This completes the induction and proves the Main Claim.</p>
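<p>The messaging protocol can be sketched in a few lines of pure Python. The toy network and the names <code>NODES</code>, <code>forward</code> and <code>backprop</code> are hypothetical, introduced only for illustration; this is a sketch under those assumptions, not a reference implementation.</p>

```python
import math

# Hypothetical toy network, listed in topological order.
# Each entry: (parent indices, forward function, local partial derivatives).
NODES = [
    ((), None, None),                                    # u0: input
    ((), None, None),                                    # u1: input
    ((0, 1), lambda a, b: a * b, lambda a, b: (b, a)),   # u2 = u0 * u1
    ((0, 2), lambda a, b: math.tanh(a + b),
     lambda a, b: (1 - math.tanh(a + b) ** 2,) * 2),     # u3 = tanh(u0 + u2)
    ((2, 3, 1), lambda a, b, c: a * b + c,
     lambda a, b, c: (b, a, 1.0)),                       # u4 = u2 * u3 + u1 = f
]


def forward(inputs):
    """Evaluate every node, given values for the input nodes."""
    values = list(inputs)
    for parents, fn, _ in NODES[len(inputs):]:
        values.append(fn(*(values[p] for p in parents)))
    return values


def backprop(values):
    """Return S with S[u] = df/du, using one backward wave of messages."""
    n = len(NODES)
    S = [0.0] * n        # S[u] accumulates the messages received by node u
    S[n - 1] = 1.0       # the output node uses S = 1 by definition
    for u in range(n - 1, -1, -1):      # higher levels first
        parents, _, partials = NODES[u]
        if not parents:
            continue
        local = partials(*(values[p] for p in parents))
        for z, d in zip(parents, local):
            S[z] += S[u] * d            # message S * (du/dz) sent down to z
    return S


values = forward([0.5, -1.2])
S = backprop(values)
print("df/du for every node u:", S)
```

<p>Each edge is touched exactly once in the backward loop, which matches the linear-time accounting above. Visiting nodes in reverse topological order guarantees that $S$ at a node is complete before that node sends its own messages.</p>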
<h2 id="auto-differentiation">Auto-differentiation</h2>
<p>Since the exposition above used almost no details about the network and the operations that the nodes perform, it extends to every computation that can be organized as an acyclic graph in which each node computes a differentiable function of its incoming neighbors. This observation underlies many auto-differentiation packages such as <a href="https://github.com/HIPS/autograd">autograd</a> or <a href="https://www.tensorflow.org/">tensorflow</a>: they allow computing the gradient of the output of such a computation with respect to the network parameters.</p>
<p>We first observe that Claim 1 continues to hold in this very general setting. This is without loss of generality because we can view the parameters associated to the edges as also sitting on the nodes (actually, leaf nodes). This can be done via a simple transformation to the network; for a single node it is shown in the picture below; and one would need to continue to do this transformation in the rest of the network feeding into $u_1, u_2, \ldots$ from below.</p>
<div style="text-align:center;">
<img style="width:800px;" src="http://www.cs.princeton.edu/~tengyu/forblog/change_view" />
</div>
<p>Then, we can use the messaging protocol to compute the derivatives with respect to the nodes, as long as the local partial derivative can be computed efficiently. We note that the algorithm can be implemented in a fairly modular manner: For every node $u$, it suffices to specify (a) how it depends on the incoming nodes, say, $z_1,\dots, z_n$ and (b) how to compute the partial derivative times $S$, that is, $S \cdot \frac{\partial u}{\partial z_j}$.</p>
<p><em>Extension to vector messages</em>: In fact (b) can be done efficiently in more general settings where we allow the output of each node in the network to be a vector (or even matrix/tensor) instead of only a real number. Here we need to replace $\frac{\partial u}{\partial z_j}\cdot S$ by $\frac{\partial u}{\partial z_j}[S]$, which denotes the result of applying the operator $\frac{\partial u}{\partial z_j}$ on $S$. We note that to be consistent with the convention in the usual exposition of backpropagation, when $y\in \mathbb{R}^{p}$ is a function of $x\in \mathbb{R}^q$, we use $\frac{\partial y}{\partial x}$ to denote $q\times p$ dimensional matrix with $\partial y_j/\partial x_i$ as the $(i,j)$-th entry. Readers might notice that this is the transpose of the usual Jacobian matrix defined in mathematics. Thus $\frac{\partial y}{\partial x}$ is an operator that maps $\mathbb{R}^p$ to $\mathbb{R}^q$ and we can verify $S$ has the same dimension as $u$ and $\frac{\partial u}{\partial z_j}[S]$ has the same dimension as $z_j$.</p>
<p>For example, as illustrated below, suppose the node $U\in \mathbb{R}^{d_1\times d_3} $ is the product $U = WZ$ of two matrices $W\in \mathbb{R}^{d_1\times d_2}$ and $Z\in \mathbb{R}^{d_2\times d_3}$. Then $\partial U/\partial Z$ is a linear operator that maps $\mathbb{R}^{d_1\times d_3}$ to $\mathbb{R}^{d_2\times d_3}$, which naively requires a matrix representation of dimension $d_2d_3\times d_1d_3$. However, the computation (b) can be done efficiently because
<script type="math/tex">\frac{\partial U}{\partial Z}[S]= W^{\top}S.</script></p>
<p>Such vector operations can also be implemented efficiently using today’s GPUs.</p>
<div style="text-align:center;">
<img style="width:200px;" src="http://www.cs.princeton.edu/~tengyu/forblog/mult.jpg" />
</div>
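<p>A small numerical check of the matrix message, as a sketch. It assumes the shapes for which $W^{\top}S$ is well defined, namely $U = WZ$ with $W$ of size $d_1\times d_2$ and $Z$ of size $d_2\times d_3$. The concrete matrices and helper names are hypothetical, and the scalar $f=\sum_{i,j} S_{ij}U_{ij}$ is chosen only so that the message arriving at $U$ is exactly $S$.</p>

```python
def matmul(A, B):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]


def transpose(A):
    return [list(col) for col in zip(*A)]


# Hypothetical shapes: d1 = 3, d2 = 2, d3 = 4.
W = [[1.0, 2.0], [0.5, -1.0], [3.0, 0.0]]                  # d1 x d2
Z = [[0.2, -0.4, 1.0, 0.0], [1.5, 0.3, -2.0, 0.7]]         # d2 x d3
S = [[1.0, 0.0, -1.0, 2.0],
     [0.5, 0.5, 0.5, 0.5],
     [-2.0, 1.0, 0.0, 3.0]]                                # d1 x d3, like U


def f(Zmat):
    """Scalar output f = sum_ij S_ij * U_ij with U = W Zmat, so df/dU = S."""
    U = matmul(W, Zmat)
    return sum(s * u for srow, urow in zip(S, U) for s, u in zip(srow, urow))


# The message sent from U down to Z: same shape as Z, and no
# (d2*d3) x (d1*d3) matrix is ever formed.
message = matmul(transpose(W), S)


def finite_difference(i, j, h=1e-4):
    """Forward-difference estimate of df/dZ_ij, used only as a check."""
    Zp = [row[:] for row in Z]
    Zp[i][j] += h
    return (f(Zp) - f(Z)) / h


print(message[0][0], finite_difference(0, 0))
```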
<h2 id="notable-extensions">Notable Extensions</h2>
<p>1) <em>Allowing weight tying.</em> In many neural architectures, the designer wants to force many network units such as edges or nodes to share the same parameter. For example, in <a href="https://en.wikipedia.org/wiki/Convolutional_neural_network"><em>convolutional neural nets</em></a>, the same filter has to be applied all over the image, which implies reusing the same parameter for a large set of edges between the two layers.</p>
<p>For simplicity, suppose two parameters $a$ and $b$ are supposed to share the same value. This is equivalent to adding a new node $u$ and connecting $u$ to both $a$ and $b$ with the operation $a = u$ and $b=u$. Thus, by chain rule, <script type="math/tex">\frac{\partial f}{\partial u} = \frac{\partial f}{\partial a}\cdot \frac{\partial a}{\partial u}+\frac{\partial f}{\partial b}\cdot \frac{\partial b}{\partial u} = \frac{\partial f}{\partial a}+ \frac{\partial f}{\partial b}.</script> Hence, equivalently, the gradient with respect to a shared parameter is the sum of the gradients with respect to individual occurrences.</p>
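<p>To make the weight-tying rule concrete, here is a tiny sketch with a hypothetical two-parameter function $f(a,b) = a x_1 + \tanh(b x_2)$ in which both parameters are tied to a single value $u$. The function and constants are assumptions made for illustration only.</p>

```python
import math

x1, x2 = 0.7, -1.3   # hypothetical fixed inputs


def f(a, b):
    return a * x1 + math.tanh(b * x2)


def grads(a, b):
    """Partial derivatives of f with respect to the two occurrences."""
    df_da = x1
    df_db = x2 * (1 - math.tanh(b * x2) ** 2)
    return df_da, df_db


u = 0.4                          # shared value: a = u and b = u
df_da, df_db = grads(u, u)
tied_grad = df_da + df_db        # gradient with respect to the shared u

# Check against a central finite difference of u -> f(u, u).
h = 1e-6
fd = (f(u + h, u + h) - f(u - h, u - h)) / (2 * h)
print(tied_grad, fd)
```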
<p>2) <em>Backpropagation on networks with loops.</em> The above exposition assumed the network is acyclic. Many cutting-edge applications such as machine translation and language understanding use networks with directed loops (e.g., recurrent neural networks). These architectures —all examples of the “differentiable computing” paradigm below—can get complicated and may involve operations on a separate memory as well as mechanisms to shift attention to different parts of data and memory.</p>
<p>Networks with loops are trained using gradient descent as well, using <a href="https://en.wikipedia.org/wiki/Backpropagation_through_time">back-propagation through time</a>, which consists of expanding the network through a finite number of time steps into an acyclic graph, with replicated copies of the same
network. These replicas share the weights (weight tying!) so the gradient can be computed. In practice an issue may arise with <a href="https://en.wikipedia.org/wiki/Vanishing_gradient_problem">exploding or vanishing gradients</a> which impact convergence. Such issues can be carefully addressed in practice by clipping the gradient or re-parameterization techniques such as <a href="https://en.wikipedia.org/wiki/Long_short-term_memory">long short-term memory</a>.</p>
<p>The fact that the gradient can be computed efficiently for such general networks with loops has motivated neural net models with memory or even data structures (see for example <a href="https://en.wikipedia.org/wiki/Neural_Turing_machine">neural Turing machines</a> and <a href="https://en.wikipedia.org/wiki/Differentiable_neural_computer">differentiable neural computer</a>). Using gradient descent, one can optimize over a family of parameterized networks with loops to find the best one that solves a certain computational task (on the training examples). The limits of these ideas are still being explored.</p>
<p>3) <em>Hessian-vector product in linear time.</em> It is possible to generalize backprop to enable 2nd order optimization in “near-linear” time, not just gradient descent,
as shown in recent independent manuscripts of <a href="https://arxiv.org/pdf/1611.00756.pdf">Carmon et al.</a> and <a href="https://arxiv.org/pdf/1611.01146.pdf">Agarwal et al.</a> (NB: Tengyu is a coauthor on the latter). One essential step is to
compute the product of the <a href="https://en.wikipedia.org/wiki/Hessian_matrix">Hessian matrix</a> and a vector, for which <a href="http://www.bcl.hamilton.ie/~barak/papers/nc-hessian.pdf">Pearlmutter’93</a> gave an efficient algorithm. Here we show how to do this in $O(\mbox{Network size})$ using the ideas above. We need a slightly stronger version of the back-propagation result than the one in the previous subsection:</p>
<blockquote>
<p><strong>Claim (informal):</strong> Suppose an acyclic network with $V$ nodes and $E$ edges has output $f$ and leaves $z_1,\dots, z_m$. Then there exists a network of size $O(V+E)$ that has $z_1,\dots, z_m$ as input nodes and $\frac{\partial f}{\partial z_1},\dots, \frac{\partial f}{\partial z_m}$ as output nodes.</p>
</blockquote>
<p>The proof of the Claim follows in straightforward fashion from implementing the message passing protocol as an acyclic circuit.</p>
<p>Next we show how to compute $\nabla^2 f(z)\cdot v$ where $v$ is a given fixed vector. Let $g(z)= \langle \nabla f(z),v\rangle$ be a function from $\mathbb{R}^m\rightarrow \mathbb{R}$. By the Claim above, $\nabla f(z)$ can be computed by a network of size $O(V+E)$, and taking the inner product with $v$ adds only $O(m)$ further nodes, so $g(z)$ can also be computed by a network of size $O(V+E)$. Applying the Claim again to $g$, we obtain that $\nabla g(z)$ can also be computed by a network of size $O(V+E)$.</p>
<p>Note that by construction,
<script type="math/tex">\nabla g(z) = \nabla^2 f(z)\cdot v.</script>
Hence we have computed the Hessian vector product in network size time.</p>
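<p>The identity $\nabla g(z) = \nabla^2 f(z)\cdot v$ is easy to sanity-check numerically. The sketch below uses a hypothetical function $f(z) = z_0^2 z_1 + \sin(z_1 z_2)$ with a hand-written gradient and Hessian. It differentiates $g$ by central finite differences, which stands in for the second backpropagation pass, so it checks the identity only and not the linear-time construction.</p>

```python
import math


def grad_f(z):
    """Gradient of the toy function f(z) = z0^2 * z1 + sin(z1 * z2)."""
    z0, z1, z2 = z
    c = math.cos(z1 * z2)
    return [2 * z0 * z1, z0 ** 2 + z2 * c, z1 * c]


def hessian_f(z):
    """Explicit Hessian of the same f, used only to check the result."""
    z0, z1, z2 = z
    s, c = math.sin(z1 * z2), math.cos(z1 * z2)
    return [
        [2 * z1, 2 * z0, 0.0],
        [2 * z0, -z2 ** 2 * s, c - z1 * z2 * s],
        [0.0, c - z1 * z2 * s, -z1 ** 2 * s],
    ]


def g(z, v):
    """g(z) = <grad f(z), v>."""
    return sum(gi * vi for gi, vi in zip(grad_f(z), v))


def grad_g_fd(z, v, h=1e-5):
    """Central finite-difference gradient of g (stand-in for backprop on g)."""
    out = []
    for i in range(len(z)):
        zp, zm = list(z), list(z)
        zp[i] += h
        zm[i] -= h
        out.append((g(zp, v) - g(zm, v)) / (2 * h))
    return out


z = [0.3, -0.8, 1.1]
v = [1.0, 2.0, -0.5]
hv_direct = [sum(hij * vj for hij, vj in zip(row, v)) for row in hessian_f(z)]
hv_via_g = grad_g_fd(z, v)
print(hv_direct, hv_via_g)
```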
<h2 id="thats-all">That’s all!</h2>
<p>Please write your comments on this exposition and whether it can be improved.</p>
Tue, 20 Dec 2016 17:00:00 +0000
http://offconvex.github.io/2016/12/20/backprop/
The search for biologically plausible neural computation: The conventional approach
<p>Inventors of the original artificial neural networks (NNs) derived their inspiration from biology.
However, as artificial NNs progressed, their design was less guided by neuroscience facts.
Meanwhile, progress in neuroscience has altered our conceptual understanding of neurons.
Consequently, we believe that many successful artificial NNs resemble natural NNs only
superficially, violating fundamental constraints imposed by biological hardware.</p>
<p>The wide gap between the artificial and natural NN designs raises intriguing questions: What
algorithms underlie natural NNs? Can insights from biology help build better artificial
NNs?</p>
<p>This is the first of a series of posts aimed at explaining recent progress made
by my collaborators and myself towards biologically plausible NNs. Such networks can
serve both as models of natural NNs and as general
purpose artificial NNs. We have found that respecting biological constraints
actually helps development of artificial NNs by guiding design decisions.</p>
<p>In this post, I cover the background material, going back several decades.
I sketch a biological neuron, introduce primary biological constraints, and discuss
the conventional approach to deriving artificial NNs. I will show that while
the conventional approach generates a reasonable algorithmic model of a <em>single</em>
biological neuron, multi-neuron networks violate biological constraints. In future
posts we will see how to fix that.</p>
<h2 id="a-sketch-of-a-biological-neuron">A Sketch of a Biological Neuron</h2>
<p>Here is the minimum biological background needed to understand the rest of the post.</p>
<p>A biological neuron receives signals from multiple neurons, computes their weighted sum and
generates a signal transmitted to multiple neurons, Figure 1. Each neuron’s signaling activity is
quantified by the <em>firing rate</em>, which is a nonnegative real number that varies over time. Each synapse
scales the input from the corresponding upstream neuron onto the receiving neuron by its
weight. The receiving neuron sums scaled inputs, i.e. computes the inner product of the
upstream activity vector and the synaptic weight vector. The inner product passes
through a nonlinearity called the activation function and the output is transmitted
to downstream neurons.</p>
<p>Synaptic weights change over time, typically on a slower time scale than neuronal signals. The
weight depends on neuronal signals per so-called learning rules. For example, in commonly
used <a href="https://en.wikipedia.org/wiki/Hebbian_theory">Hebbian learning rules</a>, synaptic weight is
proportional to the correlation between the activities of the two neurons a synapse connects,
i.e. pre- and postsynaptic.</p>
<p><img src="https://docs.google.com/uc?export=download&id=0B5Pfjcu55RQQb0F1MllnRTdweDQ" alt="" title="Title" />
Figure 1: A biological neuron modelled by an online algorithm. Left: A biological neuron receives
inputs from the upstream neurons (green) which are scaled by the weights of corresponding
synapses (blue). The neuron (black) computes output, $y$, as a function of the weighted input sum. Right: Online algorithm outputs an activation function of the inner product of the
synaptic weight vector and an upstream activity vector. Synaptic weights are modified by
neuronal activities (dashed line) per learning rules.</p>
<h2 id="primary-biological-constraints">Primary Biological Constraints</h2>
<p>To determine which algorithmic models in this post are biologically plausible,
we can focus on a few key biological constraints.</p>
<p>Biologically plausible algorithms must be formulated in the <strong>online</strong> (or streaming), rather than
offline (or batch), setting. This means that input data are streamed to the algorithm
sequentially, one sample at a time, and the corresponding output must be computed before
the next input sample arrives. The output communicated to downstream neurons cannot be
modified in the future. A neuron cannot store individual past inputs or outputs except in a
highly compressed format limited to synaptic weights and a few state variables.</p>
<p>In biologically plausible NNs, learning rules must be <strong>local</strong>. This means that the synaptic weight
update may depend on the activities of only the two neurons a synapse connects, as for
example, in Hebbian learning. Activities of other neurons are not physically available to a
synapse and therefore including them into learning rules would be biologically
implausible. Modern artificial NNs, such as
backpropagation-based deep learning networks, rely on nonlocal learning rules.</p>
<p>Our initial focus is on <strong>unsupervised</strong> learning. This is not a hard constraint, but rather a matter of
priority. Whereas humans are clearly capable of supervised learning, most of our learning tasks
lack big labeled datasets. On the mechanistic level, most neurons lack a clear supervision signal.</p>
<h2 id="single-neuron-online-principal-component-analysis-pca">Single-neuron Online Principal Component Analysis (PCA)</h2>
<p>In 1982, <a href="https://pdfs.semanticscholar.org/38bc/c5d342accf5249514cdfdaaa40871a93252c.pdf">Oja proposed</a>
modeling a neuron by an online PCA algorithm. PCA is a workhorse of data
analysis used for dimensionality reduction, denoising, and latent factor
discovery. Therefore, Oja’s seminal paper established that biological processes in
a neuron can be viewed as the steps of an online algorithm solving a useful
computational objective.</p>
<p>Oja’s single-neuron online PCA algorithm works as follows. At each time step, $t$,
it receives an input data sample, ${\bf x}_t$, computes and outputs
the corresponding top principal component value, $y_t$:</p>
<script type="math/tex; mode=display">y_t \leftarrow {\bf w} _{t-1}^\top {\bf x}_t. \qquad \qquad \qquad (1.1)</script>
<p>Here and below lowercase boldfaced letters designate vectors. Then the algorithm updates the
(normalized) feature vector,</p>
<script type="math/tex; mode=display">{\bf w} _t \leftarrow {\bf w} _{t-1}+ \eta ( {\bf x} _t- {\bf w} _{t-1} y_t ) y_t. \qquad \qquad (1.2)</script>
<p>The feature vector, ${\bf w}$, converges to the eigenvector of the input covariance with the largest eigenvalue if data are drawn i.i.d. from
a stationary distribution.</p>
<p>The steps of the Oja algorithm (1.1-1.2) correspond to the operations of the biological neuron. If the
input vector is represented by the activities of the upstream neurons, (1.1) represents weighted
summation of the inputs by the output neuron. If the activation function is linear, the output, $y_t$,
is simply the weighted sum. The update (1.2) is a local Hebbian synaptic learning rule. The
first term of the update is proportional to the correlation of the pre- and postsynaptic neurons’
activities and the second term, also local, normalizes the synaptic weight vector.</p>
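<p>As an illustration of steps (1.1) and (1.2), here is a minimal pure-Python sketch run on synthetic data. The two-dimensional Gaussian inputs, the learning rate, the initial weight vector and the function name <code>oja</code> are all hypothetical choices for this example, not taken from Oja’s paper.</p>

```python
import random


def oja(samples, w, eta):
    """Run the single-neuron updates (1.1) and (1.2) over a stream of samples."""
    for x in samples:
        # (1.1): output is the inner product of weights and input.
        y = sum(wi * xi for wi, xi in zip(w, x))
        # (1.2): Hebbian term x*y plus the local normalizing term -w*y^2.
        w = [wi + eta * (xi - wi * y) * y for wi, xi in zip(w, x)]
    return w


# Synthetic data: independent coordinates with standard deviations 2.0 and
# 0.3, so the top eigenvector of the covariance is the first coordinate axis.
rng = random.Random(0)
samples = [(2.0 * rng.gauss(0, 1), 0.3 * rng.gauss(0, 1))
           for _ in range(20000)]

w = oja(samples, [0.6, 0.8], eta=0.005)
norm = sum(wi * wi for wi in w) ** 0.5
print("learned w:", w, "norm:", norm)
```

<p>Note that each weight update uses only the input $x_i$ on that synapse, the weight $w_i$ itself and the output $y$, which is the locality property discussed above.</p>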
<h2 id="a-normative-theory">A Normative Theory</h2>
<p>Next, we would like to build on Oja’s insightful identification of biological
processes with the steps of the online PCA algorithm by computing multiple
principal components using multi-neuron NNs and including the activation nonlinearity.</p>
<p>Instead of trying to extend the Oja model heuristically, we take a more
systematic, so-called normative approach. In this approach, a biological
model is viewed as the solution of an optimization problem. Specifically, we
postulate an objective function motivated by a computational principle,
derive an online algorithm optimizing such objective, and map the steps
of the algorithm onto biological processes.</p>
<p>Having such normative theory allows us to navigate through the space of
possible algorithmic models in a more efficient and systematic way. Mathematical
compactness of objective functions facilitates generating new models and
weeding out inconsistent ones. This is similar to the Hamiltonian approach
in physics which leverages natural symmetries and safeguards against the violation
of the first law of thermodynamics (energy conservation).</p>
<h2 id="deriving-a-single-neuron-online-pca-using-the-reconstruction-approach">Deriving a Single-neuron Online PCA using the Reconstruction Approach</h2>
<p>To build a normative theory, we first need to derive Oja’s single-neuron online algorithm by
solving an optimization problem. What objective function should we choose for online PCA?
Historically, neural activity has often been viewed as representing each data sample, ${\bf x}_t$, by the
feature vector, ${\bf w}$, scaled by the output, $y_t$ (Figure 2). This reconstruction approach is naturally
formalized as the minimization of the reconstruction (or coding) error:</p>
<script type="math/tex; mode=display">\min_{\| {\bf w} \|=1} {\sum \limits_{t=1}^{T} \min_{ y_t} {\| {\bf x}_t-{\bf w} y_t \| ^2_2}}. \qquad \qquad \qquad \quad (1.3)</script>
<p>In the offline setting, optimization problem (1.3) is solved by PCA: the optimum ${\bf w}$ is the
eigenvector of input covariance corresponding to the top eigenvalue and the optimum output, $y$, is the first principal component.</p>
<p><img src="https://docs.google.com/uc?export=download&id=0B5Pfjcu55RQQaUtLTmFmbjZ3eWs" alt="" /></p>
<p>Figure 2. PCA represents data samples (circles) by their projections (red) onto the top
eigenvector, ${\bf w}$. These projections constitute the top principal component. Objective (1.3)
minimizes the reconstruction error (blue).</p>
<p>In the online setting, (1.3) can be solved by alternating minimization, which has been a subject
of <a href="http://www.offconvex.org/2016/05/08/almostconvexitySATM/">recent analysis</a>. After the arrival of each data point, <script type="math/tex">{\bf x}_t</script> , the algorithm computes optimum
output, $y_t$, while keeping the feature vector, ${\bf w}_{t-1}$, computed at the previous time step, fixed.
By using calculus, one finds that the optimum output is given by (1.1). Then, the algorithm
minimizes the total reconstruction error with respect to the feature vector while keeping all the
outputs fixed. Again, resorting to calculus, one finds (1.2).</p>
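<p>The two calculus steps can be spelled out as follows. This is a sketch of one way to arrive at (1.1) and (1.2), assuming that the feature-vector step is a single gradient step on the current sample’s error term and that the unit-norm constraint is not enforced explicitly. For a fixed unit-norm ${\bf w}_{t-1}$, setting the derivative with respect to $y_t$ to zero gives</p>
<script type="math/tex; mode=display">\frac{\partial}{\partial y_t} \| {\bf x}_t-{\bf w}_{t-1} y_t \|^2_2 = -2\, {\bf w}_{t-1}^\top ( {\bf x}_t-{\bf w}_{t-1} y_t ) = 0 \quad\Longrightarrow\quad y_t = \frac{{\bf w}_{t-1}^\top {\bf x}_t}{\| {\bf w}_{t-1} \|^2_2} = {\bf w}_{t-1}^\top {\bf x}_t ,</script>
<p>which is (1.1). For a fixed $y_t$, the gradient with respect to the feature vector is</p>
<script type="math/tex; mode=display">\nabla_{\bf w} \| {\bf x}_t-{\bf w} y_t \|^2_2 = -2\, ( {\bf x}_t-{\bf w} y_t )\, y_t ,</script>
<p>so a gradient step of size $\eta/2$ starting from ${\bf w}_{t-1}$ reads</p>
<script type="math/tex; mode=display">{\bf w}_t = {\bf w}_{t-1} + \eta\, ( {\bf x}_t-{\bf w}_{t-1} y_t )\, y_t ,</script>
<p>which has the same form as (1.2).</p>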
<p>Thus, the single-neuron online PCA algorithm may be derived using the reconstruction approach. To
compute multiple principal components, we need to extend this success to multi-neuron
networks.</p>
<h2 id="the-reconstruction-approach-fails-for-multi-neuron-networks">The Reconstruction Approach Fails for Multi-neuron Networks</h2>
<p>Though the reconstruction approach yields a multi-component online PCA
algorithm, the corresponding NNs are <em>not</em> biologically plausible.</p>
<p>Extension of the reconstruction error objective from single to multiple output components is
straightforward - each scalar, $y_t$, is replaced by a vector, ${\bf y}_t$:</p>
<script type="math/tex; mode=display">\min_{\rm diag({\bf W}^\top {\bf W})={\bf I}} {\sum \limits_{t=1}^{T} \min_{ {\bf y}_t} {\| {\bf x}_t-{\bf W} {\bf y}_t \| ^2_2}}. \qquad \qquad \qquad \quad (1.4)</script>
<p>Here matrix ${\bf W}$ comprises column-vectors corresponding to different features. As in the single-
neuron case this objective can be optimized online by alternating minimization. After the arrival
of data sample, ${\bf x}_t$, the feature vectors are kept fixed while the objective (1.4) is minimized
with respect to the principal components by iterating the following update until convergence:</p>
<script type="math/tex; mode=display">{\bf y}_t \leftarrow {\bf W} _{t-1}^\top {\bf x}_t - {\bf W} _{t-1}^\top {\bf W} _{t-1} {\bf y}_t . \qquad \qquad \qquad (1.5)</script>
<p>Minimizing the total objective with respect to the feature vectors for fixed principal
components yields the following update:</p>
<script type="math/tex; mode=display">{\bf W} _t \leftarrow {\bf W} _{t-1}+ \eta ( {\bf x} _t- {\bf W} _{t-1} {\bf y}_t ) {\bf y}_t^\top . \qquad \qquad (1.6)</script>
<p>As before, in NN implementations of algorithm (1.5-1.6), feature vectors are represented by
synaptic weights and principal components by the activities of output neurons. Then (1.5) can
be implemented by a single-layer NN, Figure 3, in which activity dynamics converges
faster than the time interval between the arrival of successive data samples.</p>
<p>However, implementing update (1.6) in the single-layer NN architecture, Figure 3, requires
nonlocal learning rules making it biologically implausible. Indeed, the last term in (1.6) implies
that updating the weight of a synapse requires the knowledge of output activities of all other
neurons which are not available to the synapse. Moreover, the matrix of lateral connection
weights, $- {\bf W} _{t-1}^\top {\bf W} _{t-1}$, in the last term of (1.5) is computed as a Grammian of feedforward weights,
clearly a nonlocal operation. This problem is not limited to PCA and arises in networks of
nonlinear neurons as well.</p>
<p>Rather than deriving learning rules from a principled objective, many authors
constructed biologically plausible single-layer networks using local learning
rules: Hebbian for feedforward connections and anti-Hebbian for lateral connections
(anti-Hebbian meaning there is a minus sign in front of the correlation-based
synaptic weight, as for the last term in (1.5)).
However, in my view, abandoning the normative approach creates more problems
than it solves.</p>
<p><img src="https://docs.google.com/uc?export=download&id=0B5Pfjcu55RQQMlRNVDVBRDJ0TEk" alt="" /></p>
<p>Figure 3. The single-layer NN implementation of the multi-neuron online PCA algorithm derived
using the reconstruction approach requires nonlocal learning rules.</p>
<p>I have outlined how the conventional reconstruction approach fails to generate
biologically plausible multi-neuron networks for online PCA. In the next post,
I will introduce an alternative approach that overcomes this limitation.
Moreover, this approach suggests a novel view of neural computation leading to
many interesting extensions.</p>
<p>(<em>Acknowledgement: I am grateful to Sanjeev Arora for his support and encouragement as well as to Cengiz Pehlevan, Leo Shklovskii, Emily Singer, and Thomas Lin for their comments on the earlier versions.</em>)</p>
Thu, 03 Nov 2016 10:00:00 +0000
http://offconvex.github.io/2016/11/03/MityaNN1/
Gradient Descent Learns Linear Dynamical Systems
<p>From text translation to video captioning, learning to map one sequence to another is an increasingly active research area in machine learning. Fueled by the success of recurrent neural networks in their many variants, the field has seen rapid advances over the last few years. Recurrent neural networks are typically trained using some form of stochastic gradient descent combined with backpropagation for computing derivatives. The fact that gradient descent finds a useful set of parameters is by no means obvious. The training objective is typically non-convex. The fact that the model is allowed to maintain state is an additional obstacle that makes training of recurrent neural networks challenging.</p>
<p>In this post, we take a step back to reflect on the mathematics of recurrent neural networks. Interpreting recurrent neural networks as dynamical systems, we will show that stochastic gradient descent successfully learns the parameters of an unknown <em>linear</em> dynamical system even though the training objective is non-convex. Along the way, we’ll discuss several useful concepts from control theory, a field that has studied linear dynamical systems for decades. Investigating stochastic gradient descent for learning linear dynamical systems not only bears out interesting connections between machine learning and control theory, it might also provide a useful stepping stone for a deeper understanding of recurrent neural networks more broadly.</p>
<h2 id="linear-dynamical-systems">Linear dynamical systems</h2>
<p>We focus on time-invariant single-input single-output systems. For an input sequence of real numbers $x_1,\dots, x_T\in \mathbb{R}$, the system maintains a sequence of hidden states $h_1,\dots, h_T\in \mathbb{R}^n$, and produces a sequence of outputs $y_1,\dots, y_T\in \mathbb{R}$ according to the following rules:</p>
<script type="math/tex; mode=display">h_{t+1} = Ah_t + Bx_t~~~~~~~~~~~~~~~~~~~~~</script>
<script type="math/tex; mode=display">\quad \quad\quad~y_t = Ch_t+Dx_t+\xi_t ~~~~~~~~~~~~~~~~(1)</script>
<p>Here $A,B,C,D$ are linear transformations with compatible dimensions, and $\xi_t$ is Gaussian noise added to the output at each time. In the learning problem, often called system identification in control theory, we observe samples of input-output pairs $((x_1,\dots, x_T),(y_1,\dots y_T))$ and aim to recover the parameters of the underlying linear system.</p>
<p>Although control theory provides a rich set of techniques for identifying and manipulating linear systems, maximum likelihood estimation with stochastic gradient descent remains a popular heuristic.</p>
<p>We denote by $\Theta = (A,B,C,D)$ the parameters of the true system. We parametrize our model with $\widehat{\Theta} = (\hat{A},\hat{B},\hat{C},\hat{D})$, and the trained model maintains hidden states $\hat{h}_t$ and outputs $\hat{y}_t$ exactly as in equation (1). For each given example $(x,y) = ((x_1,\dots,x_T), (y_1,\dots, y_T))$, the negative log-likelihood of model $\widehat{\Theta}$ is, up to scaling and additive constants,
<script type="math/tex">f(\widehat{\Theta}, (x,y)) = \frac{1}{T}\sum_{t=1}^{T}\left\|y_t-\hat{y}_t\right\|^2</script>. The population risk is defined as the expected value of this loss,</p>
<script type="math/tex; mode=display">f(\widehat{\Theta}) = \mathbb{E}_{(x,y)} \left[f(\widehat{\Theta}, (x,y))\right]</script>
<p>Stochastic gradients of the population risk can be computed in time $O(Tn)$ via back-propagation given random samples. We can therefore directly minimize population risk using stochastic gradient descent. The question is just whether the algorithm actually converges. Even though the state transformations are linear, the objective function we defined is not convex. Luckily, we will see that the objective is still <em>close enough</em> to convex for stochastic gradient to make steady progress towards the global minimum.</p>
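<p>To make the setup concrete, here is a minimal pure-Python sketch of the system in equation (1) and of the per-example loss. The initial state $h_1 = 0$, the parameter values and the function names <code>simulate</code> and <code>loss</code> are assumptions made for this illustration; the post does not fix them.</p>

```python
import random


def simulate(A, B, C, D, xs, noise_std=0.0, rng=None):
    """Run h_{t+1} = A h_t + B x_t and y_t = C h_t + D x_t + xi_t.

    Assumes the hidden state starts at h_1 = 0 (an illustrative choice).
    A is a list of rows, B and C are lists of length n, D is a scalar.
    """
    h = [0.0] * len(B)
    ys = []
    for x in xs:
        xi = rng.gauss(0.0, noise_std) if rng else 0.0
        ys.append(sum(c * hi for c, hi in zip(C, h)) + D * x + xi)
        h = [sum(a * hj for a, hj in zip(row, h)) + b * x
             for row, b in zip(A, B)]
    return ys


def loss(ys, ys_hat):
    """Average squared error (1/T) * sum_t (y_t - yhat_t)^2."""
    return sum((y - yh) ** 2 for y, yh in zip(ys, ys_hat)) / len(ys)


# Hypothetical true system with n = 2 hidden states.
A = [[0.5, 0.1], [0.0, 0.3]]
B = [1.0, 0.5]
C = [1.0, -1.0]
D = 0.2

rng = random.Random(0)
xs = [rng.gauss(0, 1) for _ in range(50)]
ys = simulate(A, B, C, D, xs, noise_std=0.1, rng=rng)

# A model that ignores its state (C = 0) predicts only D * x_t.
ys_hat = simulate(A, B, [0.0, 0.0], D, xs)
print("loss of the stateless model:", loss(ys, ys_hat))
```

<p>Stochastic gradient descent would repeatedly draw such an example, differentiate <code>loss</code> with respect to the model parameters by backpropagation through the recursion, and take a small step.</p>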
<h2 id="hair-dryers-and-quasi-convex-functions">Hair dryers and quasi-convex functions</h2>
<p>Before we go into the math, let’s illustrate the algorithm with a pressing example that we all run into every morning: hair drying. Imagine you have a hair dryer with a <em>low</em> temperature setting and a <em>high</em> temperature setting. Neither setting is ideal. So every morning you switch between the settings frantically in an attempt to modulate to the ideal temperature. Measuring the resulting temperature (red line below) as a function of the input setting (green dots below), the picture you’ll see is something like <a href="https://www.mathworks.com/help/ident/examples/estimating-simple-models-from-real-laboratory-process-data.html?prodcode=ML">this</a>:</p>
<div style="text-align:center;">
<img style="width:800px;" src="/assets/sysid/dryer/dryer-0.svg" />
</div>
<p>You can see that the output temperature is related to the inputs. If you set the temperature to high for long enough, you’ll eventually get a high output temperature. But the system has state. Briefly lowering the temperature has little effect on the outputs. Intuition suggests that these kinds of effects should be captured by a system with two or three hidden states. So, let’s see how SGD would go about finding the parameters of the system. We’ll initialize a system with three hidden states such that before training its predictions are just the inputs of the system. We then run SGD with a fixed learning rate on the same sequence for 400 steps.</p>
<p><!-- begin animation --></p>
<div style="text-align:center;">
<img style="width:800px;" id="imganim" src="/assets/sysid/dryer/dryer-1.svg" onclick="forward_image()" />
</div>
<script type="text/javascript">//<![CDATA[
var images = [
"/assets/sysid/dryer/dryer-1.svg",
"/assets/sysid/dryer/dryer-2.svg",
"/assets/sysid/dryer/dryer-3.svg",
"/assets/sysid/dryer/dryer-4.svg",
"/assets/sysid/dryer/dryer-5.svg",
"/assets/sysid/dryer/dryer-6.svg",
"/assets/sysid/dryer/dryer-7.svg",
"/assets/sysid/dryer/dryer-8.svg",
]
var iC = 0
function forward_image(){
iC = iC + 1;
document.getElementById('imganim').src = images[iC%8];
document.getElementById('counter').textContent = 50* (iC%8);
}
//]]>
</script>
<p><!-- end animation --></p>
<p><em>The blue line shows the predictions of SGD after <span style="font-family:monospace;"><span id="counter">0</span>/400</span> gradient updates. Click to advance.</em></p>
<p>Evidently, gradient descent converges just fine on this example. Let’s look at the hair dryer objective function along the line segment between two random points in the domain.</p>
<div style="text-align:center;">
<img src="/assets/sysid/dryer-segment.svg" />
</div>
<p>The function is clearly not convex, but it doesn’t look too bad either. In particular, from the picture, it could be that the objective function is <em>quasi-convex</em>:</p>
<blockquote>
<p><strong>Definition:</strong> For $\tau > 0$, a function $f(\theta)$ is $\tau$-quasi-convex with respect to a global minimum $\theta ^ * $ if for every $\theta$,
<script type="math/tex">\langle \nabla f(\theta), \theta - \theta^* \rangle \ge \tau (f(\theta)-f(\theta^*)).</script></p>
</blockquote>
<p>Intuitively, quasi-convexity states that the descent direction $-\nabla f(\theta)$ is positively correlated with the ideal moving direction $\theta^* -\theta$. This implies that the potential function $\left\|\theta-\theta ^ * \right\|^2$ decreases in expectation at each step of stochastic gradient descent. This observation plugs nicely into the standard SGD analysis, leading to the following result:</p>
<blockquote>
<p><strong>Proposition:</strong> (informal) Suppose the population risk $f(\theta)$ is $\tau$-quasi-convex. Then stochastic gradient descent (with fresh samples at each iteration and a proper learning rate) converges to a point $\theta_K$ in $K$ iterations with error bounded by
$ f(\theta_K) - f(\theta^*) \leq O(1/(\tau \sqrt{K}))$.</p>
</blockquote>
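<p>Here is a sketch of the one-step potential argument behind the proposition. The step size $\eta$, the stochastic gradient $g_k$ (assumed unbiased, $\mathbb{E}[g_k \mid \theta_k] = \nabla f(\theta_k)$) and the second-moment bound $\mathbb{E}\|g_k\|^2 \le G^2$ are notation introduced here for illustration. For the update $\theta_{k+1} = \theta_k - \eta g_k$, conditioning on $\theta_k$ and applying $\tau$-quasi-convexity gives</p>
<script type="math/tex; mode=display">\mathbb{E}\|\theta_{k+1}-\theta^*\|^2 = \|\theta_k-\theta^*\|^2 - 2\eta\langle \nabla f(\theta_k), \theta_k-\theta^*\rangle + \eta^2\,\mathbb{E}\|g_k\|^2 \le \|\theta_k-\theta^*\|^2 - 2\eta\tau\,(f(\theta_k)-f(\theta^*)) + \eta^2 G^2.</script>
<p>Summing over $k = 0,\dots,K-1$ and telescoping bounds the average suboptimality:</p>
<script type="math/tex; mode=display">\frac{1}{K}\sum_{k=0}^{K-1}\mathbb{E}\left[f(\theta_k)-f(\theta^*)\right] \le \frac{\|\theta_0-\theta^*\|^2}{2\eta\tau K} + \frac{\eta G^2}{2\tau}.</script>
<p>Choosing $\eta$ proportional to $1/\sqrt{K}$ makes both terms $O(1/(\tau\sqrt{K}))$, which matches the rate stated above under these assumptions.</p>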
<p>The key challenge for us is to understand under what conditions we can prove that the population risk objective is in fact quasi-convex. This requires some background.</p>
<h2 id="control-theory-polynomial-roots-and-pac-man">Control theory, polynomial roots, and Pac-Man</h2>
<p>A linear dynamical system $(A,B,C,D)$ is equivalent to the system $(TAT^{-1}, TB, CT^{-1}, D)$ for any invertible matrix $T$ in terms of the behavior of the outputs. A little thought shows therefore that in its unrestricted parameterization the objective function cannot have a unique optimum. A common way of removing this redundancy is to impose a canonical form. Almost all non-degenerate systems admit the <em>controllable canonical form</em>, defined as</p>
<script type="math/tex; mode=display">% <![CDATA[
A\; = \;
\left[ \begin{array}{ccccc} 0 & 1 & 0 & \cdots & 0 \newline 0 & 0 & 1 & \cdots & 0 \newline
\vdots & \vdots & \vdots & \ddots & \vdots \newline 0 & 0 & 0 & \cdots & 1 \newline
-a_n & -a_{n-1} & -a_{n-2} & \cdots & -a_1 \end{array} \right]
\qquad
B = \left[ \begin{array}{c} 0\newline 0 \newline\vdots \newline 0 \newline 1 \end{array} \right] %]]></script>
<script type="math/tex; mode=display">% <![CDATA[
C\;~= \;
\left[ \begin{array}{ccccc} c_1~~~~& c_2~~~~ & c_3~~~~& ~~\cdots\cdots~~~~& c_n \end{array} \right]
\qquad
D =~~ \left[ \begin{array}{c} d\end{array} \right] %]]></script>
<p>We will also parametrize our training model using these forms. One of its nice properties is that the coefficients of the characteristic polynomial of the <em>state transition matrix</em> $A$ can be read off from the last row of $A$. That is,
<script type="math/tex">det(zI-A) = p_a(z) := z^n+a_1z^{n-1}+\dots + a_n.</script></p>
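<p>As a quick sanity check of this property, here is the case $n=2$, worked out by hand for illustration:</p>
<script type="math/tex; mode=display">% <![CDATA[
\det(zI-A) = \det\left[ \begin{array}{cc} z & -1 \newline a_2 & z+a_1 \end{array} \right] = z(z+a_1) + a_2 = z^2 + a_1 z + a_2 = p_a(z). %]]></script>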
<p>Even in controllable canonical form, it still seems rather difficult to learn arbitrary linear dynamical systems. A natural restriction would be <em>stability</em>, that is, to require that the eigenvalues of $A$ all have magnitude bounded by $1$. Equivalently, the roots of the characteristic polynomial should all be contained in the complex unit disc. Without stability, the state of the system could blow up exponentially, making robust learning difficult. But the set of all stable systems forms a non-convex domain. It seems daunting to guarantee that stochastic gradient descent would converge from an arbitrary starting point in this domain without ever leaving the domain.</p>
<p>We will therefore impose a stronger restriction on the roots of the characteristic polynomial. We call this the Pac-Man condition. You can think of it as a strengthening of stability.</p>
<blockquote>
<p><strong>Pac-Man condition</strong>: A linear dynamical system in controllable canonical form satisfies the Pac-Man condition if the coefficient vector $a$ defining the state transition matrix satisfies
<script type="math/tex">|Re(q_a(z))| > |Im(q_a(z))|</script> for all complex numbers $z$ of modulus $|z| = 1$, where $q_a(z) = p_a(z)/z^n = 1+a_1z^{-1}+\dots + a_nz^{-n}$.</p>
</blockquote>
<div style="text-align:center;">
<img style="width:350px;margin-bottom:50px;" src="/assets/sysid/pacman.png" />
<img style="width:400px;" src="/assets/sysid/trace-degree4.png" />
</div>
<p><em>Above, we illustrate this condition for a degree 4 system plotting the value of $q_a(z)$ on complex plane for all complex numbers $z$ on the unit circle.</em></p>
<p>We note that the Pac-Man condition is satisfied by vectors $a$ with $\|a\|_1\le \sqrt{2}/2$. Moreover, if $a$ is a random Gaussian vector with expected $\ell_2$ norm bounded by $o(1/\sqrt{\log n})$, then it will satisfy the Pac-Man condition with probability $1-o(1)$. Roughly speaking, the assumption requires that the roots of the characteristic polynomial $p_a(z)$ be relatively dispersed inside the unit circle.</p>
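<p>The condition is easy to probe numerically. The sketch below samples $z$ on a grid of the unit circle and tests $|Re(q_a(z))| > |Im(q_a(z))|$ at each point. A grid test is only a heuristic check, not a proof, and the function names, grid size and example vectors are assumptions chosen for illustration.</p>

```python
import cmath
import math

def q_a(a, z):
    """q_a(z) = 1 + a_1 z^{-1} + ... + a_n z^{-n}, with a = [a_1, ..., a_n]."""
    return 1 + sum(a_k * z ** (-(k + 1)) for k, a_k in enumerate(a))

def satisfies_pacman(a, grid=2048):
    """Heuristic grid check of |Re q_a(z)| > |Im q_a(z)| for |z| = 1."""
    for j in range(grid):
        z = cmath.exp(2j * math.pi * j / grid)
        q = q_a(a, z)
        if abs(q.real) <= abs(q.imag):
            return False
    return True

# l1 norm 0.6 <= sqrt(2)/2, so the sufficient condition applies.
small = [0.3, -0.2, 0.1]
other = [-0.2, 0.3, 0.1]
# Midpoint of two vectors that pass; it passes as well in this example.
midpoint = [(x + y) / 2 for x, y in zip(small, other)]
# p_a(z) = (z - 0.9)^2 has both roots inside the unit disc (stable),
# yet q_a(z) = (1 - 0.9/z)^2 leaves the wedge.
stable_but_not_pacman = [-1.8, 0.81]

for name, a in [("small", small), ("midpoint", midpoint),
                ("stable_but_not_pacman", stable_but_not_pacman)]:
    print(name, satisfies_pacman(a))
```

<p>The last example illustrates that the Pac-Man condition is strictly stronger than stability: a double root at $0.9$ is stable but too concentrated to satisfy it.</p>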
<p>The Pac-Man condition has three important implications:</p>
<ol>
<li>
<p>It implies via <a href="https://en.wikipedia.org/wiki/Rouch%C3%A9%27s_theorem">Rouché’s theorem</a> that the spectral radius of $A$ is smaller than 1 and therefore ensures stability of the system.</p>
</li>
<li>
<p>The vectors satisfying it form a convex set in $\mathbb{R}^n$.</p>
</li>
<li>
<p>Finally, it ensures that the objective function is <em>quasi-convex</em>.</p>
</li>
</ol>
<h2 id="main-result">Main result</h2>
<p>Relying on the Pac-Man condition, we can show:</p>
<blockquote>
<p><strong>Main theorem (Hardt, Ma, Recht, 2016)</strong>: Under the Pac-Man condition, the projected gradient descent algorithm, given $N$ sample sequences of length $T$, returns parameters $\widehat{\Theta}$ with population risk
<script type="math/tex">f(\widehat{\Theta}) \le f(\Theta) + poly(n)/\sqrt{NT}.</script></p>
</blockquote>
<p>The theorem sorts out the right dependence on $N$ and $T$. Even if there is only one sequence, we can learn the system provided that the sequence is long enough. Similarly, even if sequences are really short, we can learn provided that there are enough sequences.</p>
<h2 id="quasi-convexity-in-the-frequency-domain">Quasi-convexity in the frequency domain</h2>
<p>To establish quasi-convexity under the Pac-Man condition, we will first develop an explicit formula for the population risk in frequency domain. In doing so, we assume that $x_1,\dots, x_T$ are pairwise independent with mean 0 and variance 1. We also consider the population risk as $T\rightarrow \infty$ for simplicity in this post.</p>
<p>A simple algebraic manipulation simplifies the population risk with infinite sequence length to</p>
<script type="math/tex; mode=display">\lim_{T \rightarrow \infty} f(\widehat{\Theta}) = (\hat{D}-D)^2 + \sum_{k=0}^{\infty} (\hat{C}\hat{A}^kB-CA^k B)^2.</script>
<p>The first term, $(\hat D - D)^2$ is convex and appears nowhere else. We can safely ignore it and focus on the remaining expression instead, which we call the <em>idealized risk</em>:</p>
<script type="math/tex; mode=display">g(\widehat{\Theta}) = \sum_{k=0}^{\infty} (\hat{C}\hat{A}^kB-CA^k B)^2</script>
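<p>The idealized risk can be evaluated directly in the time domain by comparing the impulse responses $CA^kB$ and $\hat{C}\hat{A}^kB$. The sketch below does this for systems in controllable canonical form, truncating the infinite sum at a finite number of terms. The truncation is an assumption that is only reasonable for stable systems, and the function names and example values are illustrative.</p>

```python
def impulse_response(a, c, num_terms):
    """Return [C A^k B for k = 0..num_terms-1] in controllable canonical form.

    a = [a_1, ..., a_n] defines the last row of A, c = [c_1, ..., c_n] is C,
    and B is the last standard basis vector.
    """
    n = len(a)
    h = [0.0] * (n - 1) + [1.0]  # h = A^0 B = B
    out = []
    for _ in range(num_terms):
        out.append(sum(ci * hi for ci, hi in zip(c, h)))
        # h <- A h: shift entries up; last entry is -a_n h_1 - ... - a_1 h_n.
        last = -sum(a[n - 1 - i] * h[i] for i in range(n))
        h = h[1:] + [last]
    return out

def idealized_risk(a_hat, c_hat, a, c, num_terms=500):
    """Truncated g(Theta_hat) = sum_k (C_hat A_hat^k B - C A^k B)^2."""
    g_hat = impulse_response(a_hat, c_hat, num_terms)
    g = impulse_response(a, c, num_terms)
    return sum((x - y) ** 2 for x, y in zip(g_hat, g))

# One-state check: A = [0.5], C = [1], so C A^k B = 0.5^k.
geometric = impulse_response([-0.5], [1.0], 200)
energy = sum(x * x for x in geometric)  # close to 1 / (1 - 0.25)

a_true, c_true = [-0.5, 0.2], [0.3, 1.0]
risk_at_truth = idealized_risk(a_true, c_true, a_true, c_true)
risk_elsewhere = idealized_risk([0.0, 0.0], [0.0, 1.0], a_true, c_true)
print(energy, risk_at_truth, risk_elsewhere)
```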
<p>To deal with the sequence $\hat{C}\hat{A}^kB$, we take its Fourier transform and obtain that</p>
<script type="math/tex; mode=display">\hat{C}\hat{A}^kB, k\ge 1 ~~~~\longrightarrow ~~~~~~~\widehat{G}_{\lambda} = \frac{\hat{c}_1e^{i(n-1)\lambda}+\dots+ \hat{c}_n}{e^{in\lambda} + \hat{a}_1e^{i(n-1)\lambda}+\dots+\hat{a}_n}, \lambda\in [0,2\pi]</script>
<p>Similarly we take the Fourier transform of $CA^kB$, denoted by $G_{\lambda}$. Then by Parseval’s Theorem, we obtain the following alternative representation of the population risk,</p>
<script type="math/tex; mode=display">f(\widehat{\Theta}) = \int_{0}^{2\pi} |G_{\lambda}-\widehat{G}_{\lambda}|^2 d\lambda.</script>
<p>Mapping out $G_\lambda$ and $\widehat G_\lambda$ for all $\lambda\in [0, 2\pi]$ gives the following picture:</p>
<div style="text-align:center;">
<img style="width:400px;" src="/assets/sysid/transfer/approx-10.png" onclick="forward_transfer_image()" />
<img style="width:400px;" id="transfer-img" src="/assets/sysid/transfer/approx-00.png" onclick="forward_transfer_image()" />
</div>
<script type="text/javascript">//<![CDATA[
var transfer_images = [
"/assets/sysid/transfer/approx-00.png",
"/assets/sysid/transfer/approx-01.png",
"/assets/sysid/transfer/approx-02.png",
"/assets/sysid/transfer/approx-03.png",
"/assets/sysid/transfer/approx-04.png",
"/assets/sysid/transfer/approx-05.png",
"/assets/sysid/transfer/approx-06.png",
"/assets/sysid/transfer/approx-07.png",
"/assets/sysid/transfer/approx-08.png",
"/assets/sysid/transfer/approx-09.png",
"/assets/sysid/transfer/approx-10.png",
]
var iA = 0
function forward_transfer_image(){
iA = iA + 1;
document.getElementById('transfer-img').src = transfer_images[iA%11];
document.getElementById('transfer-counter').textContent = (iA%11);
}
//]]>
</script>
<p><em>Left: Target transfer function $G$. Right: Approximation $\widehat G$ at step <span style="font-family:monospace" id="transfer-counter">0</span>/10. Click to advance.</em></p>
<p>Given this pretty representation of the idealized risk objective, we can finally prove our main lemma.</p>
<blockquote>
<p><strong>Lemma:</strong> Suppose $\Theta$ satisfies the Pac-Man condition. Then,
for every $0\le \lambda\le 2\pi$, $|G_{\lambda}-\widehat{G}_{\lambda}|^2$,
as a function of $\hat{A},\hat{C}$, is quasi-convex in the Pac-Man region.</p>
</blockquote>
<p>The lemma reduces to the following simple claim.</p>
<blockquote>
<p><strong>Claim:</strong> The function $h(\hat{u},\hat{v}) = |\hat{u}/\hat{v} - u/v|^2$ is quasi-convex in the region where $Re(\hat{v}/v) > 0$.</p>
</blockquote>
<p>The proof simply involves computing the gradients and checking the conditions for quasi-convexity by elementary algebra. We omit a formal proof, but instead show a plot of the function $h(\hat{u}, \hat{v}) = (\hat{u}/\hat{v}- 1)^2$ over the reals:</p>
<p><!-- begin animation --></p>
<div style="text-align:center;">
<img style="height:600px" id="3dplot-img" src="/assets/sysid/3dplot/3dplot-30.jpg" onclick="forward_3dplot_image()" />
<p style="text-align:center;"> Click to rotate.</p>
</div>
<script type="text/javascript">//<![CDATA[
var plot3d_images = [
"/assets/sysid/3dplot/3dplot-0.jpg",
"/assets/sysid/3dplot/3dplot-10.jpg",
"/assets/sysid/3dplot/3dplot-20.jpg",
"/assets/sysid/3dplot/3dplot-30.jpg",
"/assets/sysid/3dplot/3dplot-40.jpg",
"/assets/sysid/3dplot/3dplot-50.jpg",
"/assets/sysid/3dplot/3dplot-60.jpg",
"/assets/sysid/3dplot/3dplot-70.jpg",
"/assets/sysid/3dplot/3dplot-80.jpg",
"/assets/sysid/3dplot/3dplot-90.jpg",
]
var iB = 3
var inc_sign = 1
function forward_3dplot_image(){
iB = iB + inc_sign;
if (iB == 9) {
inc_sign = -1;
}
if (iB == 0) {
inc_sign = 1;
}
document.getElementById('3dplot-img').src = plot3d_images[iB];
}
//]]>
</script>
<p><!-- end animation --></p>
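<p>For the real-valued function plotted above, the gradient computation can be checked numerically. Working out the algebra by hand (a derivation added here, not quoted from the paper) gives $\langle \nabla h, (\hat{u},\hat{v}) - (u,v)\rangle = 2\,(v/\hat{v})\,h(\hat{u},\hat{v})$, so the quasi-convexity inequality holds with $\tau = 2v/\hat{v}$, which is positive exactly when $\hat{v}/v > 0$. The sketch below verifies this identity at random points; the sampling ranges are arbitrary choices.</p>

```python
import random

def h(u_hat, v_hat, u, v):
    """h(u_hat, v_hat) = (u_hat / v_hat - u / v)^2 over the reals."""
    return (u_hat / v_hat - u / v) ** 2

def grad_h(u_hat, v_hat, u, v):
    """Partial derivatives of h with respect to u_hat and v_hat."""
    r = u_hat / v_hat - u / v
    return (2 * r / v_hat, -2 * r * u_hat / v_hat ** 2)

def correlation(u_hat, v_hat, u, v):
    """Inner product <grad h, (u_hat, v_hat) - (u, v)>."""
    gu, gv = grad_h(u_hat, v_hat, u, v)
    return gu * (u_hat - u) + gv * (v_hat - v)

rng = random.Random(0)
u, v = 1.0, 1.0  # target ratio u / v = 1, as in the plot
max_gap = 0.0
for _ in range(1000):
    u_hat = rng.uniform(-3.0, 3.0)
    v_hat = rng.uniform(0.2, 3.0)  # keeps v_hat / v > 0
    lhs = correlation(u_hat, v_hat, u, v)
    rhs = 2 * (v / v_hat) * h(u_hat, v_hat, u, v)
    max_gap = max(max_gap, abs(lhs - rhs))
print(max_gap)
```

<p>On any region where $\hat{v}/v$ is positive and bounded above, $\tau$ is therefore bounded away from zero.</p>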
<p>To see how the lemma follows from the previous claim, we note that quasi-convexity is preserved under composition with any linear transformation. Specifically, if $h(z)$ is quasi-convex, then $h(Rz)$ is also quasi-convex for any linear map $R$. So, consider the linear map:</p>
<script type="math/tex; mode=display">(\hat{a},\hat{c})\mapsto (\hat u, \hat v) = (\hat{c}_1e^{i(n-1)\lambda}+\dots+ \hat{c}_n, e^{in\lambda}
+\hat{a}_1e^{i(n-1)\lambda}+\dots+\hat{a}_n)</script>
<p>With this linear transformation, our simple claim about a bivariate function extends to show that $|G_{\lambda}-\widehat{G}_{\lambda}|^2$ is quasi-convex when $Re(\hat{v}/v) > 0$. In particular, when $\hat{a}$ and $a$ both satisfy the Pac-Man condition, then $\hat{v}$ and $v$ both reside in the 90 degree wedge. Therefore the angle between them is smaller than 90 degrees. This implies that $Re(\hat{v}/v) > 0$.</p>
<h2 id="conclusion">Conclusion</h2>
<p>We saw conditions under which stochastic gradient descent successfully learns a linear dynamical system. In <a href="https://arxiv.org/abs/1609.05191">our paper</a>, we further show that allowing our learned system to have more parameters than the target system makes the problem dramatically easier. In particular, at the expense of slight over-parameterization we can weaken the Pac-Man condition to a mild separation condition on the roots of the characteristic polynomial. This is consistent with empirical observations both in machine learning and control theory that highlight the effectiveness of additional model parameters.</p>
<p>More broadly, we hope that our techniques will be a first stepping stone toward a better theoretical understanding of recurrent neural networks.</p>
Thu, 13 Oct 2016 10:00:00 +0000
http://offconvex.github.io/2016/10/13/gradient-descent-learns-dynamical-systems/
http://offconvex.github.io/2016/10/13/gradient-descent-learns-dynamical-systems/
Linear algebraic structure of word meanings<p>Word embeddings capture the meaning of a word using a low-dimensional vector and are ubiquitous in natural language processing (NLP). (See my earlier <a href="http://www.offconvex.org/2015/12/12/word-embeddings-1/">post 1</a>
and <a href="http://www.offconvex.org/2016/02/14/word-embeddings-2/">post2</a>.) It has always been unclear how to interpret the embedding when the word in question is <em>polysemous,</em> that is, has multiple senses. For example, <em>tie</em> can mean an article of clothing, a drawn sports match, and a physical action.</p>
<p>Polysemy is an important issue in NLP and much work relies upon <a href="https://wordnet.princeton.edu/">WordNet</a>, a hand-constructed repository of word senses and their interrelationships. Unfortunately, good WordNets do not exist for most languages, and even the one in English is believed to be rather incomplete. Thus some effort has been spent on methods to find different senses of words.</p>
<p>In this post I will talk about <a href="https://arxiv.org/abs/1601.03764">my joint work with Li, Liang, Ma, Risteski</a> which shows that actually word senses are easily accessible in many current word embeddings. This goes against conventional wisdom in NLP, which is that <em>of course</em>, word embeddings do not suffice to capture polysemy since they use a single vector to represent the word, regardless of whether the word has one sense, or a dozen. Our work shows that major senses of the word lie in linear superposition within the embedding, and are extractable using sparse coding.</p>
<p>This post uses embeddings constructed using our method and the wikipedia corpus, but similar techniques also apply (with some loss in precision) to other embeddings described in <a href="http://www.offconvex.org/2015/12/12/word-embeddings-1/">post 1</a> such as word2vec, Glove, or even the decades-old PMI embedding.</p>
<h2 id="a-surprising-experiment">A surprising experiment</h2>
<p>Take the viewpoint –simplistic yet instructive– that a polysemous word like <em>tie</em> is a single lexical token that represents unrelated words <em>tie1</em>, <em>tie2</em>, …
Here is a surprising experiment that suggests that the embedding for <em>tie</em> should be approximately a weighted sum of the (hypothetical) embeddings of <em>tie1</em>, <em>tie2</em>, …</p>
<blockquote>
<p>Take two random words $w_1, w_2$. Combine them into an artificial polysemous word $w_{new}$ by replacing every occurrence of $w_1$ or $w_2$ in the corpus by $w_{new}.$ Next, compute an embedding for $w_{new}$ using the same embedding method while deleting embeddings for $w_1, w_2$ but preserving the embeddings for all other words. Compare the embedding $v_{w_{new}}$ to linear combinations of $v_{w_1}$ and
$v_{w_2}$.</p>
</blockquote>
<p>Repeating this experiment with a wide range of values for the ratio $r$ between the frequencies of $w_1$ and $w_2$, we find that $v_{w_{new}}$ lies close to the subspace spanned by $v_{w_1}$ and $v_{w_2}$: the cosine of its angle with the subspace is on average $0.97$ with standard deviation $0.02$. Thus $v_{w_{new}} \approx \alpha v_{w_1} + \beta v_{w_2}$.
We find that $\alpha \approx 1$ whereas $\beta \approx 1- c\lg r$
for some constant $c\approx 0.5$. (Note this formula is meaningful when the frequency ratio $r$ is not too large, i.e. when $ r < 10^{1/c} \approx 100$.) Thanks to this logarithm, the infrequent sense is not swamped out in the embedding, even if it is 50 times less frequent than the dominant sense. This is an important reason behind the success of our method for extracting word senses.</p>
<p>This experiment –to which we were led by our theoretical investigations– is very surprising
because the embedding is the solution to a complicated, nonconvex optimization, yet it behaves in such a striking linear way. You can read our paper for an intuitive explanation using our theoretical model from <a href="http://www.offconvex.org/2016/02/14/word-embeddings-2/">post2</a>.</p>
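<p>The comparison in this experiment amounts to a two-variable least-squares fit. The sketch below finds the best $\alpha, \beta$ with $v_{w_{new}} \approx \alpha v_{w_1} + \beta v_{w_2}$, reports the cosine of the angle between $v_{w_{new}}$ and the span, and evaluates the empirical rule $\beta \approx 1 - c\lg r$ with $\lg$ read as the base-10 logarithm. The three-dimensional vectors are made-up toy data, not real embeddings, and the function names are hypothetical.</p>

```python
import math

def dot(x, y):
    return sum(a * b for a, b in zip(x, y))

def fit_two(v_new, v1, v2):
    """Least-squares alpha, beta for v_new ~ alpha*v1 + beta*v2, plus cosine to the span."""
    g11, g12, g22 = dot(v1, v1), dot(v1, v2), dot(v2, v2)
    b1, b2 = dot(v_new, v1), dot(v_new, v2)
    det = g11 * g22 - g12 * g12  # nonzero when v1 and v2 are linearly independent
    alpha = (b1 * g22 - b2 * g12) / det
    beta = (g11 * b2 - g12 * b1) / det
    proj = [alpha * x + beta * y for x, y in zip(v1, v2)]
    # Cosine of the angle between v_new and its orthogonal projection onto the span.
    cosine = math.sqrt(dot(proj, proj) / dot(v_new, v_new))
    return alpha, beta, cosine

def beta_from_ratio(r, c=0.5):
    """Empirical rule beta ~ 1 - c * log10(r) for frequency ratio r."""
    return 1 - c * math.log10(r)

# Toy vectors only; real embeddings have hundreds of dimensions.
v1 = [1.0, 0.0, 0.0]
v2 = [0.0, 1.0, 0.0]
v_new = [1.0, 0.15, 0.05]
alpha, beta, cosine = fit_two(v_new, v1, v2)
print(alpha, beta, cosine)
print(beta_from_ratio(50))  # weight of a sense that is 50 times rarer
```

<p>For $r = 50$ the rule gives a weight of roughly $0.15$, compared with $1/50 = 0.02$ if the weight were simply proportional to frequency. This is the sense in which the logarithm keeps the rare sense from being swamped.</p>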
<h2 id="extracting-word-senses-from-embeddings">Extracting word senses from embeddings</h2>
<p>The above experiment suggests that</p>
<script type="math/tex; mode=display">v_{tie} \approx \alpha_1 \cdot v_{ tie1} + \alpha_2 \cdot v_{tie2} + \alpha_3 \cdot v_{tie3} +\cdots \qquad (1)</script>
<p>but this alone is insufficient to mathematically pin down the senses, since $v_{tie}$ can be expressed in infinitely many ways as such a combination. To pin down the senses we will interrelate the senses of different words —for example, relate the “article of clothing” sense <em>tie1</em> with <em>shoe, jacket</em> etc.</p>
<p>The word senses <em>tie1, tie2,..</em> correspond to “different things being talked about” —in other words, different word distributions occurring around <em>tie</em>.
Now remember that <a href="http://128.84.21.199/abs/1502.03520v6">our earlier paper</a> described in
<a href="http://www.offconvex.org/2016/02/14/word-embeddings-2/">post2</a> gives an interpretation of “what’s being talked about”: it is called <em>discourse</em> and
it is represented by a unit vector in the embedding space. In particular, the theoretical model
of <a href="http://www.offconvex.org/2016/02/14/word-embeddings-2/">post2</a> imagines a text corpus as being generated by a random walk on
discourse vectors. When the walk is at a discourse $c_t$ at time $t$, it outputs a few words using a loglinear distribution:</p>
<script type="math/tex; mode=display">\Pr[w~\mbox{emitted at time $t$}~|~c_t] \propto \exp(c_t\cdot v_w). \qquad (2)</script>
<p>One imagines there exists a “clothing” discourse that has high probability of outputting the <em>tie1</em> sense, and also of outputting related words such as <em>shoe, jacket,</em> etc.
Similarly there may be a “games/matches” discourse that has high probability of outputting <em>tie2</em> as well as <em>team, score</em> etc.</p>
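<p>Equation (2) is a softmax over inner products. The sketch below normalizes $\exp(c_t\cdot v_w)$ over a tiny vocabulary. The two-dimensional word vectors and the “clothing” discourse vector are invented for illustration.</p>

```python
import math

def emission_probs(c_t, word_vectors):
    """Pr[w | c_t] proportional to exp(c_t . v_w), normalized over the vocabulary."""
    scores = {w: sum(ci * vi for ci, vi in zip(c_t, v))
              for w, v in word_vectors.items()}
    top = max(scores.values())  # subtract the max for numerical stability
    weights = {w: math.exp(s - top) for w, s in scores.items()}
    total = sum(weights.values())
    return {w: wt / total for w, wt in weights.items()}

# Toy vocabulary: first coordinate ~ "clothing", second ~ "games/matches".
word_vectors = {
    "shoe": [1.0, 0.1],
    "jacket": [0.9, 0.2],
    "team": [0.1, 1.0],
    "score": [0.0, 0.9],
}
clothing_discourse = [1.0, 0.0]  # unit vector
probs = emission_probs(clothing_discourse, word_vectors)
print(sorted(probs.items(), key=lambda kv: -kv[1]))
```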
<p>By equation (2) the probability of being output by a discourse is determined by the
inner product, so one expects that the vector for “clothing” discourse has high inner product with all of <em>shoe, jacket, tie1</em> etc., and thus can stand as surrogate for $v_{tie1}$ in expression (1)! This motivates the following global optimization:</p>
<blockquote>
<p>Given word vectors in $\Re^d$, totaling about $60,000$ in this case, a sparsity parameter $k$,
and an upper bound $m$, find a set of unit vectors $A_1, A_2, \ldots, A_m$ such that
<script type="math/tex">v_w = \sum_{j=1}^m\alpha_{w,j}A_j + \eta_w \qquad (3)</script>
where at most $k$ of the coefficients $\alpha_{w,1},\dots,\alpha_{w,m}$ are nonzero (so-called <em>hard sparsity constraint</em>), and $\eta_w$ is a noise vector.</p>
</blockquote>
<p>Here $A_1, \ldots A_m$ represent important discourses in the corpus, which
we refer to as <em>atoms of discourse.</em></p>
<p>Optimization (3) is a surrogate for the desired expansion of $v_{tie}$ in (1) because one can hope that the atoms of discourse will contain atoms corresponding to <em>clothing</em>, <em>sports matches</em> etc. that will have high inner product (close to $1$) with <em>tie1,</em> <em>tie2</em> respectively. Furthermore, restricting $m$ to be much smaller than the number of words ensures that each atom needs to be used for multiple words, e.g., reuse the “clothing” atom
for <em>shoes</em>, <em>jacket</em> etc. as well as for <em>tie</em>.</p>
<p>Both $A_j$’s and $\alpha_{w,j}$’s are unknowns in this optimization. This is nothing but <em>sparse coding,</em> useful in neuroscience, image processing, computer vision, etc. It is nonconvex and computationally NP-hard in the worst case, but can be solved quite efficiently in practice using something called the k-SVD algorithm described in <a href="http://www.cs.technion.ac.il/~elad/publications/others/PCMI2010-Elad.pdf">Elad’s survey, lecture 4</a>. We solved this problem with sparsity
$k=5$ and using $m$ about $2000$. (Experimental details are in the paper. Also, some theoretical
analysis of such an algorithm is possible; see this <a href="http://www.offconvex.org/2016/05/08/almostconvexitySATM/">earlier post</a>.)</p>
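<p>To give a feel for the hard sparsity constraint in (3), here is a sketch of the coding step only, for a fixed set of atoms. It uses plain greedy matching pursuit, which is a simpler stand-in and not the k-SVD algorithm used in the experiments (k-SVD also updates the atoms themselves). The atoms and the vector below are toy values.</p>

```python
def dot(x, y):
    return sum(a * b for a, b in zip(x, y))

def matching_pursuit(v, atoms, k):
    """Greedy sparse code of v over unit-norm atoms with at most k nonzero coefficients."""
    residual = list(v)
    coeffs = {}
    for _ in range(k):
        scores = [dot(residual, atom) for atom in atoms]
        j = max(range(len(atoms)), key=lambda i: abs(scores[i]))
        if abs(scores[j]) < 1e-12:
            break  # residual is already (numerically) explained
        coeffs[j] = coeffs.get(j, 0.0) + scores[j]
        residual = [r - scores[j] * x for r, x in zip(residual, atoms[j])]
    return coeffs, residual

# Four unit-norm toy atoms in R^3; the last one is not axis-aligned.
atoms = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [0.6, 0.8, 0.0],
]
v_w = [0.9, 0.0, 0.5]  # equals 0.9 * atom 0 + 0.5 * atom 2
coeffs, residual = matching_pursuit(v_w, atoms, k=2)
print(coeffs, residual)
```

<p>Here the residual plays the role of the noise vector $\eta_w$ in (3), and the selected indices are the atoms used to represent the word.</p>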
<h2 id="experimental-results">Experimental Results</h2>
<p>Each discourse atom defines via (2) a distribution on words, which due to the exponential appearing in (2) strongly favors words whose embeddings have a larger inner product with it. In practice, this distribution is quite concentrated on as few as 50-100 words, and the “meaning” of a discourse atom can be roughly determined by looking at a few nearby words. This is how we visualize atoms in the figures below. The first figure gives a few representative atoms of discourse.</p>
<p style="text-align:center;">
<img src="http://www.cs.princeton.edu/~arora/pubs/discourseatoms.jpg" alt="A few of the 2000 atoms of discourse found" />
</p>
<p>And here are the discourse atoms used to represent two polysemous words, <em>tie</em> and <em>spring</em>:</p>
<p style="text-align:center;">
<img src="http://www.cs.princeton.edu/~arora/pubs/atomspolysemy.jpg" alt="Discourse atoms expressing the words tie and spring." />
</p>
<p>You can see that the discourse atoms do correspond to senses of these words.</p>
<p>Finally, we also have a technique that, given a target word, generates representative sentences according to its various senses as detected by the algorithm. Below are the sentences returned for
<em>ring.</em> (N.B. The mathematical meaning was missing in WordNet but was picked up by our method.)</p>
<p style="text-align:center;">
<img src="http://www.cs.princeton.edu/~arora/pubs/repsentences.jpg" alt="Representative sentences for different senses of the word ring." />
</p>
<h2 id="a-new-testbed-for-testing-comprehension-of-word-senses">A new testbed for testing comprehension of word senses</h2>
<p>Many tests have been proposed to test an algorithm’s grasp of word senses. They often involve
hard-to-understand metrics such as distance in WordNet, or are sometimes tied to performance on specific applications like web search.</p>
<p>We propose a new simple test –inspired by word-intrusion tests for topic coherence
due to <a href="https://www.umiacs.umd.edu/~jbg/docs/nips2009-rtl.pdf">Chang et al 2009</a>– which has the advantages of being easy to understand and of being administrable to humans as well.</p>
<p>We created a testbed using 200 polysemous words and their 704 senses according to WordNet. Each “sense” is represented by a set of 8 related words; these were collected from WordNet and online dictionaries by college students who were told to identify the most relevant other words occurring in the online definitions of this word sense as well as in the accompanying illustrative sentences. These 8 words are considered the <em>ground truth</em> representation of the word sense: e.g., for the “tool/weapon” sense of <em>axe</em> they were: <em>handle, harvest, cutting, split, tool, wood, battle, chop.</em></p>
<blockquote>
<p><strong>Police line-up test for word senses:</strong> the algorithm is given a random one of these 200 polysemous words and a set of $m$ senses which contain the true senses for the word as well as some <em>distractors,</em> which are randomly picked senses from other words in the testbed. The test taker has to identify the word’s true senses among these $m$ senses.</p>
</blockquote>
<p>As usual, accuracy is measured using <em>precision</em> (what fraction of the algorithm/human’s guesses
were correct) and <em>recall</em> (what fraction of the correct senses were among the guesses).</p>
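<p>For a single test word, the two scores can be computed as follows. This is a minimal sketch; the sense labels are hypothetical identifiers, not entries from the actual testbed.</p>

```python
def precision_recall(guessed, true_senses):
    """Precision and recall of a set of guessed senses against the true senses."""
    guessed, true_senses = set(guessed), set(true_senses)
    correct = len(guessed & true_senses)
    precision = correct / len(guessed) if guessed else 0.0
    recall = correct / len(true_senses) if true_senses else 0.0
    return precision, recall

# Hypothetical line-up for the word "tie": two right guesses, one distractor picked.
guessed = {"tie:clothing", "tie:drawn-match", "axe:tool"}
true_senses = {"tie:clothing", "tie:drawn-match", "tie:fasten"}
precision, recall = precision_recall(guessed, true_senses)
print(precision, recall)
```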
<p>For $m=20$ and $k=4$, our algorithm succeeds with precision $63\%$ and recall $70\%$, and performance remains reasonable for $m=50$. We also administered the test to a group of grad students.
Native English speakers had precision/recall scores in the $75$ to $90$ percent range.
Non-native speakers had scores roughly similar to our algorithm.</p>
<p>Our algorithm works something like this: If $w$ is the target word, then take all discourse atoms
computed for that word, and compute a certain similarity score between each atom and each of the $m$ senses, where the words in the senses are represented by their word vectors. (Details are in the paper.)</p>
<h2 id="takeaways">Takeaways</h2>
<p>Word embeddings have been useful in a host of other settings, and now it appears that
they also can easily yield different senses of a polysemous word. We have some subsequent applications of these ideas to other previously studied settings, including topic models, creating
WordNets for other languages, and understanding the semantic content of fMRI brain measurements. I’ll describe some of them in future posts.</p>
Sun, 10 Jul 2016 10:30:00 +0000
http://offconvex.github.io/2016/07/10/embeddingspolysemy/
http://offconvex.github.io/2016/07/10/embeddingspolysemy/