Off the convex pathAlgorithms off the convex path.
http://offconvex.github.io/
Understanding optimization in deep learning by analyzing trajectories of gradient descent<p>Neural network optimization is fundamentally non-convex, and yet simple gradient-based algorithms seem to consistently solve such problems.
This phenomenon is one of the central pillars of deep learning, and forms a mystery many of us theorists are trying to unravel.
In this post I’ll survey some recent attempts to tackle this problem, finishing off with a discussion on my <a href="https://arxiv.org/pdf/1810.02281.pdf">new paper with Sanjeev Arora, Noah Golowich and Wei Hu</a>, which for the case of gradient descent over deep linear neural networks, provides a guarantee for convergence to global minimum at a linear rate.</p>
<h2 id="landscape-approach-and-its-limitations">Landscape Approach and Its Limitations</h2>
<p>Many papers on optimization in deep learning implicitly assume that a rigorous understanding will follow from establishing geometric properties of the loss <em>landscape</em>, and in particular, of <em>critical points</em> (points where the gradient vanishes).
For example, through an analogy with the spherical spin-glass model from condensed matter physics, <a href="http://proceedings.mlr.press/v38/choromanska15.pdf">Choromanska et al. 2015</a> argued for what has become a colloquial conjecture in deep learning:</p>
<blockquote>
<p><strong>Landscape Conjecture:</strong>
In neural network optimization problems, suboptimal critical points are very likely to have negative eigenvalues to their Hessian.
In other words, there are almost <em>no poor local minima</em>, and nearly all <em>saddle points are strict</em>.</p>
</blockquote>
<p>Strong forms of this conjecture were proven for loss landscapes of various simple problems involving <strong>shallow</strong> (two layer) models, e.g. <a href="https://papers.nips.cc/paper/6271-global-optimality-of-local-search-for-low-rank-matrix-recovery.pdf">matrix sensing</a>, <a href="https://papers.nips.cc/paper/6048-matrix-completion-has-no-spurious-local-minimum.pdf">matrix completion</a>, <a href="http://proceedings.mlr.press/v40/Ge15.pdf">orthogonal tensor decomposition</a>, <a href="https://arxiv.org/pdf/1602.06664.pdf">phase retrieval</a>, and <a href="http://proceedings.mlr.press/v80/du18a/du18a.pdf">neural networks with quadratic activation</a>.
There was also work on establishing convergence of gradient descent to global minimum when the Landscape Conjecture holds, as described in the excellent posts on this blog by <a href="http://www.offconvex.org/2016/03/22/saddlepoints/">Rong Ge</a>, <a href="http://www.offconvex.org/2016/03/24/saddles-again/">Ben Recht</a> and <a href="http://www.offconvex.org/2017/07/19/saddle-efficiency/">Chi Jin and Michael Jordan</a>.
They describe how gradient descent can arrive at a second order local minimum (critical point whose Hessian is positive semidefinite) by escaping all strict saddle points, and how this process is efficient given that perturbations are added to the algorithm.
Note that under the Landscape Conjecture, i.e. when there are no poor local minima and non-strict saddles, second order local minima are also global minima.</p>
<p style="text-align:center;">
<img src="/assets/optimization-beyond-landscape-points.png" width="100%" alt="Local minima and saddle points" />
</p>
<p>However, it has become clear that the landscape approach (and the Landscape Conjecture) cannot be applied as is to <strong>deep</strong> (three or more layer) networks, for several reasons.
First, deep networks typically induce non-strict saddles (e.g. at the point where all weights are zero, see <a href="https://papers.nips.cc/paper/6112-deep-learning-without-poor-local-minima.pdf">Kawaguchi 2016</a>).
Second, a landscape perspective largely ignores algorithmic aspects that empirically are known to greatly affect convergence with deep networks — for example the <a href="http://proceedings.mlr.press/v28/sutskever13.html">type of initialization</a>, or <a href="http://proceedings.mlr.press/v37/ioffe15.pdf">batch normalization</a>.
Finally, as I argued in my <a href="http://www.offconvex.org/2018/03/02/acceleration-overparameterization/">previous blog post</a>, based upon <a href="http://proceedings.mlr.press/v80/arora18a/arora18a.pdf">work with Sanjeev Arora and Elad Hazan</a>, adding (redundant) linear layers to a classic linear model can sometimes accelerate gradient-based optimization, without any gain in expressiveness, and despite introducing non-convexity to a formerly convex problem.
Any landscape analysis that relies on properties of critical points alone will have difficulty explaining this phenomenon, as through such lens, nothing is easier to optimize than a convex objective with a single critical point which is the global minimum.</p>
<h2 id="a-way-out">A Way Out?</h2>
<p>The limitations of the landscape approach for analyzing optimization in deep learning suggest that it may be abstracting away too many important details.
Perhaps a more relevant question than “is the landscape graceful?” is “what is the behavior of specific optimizer <strong>trajectories</strong> emanating from specific initializations?”.</p>
<p style="text-align:center;">
<img src="/assets/optimization-beyond-landscape-trajectories.png" width="66%" alt="Different trajectories lead to qualitatively different results" />
</p>
<p>While the trajectory-based approach is seemingly much more burdensome than landscape analyses, it is already leading to notable progress.
Several recent papers (e.g. <a href="http://proceedings.mlr.press/v70/brutzkus17a/brutzkus17a.pdf">Brutzkus and Globerson 2017</a>; <a href="https://papers.nips.cc/paper/6662-convergence-analysis-of-two-layer-neural-networks-with-relu-activation.pdf">Li and Yuan 2017</a>; <a href="http://proceedings.mlr.press/v70/zhong17a/zhong17a.pdf">Zhong et al. 2017</a>; <a href="http://proceedings.mlr.press/v70/tian17a/tian17a.pdf">Tian 2017</a>; <a href="https://openreview.net/pdf?id=rJ33wwxRb">Brutzkus et al. 2018</a>; <a href="http://proceedings.mlr.press/v75/li18a/li18a.pdf">Li et al. 2018</a>; <a href="https://arxiv.org/pdf/1806.00900.pdf">Du et al. 2018</a>; <a href="http://romaincouillet.hebfree.org/docs/conf/nips_GDD.pdf">Liao et al. 2018</a>) have adopted this strategy, successfully analyzing different types of shallow models.
Moreover, trajectory-based analyses are beginning to set foot beyond the realm of the landscape approach — for the case of linear neural networks, they have successfully established convergence of gradient descent to global minimum under <strong>arbitrary depth</strong>.</p>
<h2 id="trajectory-based-analyses-for-deep-linear-neural-networks">Trajectory-Based Analyses for Deep Linear Neural Networks</h2>
<p>Linear neural networks are fully-connected neural networks with linear (no) activation.
Specifically, a depth $N$ linear network with input dimension $d_0$, output dimension $d_N$, and hidden dimensions $d_1,d_2,\ldots,d_{N-1}$, is a linear mapping from $\mathbb{R}^{d_0}$ to $\mathbb{R}^{d_N}$ parameterized by $x \mapsto W_N W_{N-1} \cdots W_1 x$, where $W_j \in \mathbb{R}^{d_j \times d_{j-1}}$ is regarded as the weight matrix of layer $j$.
Though trivial from a representational perspective, linear neural networks are, somewhat surprisingly, complex in terms of optimization — they lead to non-convex training problems with multiple minima and saddle points.
Being viewed as a theoretical surrogate for optimization in deep learning, the application of gradient-based algorithms to linear neural networks is receiving significant attention these days.</p>
<p>To my knowledge, <a href="https://arxiv.org/pdf/1312.6120.pdf">Saxe et al. 2014</a> were the first to carry out a trajectory-based analysis for deep (three or more layer) linear networks, treating gradient flow (gradient descent with infinitesimally small learning rate) minimizing $\ell_2$ loss over whitened data.
Though a very significant contribution, this analysis did not formally establish convergence to global minimum, nor treat the aspect of computational complexity (number of iterations required to converge).
The recent work of <a href="http://proceedings.mlr.press/v80/bartlett18a.html">Bartlett et al. 2018</a> makes progress towards addressing these gaps, by applying a trajectory-based analysis to gradient descent for the special case of linear residual networks, i.e. linear networks with uniform width across all layers ($d_0=d_1=\cdots=d_N$) and identity initialization ($W_j=I$, $\forall j$).
Considering different data-label distributions (which boil down to what they refer to as “targets”), Bartlett et al. demonstrate cases where gradient descent provably converges to global minimum at a linear rate — loss is less than $\epsilon>0$ from optimum after $\mathcal{O}(\log\frac{1}{\epsilon})$ iterations — as well as situations where it fails to converge.</p>
<p>In a <a href="https://arxiv.org/pdf/1810.02281.pdf">new paper with Sanjeev Arora, Noah Golowich and Wei Hu</a>, we take an additional step forward in virtue of the trajectory-based approach.
Specifically, we analyze trajectories of gradient descent for any linear neural network that does not include “bottleneck layers”, i.e. whose hidden dimensions are no smaller than the minimum between the input and output dimensions ($d_j \geq \min\{d_0,d_N\}$, $\forall j$), and prove convergence to global minimum, at a linear rate, provided that initialization meets the following two conditions:
<em>(i)</em> <em>approximate balancedness</em> — $W_{j+1}^\top W_{j+1} \approx W_j W_j^\top$, $\forall j$;
and <em>(ii)</em> <em>deficiency margin</em> — initial loss is smaller than the loss of any rank deficient solution.
We show that both conditions are necessary, in the sense that violating any one of them may lead to a trajectory that fails to converge.
Approximate balancedness at initialization is trivially met in the special case of linear residual networks, and also holds for the customary setting of initialization via small random perturbations centered at zero.
The latter also leads to deficiency margin with positive probability.
For the case $d_N=1$, i.e. scalar regression, we provide a random initialization scheme under which both conditions are met, and thus convergence to global minimum at linear rate takes place, with constant probability.</p>
<p>Key to our analysis is the observation that if weights are initialized to be approximately balanced, they will remain that way throughout the iterations of gradient descent.
In other words, trajectories taken by the optimizer adhere to a special characterization:</p>
<blockquote>
<p><strong>Trajectory Characterization:</strong> <br /></p>
<p><script type="math/tex">W_{j+1}^\top(t) W_{j+1}(t) \approx W_j(t) W_j^\top(t), \quad j=1,\ldots,N-1, \quad t=0,1,\ldots</script>,</p>
</blockquote>
<p>which means that throughout the entire timeline, all layers have (approximately) the same set of singular values, and the left singular vectors of each layer (approximately) coincide with the right singular vectors of the layer that follows.
We show that this regularity implies steady progress for gradient descent, thereby demonstrating that even in cases where the loss landscape is complex as a whole (includes many non-strict saddle points), it may be particularly well-behaved around the specific trajectories taken by the optimizer.</p>
<h2 id="conclusion">Conclusion</h2>
<p>Tackling the question of optimization in deep learning through the landscape approach, i.e. by analyzing the geometry of the objective independently of the algorithm used for training, is conceptually appealing.
However this strategy suffers from inherent limitations, predominantly as it requires the entire objective to be graceful, which seems to be too strict of a demand.
The alternative approach of taking into account the optimizer and its initialization, and focusing on the landscape only along the resulting trajectories, is gaining more and more traction.
While landscape analyses have thus far been limited to shallow (two layer) models only, the trajectory-based approach has recently treated arbitrarily deep models, proving convergence of gradient descent to global minimum at a linear rate.
Much work however remains to be done, as this success covered only linear neural networks.
I expect the trajectory-based approach to be key in developing our formal understanding of gradient-based optimization for deep non-linear networks as well.</p>
<p><a href="http://www.cohennadav.com/">Nadav Cohen</a></p>
Wed, 07 Nov 2018 04:00:00 -0800
http://offconvex.github.io/2018/11/07/optimization-beyond-landscape/
http://offconvex.github.io/2018/11/07/optimization-beyond-landscape/Simple and efficient semantic embeddings for rare words, n-grams, and language features<p>Distributional methods for capturing meaning, such as word embeddings, often require observing many examples of words in context. But most humans can infer a reasonable meaning from very few or even a single occurrence. For instance, if we read “Porgies live in shallow temperate marine waters,” we have a good idea that a <em>porgy</em> is a fish. Since language corpora often have a long tail of “rare words,” it is an interesting problem to imbue NLP algorithms with this capability. This is especially important for n-grams (i.e., ordered n-tuples of words, like “ice cream”), many of which occur rarely in the corpus.</p>
<p>Here we describe a simple but principled approach called <em>à la carte</em> embeddings, described in our <a href="http://aclweb.org/anthology/P18-1002">ACL’18 paper</a> with Yingyu Liang, Tengyu Ma, and Brandon Stewart. It also easily extends to learning embeddings of arbitrary language features such as word-senses and $n$-grams. The paper also combines these with our recent <a href="http://www.offconvex.org/2018/06/25/textembeddings/">deep-learning-free text embeddings</a> to get simple deep-learning free text embeddings with even better performance on downstream classification tasks, quite competitive with deep learning approaches.</p>
<h2 id="inducing-word-embedding-from-their-contexts-a-surprising-linear-relationship">Inducing word embedding from their contexts: a surprising linear relationship</h2>
<p>Suppose a single occurrence of a word $w$ is surrounded by a sequence $c$ of words. What is a reasonable guess for the word embedding $v_w$ of $w$? For convenience, we will let $u_w^c$ denote the average of the word embeddings of words in $c$. Anybody who knows the word2vec method may reasonably guess the following.</p>
<blockquote>
<p><strong>Guess 1:</strong> Up to scaling, $u_w^c$ is a good estimate for $v_w$.</p>
</blockquote>
<p>Unfortunately, this totally fails. Even taking thousands of occurrences of $w$, the average of such estimates stays far from the ground truth embedding $v_w$. The following discovery should therefore be surprising (read below for a theoretical justification):</p>
<blockquote>
<p><strong>Theorem 1</strong> (From <a href="https://transacl.org/ojs/index.php/tacl/article/view/1346">this TACL’18 paper</a>): There is a single matrix $A$ (depending only upon the text corpus) such that $A u_w^c$ is a good estimate for $v_w$.</p>
</blockquote>
<p>Note that the best such $A$ can be found via linear regression by minimizing the average $|Au_w^c -v_w|^2$ over occurrences of frequent words $w$, for which we already have word embeddings.</p>
<p>Once such an $A$ has been learnt from frequent words, the induction of embeddings for new words works very well. As we receive more and more occurrences of $w$ the average of $Au_w^c$ over all sentences containing $w$ has cosine similarity $>0.9$ with the true word embedding $v_w$ (this holds for GloVe as well as word2vec).</p>
<p>Thus the learnt $A$ gives a way to induce embeddings for new words from a few or even a single occurrence. We call this the <em>à la carte</em> embedding of $w$, because we don’t need to pay the <em>prix fixe</em> of re-running GloVe or word2vec on the entire corpus each time a new word is needed.</p>
<h3 id="testing-embeddings-for-rare-words">Testing embeddings for rare words</h3>
<p>Using Stanford’s <a href="https://nlp.stanford.edu/~lmthang/morphoNLM/">Rare Words</a> dataset we created the
<a href="http://nlp.cs.princeton.edu/CRW/"><em>Contextual Rare Words</em></a> dataset where, along with word pairs and human-rated scores, we also provide contexts (i.e., few usages) for the rare words.</p>
<p>We compare the performance of our method with alternatives such as <a href="http://www.offconvex.org/2018/06/17/textembeddings/">top singular component removal and frequency down-weighting</a> and find that <em>à la carte</em> embedding consistently outperforms other methods and requires far fewer contexts to match their best performance.
Below we plot the increase in Spearman correlation with human ratings as the tested algorithms are given more samples of the words in context. We see that given only 8 occurences of the word, the <em>a la carte</em> method outperforms other baselines that’re given 128 occurences.</p>
<p style="text-align:center;">
<img src="/assets/ALCcrwplot.svg" width="60%" />
</p>
<p>Now we turn to the task mentioned in the opening para of this post. <a href="http://aclweb.org/anthology/D17-1030">Herbelot and Baroni</a> constructed a “nonce” dataset consisting of single-word concepts and their Wikipedia definitions, to test algorithms that “simulate the process by which a competent speaker encounters a new word in known contexts.” They tested various methods, including a modified version of word2vec.
As we show in the table below, <em>à la carte</em> embedding outperforms all their methods in terms of the average rank of the target vector’s similarity with the constructed vector. The true word embedding is among the closest 165 or so word vectors to our embedding.
(Note that the vocabulary size exceeds 200K, so this is considered a strong performance.)</p>
<p style="text-align:center;">
<img src="/assets/ALCnonce.svg" width="50%" />
</p>
<h2 id="a-theory-of-induced-embeddings-for-general-features">A theory of induced embeddings for general features</h2>
<p>Why should the matrix $A$ mentioned above exist in the first place?
Sanjeev, Yingyu, and Tengyu’s <a href="https://transacl.org/ojs/index.php/tacl/article/view/1346">TACL’18</a> paper together with Yuanzhi Li and Andrej Risteski gives a justification via a latent-variable model of corpus generation that is a modification of their earlier model described in <a href="https://transacl.org/ojs/index.php/tacl/article/view/742">TACL’16</a> (see also this <a href="http://www.offconvex.org/2016/02/14/word-embeddings-2/">blog post</a>) The basic idea is to consider a random walk over an ellipsoid instead of the unit square.
Under this modification of the rand-walk model, whose approximate MLE objective is similar to that of GloVe, their first theorem shows the following:</p>
<script type="math/tex; mode=display">\exists~A\in\mathbb{R}^{d\times d}\textrm{ s.t. }v_w=A\mathbb{E} \left[\frac{1}{n}\sum\limits_{w'\in c}v_{w'}\bigg|w\in c\right]=A\mathbb{E}v_w^\textrm{avg}~\forall~w</script>
<p>where the expectation is taken over possible contexts $c$.</p>
<p>This result also explains the linear algebraic structure of the embeddings of polysemous words (words having multiple possible meanings, such as <em>tie</em>) discussed in an earlier <a href="http://www.offconvex.org/2016/07/10/embeddingspolysemy/">post</a>.
Assuming for simplicity that $tie$ only has two meanings (<em>clothing</em> and <em>game</em>), it is easy to see that its word embedding is a linear transformation of the sum of the average context vectors of its two senses:</p>
<script type="math/tex; mode=display">v_w=A\mathbb{E}v_w^\textrm{avg}=A\mathbb{E}\left[v_\textrm{clothing}^\textrm{avg}+v_\textrm{game}^\textrm{avg}\right]=A\mathbb{E}v_\textrm{clothing}^\textrm{avg}+A\mathbb{E}v_\textrm{game}^\textrm{avg}</script>
<p>The above also shows that we can get a reasonable estimate for the vector of the sense <em>clothing</em>, and, by extension many other features of interest, by setting $v_\textrm{clothing}=A\mathbb{E}v_\textrm{clothing}^\textrm{avg}$.
Note that this linear method also subsumes other context representations, such as removing the <a href="http://www.offconvex.org/2018/06/17/textembeddings/">top singular component or down-weighting frequent directions</a>.</p>
<h3 id="n-gram-embeddings">$n$-gram embeddings</h3>
<p>While the theory suggests existence of a linear transform between word embeddings and their context embeddings, one could also use this linear transform to induce embeddings for other kinds of linguistic features in context.
We test this hypothesis by inducing embeddings for $n$-grams by using contexts from a large text corpus and word embeddings trained on the same corpus.
A qualitative evaluation of the $n$-gram embeddings is done by finding the closest words to it in terms of cosine similarity between the embeddings.
As evident from the below figure, <em>à la carte</em> bigram embeddings capture the meaning of the phrase better than some other compositional and learned bigram embeddings.</p>
<p style="text-align:center;">
<img src="/assets/ALCngram_quality.png" width="65%" />
</p>
<h3 id="sentence-embeddings">Sentence embeddings</h3>
<p>We also use these $n$-gram embeddings to construct sentence embeddings, similarly to <a href="http://www.offconvex.org/2018/06/25/textembeddings/">DisC embeddings</a>, to evaluate on classification tasks.
A sentence is embedded as the concatenation of sums of embeddings for $n$-gram in the sentence for use in downstream classification tasks.
Using this simple approach we can match the performance of other linear and LSTM representations, even obtaining state-of-the-art results on some of them. Note that Logeswaran and Lee is a contemporary paper that uses deep nets.</p>
<p style="text-align:center;">
<img src="/assets/ALCngram_clf.svg" width="80%" />
</p>
<h2 id="discussion">Discussion</h2>
<p>Our <em>à la carte</em> method is simple, almost elementary, and yet gives results competitive with many other feature embedding methods and also beats them in many cases.
Can one do zero-shot learning of word embeddings, i.e. inducing embeddings for a words/features without any context?
Character level methods such as <a href="https://fasttext.cc/">fastText</a> can do this and it is a good problem to incorporate character level information into the <em>à la carte</em> approach (the few things we tried didn’t work so far).</p>
<p>The <em>à la carte</em> code is <a href="https://github.com/NLPrinceton/ALaCarte">available here</a>, allowing you to re-create the results described.</p>
Tue, 18 Sep 2018 02:00:00 -0700
http://offconvex.github.io/2018/09/18/alacarte/
http://offconvex.github.io/2018/09/18/alacarte/When Recurrent Models Don't Need to be Recurrent<p>In the last few years, deep learning practitioners have proposed a litany of
different sequence models. Although recurrent neural networks were once the
tool of choice, now models like the autoregressive
<a href="https://deepmind.com/blog/wavenet-generative-model-raw-audio/">Wavenet</a> or the
<a href="https://ai.googleblog.com/2017/08/transformer-novel-neural-network.html">Transformer</a>
are replacing RNNs on a diverse set of tasks. In this post, we explore the
trade-offs between recurrent and feed-forward models. Feed-forward models can
offer improvements in training stability and speed, while recurrent models are
strictly more expressive. Intriguingly, this added expressivity does not seem to
boost the performance of recurrent models. Several groups have shown
feed-forward networks can match the results of the best recurrent models on
benchmark sequence tasks. This phenomenon raises an interesting question for
theoretical investigation:</p>
<blockquote>
<p>When and why can feed-forward networks replace recurrent neural networks
without a loss in performance?</p>
</blockquote>
<p>We discuss several proposed answers to this question and highlight our
<a href="https://arxiv.org/abs/1805.10369">recent work</a> that offers an explanation in
terms of a fundamental stability property.</p>
<h1 id="a-tale-of-two-sequence-models">A Tale of Two Sequence Models</h1>
<h2 id="recurrent-neural-networks">Recurrent Neural Networks</h2>
<p>The many variants of recurrent models all have a similar form. The model
maintains a state $h_t$ that summarizes the past sequence of inputs. At each
time step $t$, the state is updated according to the equation
[
h_{t+1} = \phi(h_t, x_t),
]
where $x_t$ is the input at time $t$, $\phi$ is a differentiable map, and $h_0$
is an initial state. In a vanilla recurrent neural network, the model is
parameterized by matrices $W$ and $U$, and the state is updated according to
[
h_{t+1} = \tanh(Wh_t + Ux_t).
]
In practice, the <a href="http://colah.github.io/posts/2015-08-Understanding-LSTMs/">Long Short-Term Memory
(LSTM)</a> network is
more frequently used. In either case, to make predictions, the state is passed
to a function $f$, and the model predicts $y_t = f(h_t)$. Since the state $h_t$
is a function of all of the past inputs $x_0, \dots, x_t$, the prediction $y_t$
depends on the entire history $x_0, \dots, x_t$ as well.</p>
<p>A recurrent model can also be represented graphically.</p>
<p style="text-align:center;">
<img src="/assets/approx_recurrent/recurrent_net.png" width="500px" height="250px" />
</p>
<p>Recurrent models are fit to data using backpropagation. However, backpropagating
gradients from time step $T$ to time step $0$ often requires infeasibly large
amounts of memory, so essentially every implementation of a recurrent model
<em>truncates</em> the model and only backpropagates gradient $k$ times steps.</p>
<figure>
<p style="text-align:center;">
<img src="/assets/approx_recurrent/truncated_backprop.png" />
</p>
<figcaption>
<small>
Source: <a href="https://r2rt.com/styles-of-truncated-backpropagation.html">
https://r2rt.com/styles-of-truncated-backpropagation.html </a>
</small>
</figcaption>
</figure>
<p>In this setup, the predictions of the recurrent model still depend on the entire
history $x_0, \dots, x_T$. However, it’s not clear how this training procedure
affects the model’s ability to learn long-term patterns, particularly those that
require more than $k$ steps.</p>
<h2 id="autoregressive-feed-forward-models">Autoregressive, Feed-Forward Models</h2>
<p>Instead of making predictions from a state that depends on the entire history,
an autoregressive model directly predicts $y_t$ using only the $k$ most recent
inputs, $x_{t-k+1}, \dots, x_{t}$. This corresponds to a strong <em>conditional
independence</em> assumption. In particular, a feed-forward model assumes the target
only depends on the $k$ most recent inputs. Google’s
<a href="https://arxiv.org/abs/1609.03499">WaveNet</a> nicely illustrates this general
principle.</p>
<figure>
<p style="text-align:center;">
<img src="https://storage.googleapis.com/deepmind-live-cms/documents/BlogPost-Fig2-Anim-160908-r01.gif" />
</p>
<figcaption>
<small>
Source: <a href="https://deepmind.com/blog/wavenet-generative-model-raw-audio/">
https://deepmind.com/blog/wavenet-generative-model-raw-audio/</a>
</small>
</figcaption>
</figure>
<p>In contrast to an RNN, the limited context of a feed-forward model means that it
cannot capture patterns that extend more than $k$ steps. However, using
techniques like dilated-convolutions, one can make $k$ quite large.</p>
<h1 id="why-care-about-feed-forward-models">Why Care About Feed-Forward Models?</h1>
<p>At the outset, recurrent models appear to be a strictly more flexible and
expressive model class than feed-forward models. After all, feed-forward
networks make a strong conditional independence assumption that recurrent models
don’t make. Even if feed-forward models are less expressive, there are still
several reasons one might prefer a feed-forward network.</p>
<ul>
<li><strong>Parallelization</strong>: Convolutional feed-forward models are easier to <a href="https://arxiv.org/abs/1705.03122">parallelize
at training time</a>.
There’s no hidden state to update and maintain, and
therefore no sequential dependencies between outputs. This allows very
efficient implementations of training on modern hardware.</li>
<li><strong>Trainability</strong>: Training deep convolutional neural networks is the
bread-and-butter of deep learning. Whereas recurrent models are often more
finicky and difficult to <a href="https://arxiv.org/abs/1211.5063">optimize</a>,
significant effort has gone into designing architectures and software to
efficiently and reliably train deep feed-forward networks.</li>
<li><strong>Inference Speed</strong>: In some cases, feed-forward models can be significantly
more light-weight and perform <a href="https://arxiv.org/abs/1211.5063">inference faster than similar recurrent
systems</a>. In other cases,
particularly for long sequences, autoregressive inference is a large
bottleneck and requires <a href="https://arxiv.org/abs/1702.07825">significant engineering
work</a> or <a href="https://arxiv.org/abs/1711.10433">significant
cleverness</a> to overcome.</li>
</ul>
<h1 id="feed-forward-models-can-outperform-recurrent-models">Feed-Forward Models Can Outperform Recurrent Models</h1>
<p>Although it appears trainability and parallelization for feed-forward models
comes at the price of reduced accuracy, there have been several recent examples
showing that feed-forward networks can actually achieve the same accuracies as
their recurrent counterparts on benchmark tasks.</p>
<ul>
<li>
<p><strong>Language Modeling.</strong>
In language modeling, the goal is to predict the next word in a document given
all of the previous words. Feed-forward models make predictions using only the
$k$ most recent words, whereas recurrent models can potentially use the entire
document. The <a href="https://arxiv.org/abs/1612.08083">Gated-Convolutional Language
Model</a> is a feed-forward autoregressive models
that is competitive with <a href="https://arxiv.org/abs/1602.02410">large LSTM baseline
models</a>. Despite using a truncation length of
$k=25$, the model outperforms a large LSTM on the
<a href="https://einstein.ai/research/the-wikitext-long-term-dependency-language-modeling-dataset">Wikitext-103</a>
benchmark, which is designed to reward models that capture long-term
dependencies. On the <a href="http://www.statmt.org/lm-benchmark/">Billion Word
Benchmark</a>, the model is slightly worse
than the largest LSTM, but is faster to train and uses fewer resources.</p>
</li>
<li>
<p><strong>Machine Translation.</strong>
The goal in machine translation is to map sequences of English words to
sequences of, say, French words. Feed-forward models make translations using
only $k$ words of the sentence, whereas recurrent models can leverage the entire
sentence. Within the deep learning world, variants of the LSTM-based <a href="https://arxiv.org/abs/1409.0473">Sequence
to Sequence with Attention</a> model, particularly
<a href="https://arxiv.org/abs/1609.08144">Google Neural Machine Translation</a>, were
superseded first by a fully <a href="https://arxiv.org/abs/1705.03122">convolutional sequence to
sequence</a> model and then by the
<a href="https://arxiv.org/abs/1706.03762">Transformer</a>.<sup id="fnref:1"><a href="#fn:1" class="footnote">1</a></sup></p>
</li>
</ul>
<figure>
<p style="text-align:center;">
<img src="https://raw.githubusercontent.com/facebookresearch/fairseq/master/fairseq.gif" />
</p>
<figcaption>
<small>
Source: <a href="https://github.com/facebookresearch/fairseq/blob/master/fairseq.gif">
https://github.com/facebookresearch/fairseq/blob/master/fairseq.gif </a>
</small>
</figcaption>
</figure>
<ul>
<li>
<p><strong>Speech Synthesis.</strong>
In speech synthesis, one seeks to generate a realistic human speech signal.
Feed-forward models are limited to the past $k$ samples, whereas recurrent
models can use the entire history. Upon publication, the feed-forward,
autoregressive <a href="https://arxiv.org/abs/1609.03499">WaveNet</a> was a substantial
improvement over LSTM-RNN parametric models.</p>
</li>
<li>
<p><strong>Everthing Else.</strong>
Recently <a href="https://arxiv.org/abs/1803.01271">Bai et al.</a> proposed a generic
feed-forward model leveraging dilated convolutions and showed it outperforms
recurrent baselines on tasks ranging from synthetic copying tasks to music
generation.</p>
</li>
</ul>
<h1 id="how-can-feed-forward-models-outperform-recurrent-ones">How Can Feed-Forward Models Outperform Recurrent Ones?</h1>
<p>In the examples above, feed-forward networks achieve results on par with or
better than recurrent networks. This is perplexing since recurrent models
seem to be more powerful a priori. One explanation for this phenomenon is
given by <a href="https://arxiv.org/abs/1612.08083">Dauphin et al.</a>:</p>
<blockquote>
<p>The unlimited context offered by recurrent models is not strictly necessary
for language modeling.</p>
</blockquote>
<p>In other words, it’s possible you don’t need a large amount of context to do
well on the prediction task on average. <a href="https://arxiv.org/abs/1612.02526">Recent theoretical
work</a> offers some evidence in favor of this view.</p>
<p>Another explanation is given by <a href="https://arxiv.org/abs/1803.01271">Bai et al.</a>:</p>
<blockquote>
<p>The “infinite memory” advantage of RNNs is largely absent in practice.</p>
</blockquote>
<p>As Bai et al. report, even in experiments explicitly requiring long-term
context, RNN variants were unable to learn long sequences. On the Billion Word
Benchmark, an <a href="https://arxiv.org/abs/1703.10724">intriguing Google Technical
Report</a> suggests an LSTM $n$-gram model with
$n=13$ words of memory is as good as an LSTM with arbitrary context.</p>
<p>This evidence leads us to conjecture: <strong>Recurrent models <em>trained in practice</em>
are effectively feed-forward.</strong> This could happen either because truncated
backpropagation time cannot learn patterns significantly longer than $k$ steps,
or, more provocatively, because models <em>trainable by gradient descent</em> cannot
have long-term memory.</p>
<p>In <a href="https://arxiv.org/abs/1805.10369">our recent paper</a>, we study the gap
between recurrent and feed-forward models trained using gradient descent. We
show if the recurrent model is <em>stable</em> (meaning the gradients can not explode),
then the model can be well-approximated by a feed-forward network for the
purposes of both <em>inference and training.</em> In other words, we show feed-forward
and stable recurrent models trained by gradient descent are <em>equivalent</em> in the
sense of making identical predictions at test-time. Of course, not all models
trained in practice are stable. We also give empirical evidence the stability
condition can be imposed on certain recurrent models without loss in
performance.</p>
<h1 id="conclusion">Conclusion</h1>
<p>Despite some initial attempts, there is still much to do to understand
why feed-forward models are competitive with recurrent ones and
shed light onto the trade-offs between sequence models. How much memory is
really needed to perform well on common sequence benchmarks? What are the
expressivity trade-offs between truncated RNNs (which can be considered
feed-forward) and the convolutional models that are in popular use? Why can
feed-forward networks perform as well as unstable RNNs in practice?</p>
<p>Answering these questions is a step towards building a theory that can both
explain the strengths and limitations of our current methods and give guidance
about how to choose between different classes of models in concrete settings.</p>
<div class="footnotes">
<ol>
<li id="fn:1">
<p>The Transformer isn’t strictly a feed-forward model in the style described above (since it doesn’t make the $k$ step conditional independence assumption), but is not really a recurrent model because it doesn’t maintain a hidden state. <a href="#fnref:1" class="reversefootnote">↩</a></p>
</li>
</ol>
</div>
Fri, 27 Jul 2018 01:00:00 -0700
http://offconvex.github.io/2018/07/27/approximating-recurrent/
http://offconvex.github.io/2018/07/27/approximating-recurrent/Deep-learning-free Text and Sentence Embedding, Part 2<p>This post continues <a href="http://www.offconvex.org/2018/06/17/textembeddings/">Sanjeev’s post</a> and describes further attempts to construct elementary and interpretable text embeddings.
The previous post described the <a href="https://openreview.net/pdf?id=SyK00v5xx">the SIF embedding</a>, which uses a simple weighted combination of word embeddings combined with some mild “denoising” based upon singular vectors, yet outperforms many deep learning based methods, including <a href="https://arxiv.org/pdf/1506.06726.pdf">Skipthought</a>, on certain downstream NLP tasks such as sentence semantic similarity and entailment.
See also this <a href="http://nlp.town/blog/sentence-similarity/">independent study by Yves Peirsman</a>.</p>
<p>However, SIF embeddings embeddings ignore word order (similar to classic <em>Bag of Words</em> models in NLP), which leads to unexciting performance on many other downstream classification tasks.
(Even the denoising via SVD, which is crucial in similarity tasks, can sometimes reduces performance on other tasks.)
Can we design a text embedding with the simplicity and transparency of SIF while also incorporating word order information?
Our <a href="https://openreview.net/pdf?id=B1e5ef-C-">ICLR’18 paper</a> with Kiran Vodrahalli does this, and achieves strong empirical performance and also some surprising theoretical guarantees stemming from the <a href="https://en.wikipedia.org/wiki/Compressed_sensing">theory of compressed sensing</a>.
It is competitive with all pre-2018 LSTM-based methods on standard tasks.
Even better, it is much faster to compute, since it uses pretrained (GloVe) word vectors and simple linear algebra.</p>
<p style="text-align:center;">
<img src="/assets/unsupervised_pipeline.png" width="50%" alt="Pipeline" />
</p>
<h2 id="incorporating-local-word-order-n-gram-embeddings">Incorporating local word order: $n$-gram embeddings</h2>
<p><em>Bigrams</em> are ordered word-pairs that appear in the sentence, and $n$-grams are ordered $n$-tuples.
A document with $k$ words has $k-1$ bigrams and $k-n+1$ $n$-grams.
The <em>Bag of n-gram (BonG) representation</em> of a document refers to a long vector whose each entry is indexed by all possible $n$-grams, and contains the number of times the corresponding $n$-gram appears in the document.
Linear classifiers trained on BonG representations are a <a href="https://www.aclweb.org/anthology/P12-2018">surprisingly strong baseline for document classification tasks</a>.
While $n$-grams don’t directly encode long-range dependencies in text, one hopes that a fair bit of such information is implicitly present.</p>
<p>A trivial idea for incorporating $n$-grams into SIF embeddings would be to treat $n$-grams like words, and compute word embeddings for them using either GloVe and word2vec.
This runs into the difficulty that the number of distinct $n$-grams in the corpus gets very large even for $n=2$ (let alone $n=3$), making it almost impossible to solve word2vec or GloVe.
Thus one gravitates towards a more <em>compositional</em> approach.</p>
<blockquote>
<p><strong>Compositional $n$-gram embedding:</strong> Represent $n$-gram $g=(w_1,\dots,w_n)$ as the element-wise product $v_g=v_{w_1}\odot\cdots\odot v_{w_n}$ of the embeddings of its constituent words.</p>
</blockquote>
<p>Note that due to the element-wise multiplication we actually represent unordered $n$-gram information, not ordered $n$-grams (the performance for order-preserving methods is about the same).
Now we are ready to define our <em>Distributed Co-occurrence (DisC) embeddings</em>.</p>
<blockquote>
<p>The <strong>DisC embedding</strong> of a piece of text is just a concatenation for $(v_1, v_2, \ldots)$ where $v_n$ is the sum of the $n$-gram embeddings of all $n$-grams in the document (for $n=1$ this is just the sum of word embeddings).</p>
</blockquote>
<p>Note that DisC embeddings leverage classic Bag-of-n-Gram information as well as the power of word embeddings.
For instance, the sentences <em>“Loved this movie!”</em> and <em>“I enjoyed the film.”</em> share no $n$-gram information for any $n$, but their DisC embeddings are fairly similar.
Thus if the first example comes with a label, it gives the learner some idea of how to classify the second.
This can be useful especially in settings with few labeled examples; e.g. DisC outperform BonG on the Stanford Sentiment Treebank (SST) task, which has only 6,000 labeled examples.
DisC embeddings also beat SIF and a standard LSTM-based method, Skipthoughts.
On the much larger IMDB testbed, BonG still reigns at top (although DisC is not too far behind).</p>
<div style="text-align:center;">
<img src="/assets/clfperf_sst_imdb.png" width="70%" alt="Performance on SST and IMDB" />
</div>
<p>Skip-thoughts does match or beat our DisC embeddings on some other classification tasks, but that’s still not too shabby an outcome for such a simple method. (By contrast, LSTM methods can take days or weeks of training, and are quite slow to evaluate at test time on a new piece of text.)</p>
<div style="text-align:center;">
<img src="/assets/sentenceembedtable.jpg" width="70%" alt="Performance on various classification tasks" />
</div>
<h2 id="some-theoretical-analysis-via-compressed-sensing">Some theoretical analysis via compressed sensing</h2>
<p>A linear SIF-like embedding represents a document with Bag-of-Words vector $x$ as
<script type="math/tex">\sum_w \alpha_w x_w v_w,</script>
where $v_w$ is the embedding of word $w$ and $\alpha_w$ is a scaling term.
In other words, it represents document $x$ as $A x$ where $A$ is the matrix with as many columns as the number of words in the language, and the column corresponding to word $w$ is $\alpha_w A$.
Note that $x$ has many zero coordinates corresponding to words that don’t occur in the document; in other words it’s a <em>sparse</em> vector.</p>
<p>The starting point of our DisC work was the realization that perhaps the reason SIF-like embeddings work reasonably well is that they <em>preserve</em> the Bag-of-words information, in the sense that it may be possible to <em>easily recover</em> $x$ from $A$.
This is not an outlandish conjecture at all, because <a href="https://en.wikipedia.org/wiki/Compressed_sensing"><em>compressed sensing</em></a> does exactly this when $x$ is suitably sparse and matrix $A$ has some nice properties such as RIP or incoherence.
A classic example is when $A$ is a random matrix, which in our case corresponds to using random vectors as word embeddings.
Thus one could try to use random word embeddings instead of GloVe vectors in the construction and see what happens!
Indeed, we find that so long as we raise the dimension of the word embeddings, then text embeddings using random vectors do indeed converge to the performance of BonG representations.</p>
<p>This is a surprising result, as compressed sensing does not imply this per se, since the ability to reconstruct the BoW vector from its compressed version doesn’t directly imply that the compressed version gives the same performance as BoW on linear classification tasks.
However, a result of <a href="https://pdfs.semanticscholar.org/627c/14fe9097d459b8fd47e8a901694198be9d5d.pdf">Calderbank, Jafarpour, & Schapire</a> shows that the compressed sensing condition that implies optimal recovery also implies good performance on linear classification under compression. Intuitively, this happens because of two facts.</p>
<p><script type="math/tex">\mbox{1) Optimum linear classifier $c^*$ is convex combination of datapoints.} \quad c^* = \sum_{i}\alpha_i x_i.</script>
<script type="math/tex">% <![CDATA[
\mbox{(2) RIP condition implies} <Ax, Ax'> \approx <x, x'>~\mbox{for $k$-sparse}~x, x'. %]]></script></p>
<p>Furthermore, by extending these ideas to the $n$-gram case, we show that our DisC embeddings computed using random word vectors, which can be seen as a linear compression of the BonG representation, can do as well as the original BonG representation on linear classification tasks. To do this we prove that the “sensing” matrix $A$ corresponding to DisC embeddings satisfy the <em>Restricted Isometry Property (RIP)</em> introduced in the seminal paper of <a href="https://statweb.stanford.edu/~candes/papers/DecodingLP.pdf">Candes & Tao</a>. The theorem relies upon <a href="http://www.cis.pku.edu.cn/faculty/vision/zlin/A%20Mathematical%20Introduction%20to%20Compressive%20Sensing.pdf">compressed sensing results for bounded orthonormal systems</a> and says that then the performance of DisC embeddings on linear classification tasks approaches that of BonG vectors as we increase the dimension.
Please see our paper for details of the proof.</p>
<p>It is worth noting that our idea of composing objects (words) represented by random vectors to embed structures ($n$-grams/documents) is closely related to ideas in neuroscience and neural coding proposed by <a href="http://www2.fiit.stuba.sk/~kvasnicka/CognitiveScience/6.prednaska/plate.ieee95.pdf">Tony Plate</a> and <a href="http://www.rctn.org/vs265/kanerva09-hyperdimensional.pdf">Pentti Kanerva</a>.
They also were interested in how these objects and structures could be recovered from the representations;
we take the further step of relating the recoverability to performance on a downstream linear classification task.
Text classification over compressed BonG vectors has been proposed before by <a href="https://papers.nips.cc/paper/4932-compressive-feature-learning.pdf">Paskov, West, Mitchell, & Hastie</a>, albeit with a more complicated compression that does not achieve a low-dimensional representation (dimension >100,000) due to the use of classical lossless algorithms rather than linear projection.
Our work ties together these ideas of composition and compression into a simple text representation method with provable guarantees.</p>
<h2 id="a-surprising-lower-bound-on-the-power-of-lstm-based-text-representations">A surprising lower bound on the power of LSTM-based text representations</h2>
<p>The above result also leads to a new theorem about deep learning: <em>text embeddings computed using low-memory LSTMs can do at least as well as BonG representations on downstream classification tasks</em>.
At first glance this result may seem uninteresting: surely it’s no surprise that the field’s latest and greatest method is at least as powerful as its oldest?
But in practice, most papers on LSTM-based text embeddings make it a point to compare to performance of BonG baseline, and <em>often are unable to improve upon that baseline</em>!
Thus empirically this new theorem had not been clear at all! (One reason could be that our theory requires the random embeddings to be somewhat higher dimensional than the LSTM work had considered.)</p>
<p>The new theorem follows from considering an LSTM that uses random vectors as word embeddings and computes the DisC embedding in one pass over the text. (For details see our appendix.)</p>
<p>We empirically tested the effect of dimensionality by measuring performance of DisC on IMDb sentiment classification.
As our theory predicts, the accuracy of DisC using random word embeddings converges to that of BonGs as dimensionality increases. (In the figure below “Rademacher vectors” are those with entries drawn randomly from $\pm1$.) Interestingly we also find that DisC using pretrained word embeddings like GloVe reaches BonG performance at much smaller dimensions, an unsurprising but important point that we will discuss next.</p>
<div style="text-align:center;">
<img src="/assets/imdbperf_uni_bi.png" width="60%" />
</div>
<h2 id="unexplained-mystery-higher-performance-of-pretrained-word-embeddings">Unexplained mystery: higher performance of pretrained word embeddings</h2>
<p>While compressed sensing theory is a good starting point for understanding the power of linear text embeddings, it leaves some mysteries.
Using pre-trained embeddings (such as GloVe) in DisC gives higher performance than random embeddings, both in recovering the BonG information out of the text embedding, as well as in downstream tasks. However, pre-trained embeddings do not satisfy some of the nice properties assumed in compressed sensing theory such as RIP or incoherence, since those properties forbid pairs of words having similar embeddings.</p>
<p>Even though the matrix of embeddings does not satisfy these classical compressed sensing properties, we find that using Basis Pursuit, a sparse recovery approach related to LASSO with provable guarantees for RIP matrices, we can recover Bag-of-Words information better using GloVe-based text embeddings than from embeddings using random word vectors (measuring success via the $F_1$-score of the recovered words — higher is better).</p>
<div style="text-align:center;">
<img src="/assets/recovery.png" width="60%" />
</div>
<p>Note that random embeddings are better than pretrained embeddings at recovering words from random word salad (the right-hand image).
This suggests that pretrained embeddings are specialized — thanks to their training on a text corpus — to do well only on real text rather than a random collection of words.
It would be nice to give a mathematical explanation for this phenomenon.
We suspect that this should be possible using a result of <a href="http://www.pnas.org/content/pnas/102/27/9446.full.pdf">Donoho & Tanner</a>, which we use to show that words in a document can be recovered from the sum of word vectors if and only if there is a hyperplane containing the vectors for words in the document with the vectors for all other words on one side of it.
Since co-occurring words will have similar embeddings, that should make it easier to find such a hyperplane separating words in a document from the rest of the words and hence would ensure good recovery.</p>
<p>However, even if this could be made more rigorous, it would only imply sparse recovery, not good performance on classification tasks.
Perhaps assuming a generative model for text, like the RandWalk model discussed in an <a href="https://www.offconvex.org/2016/02/14/word-embeddings-2/">earlier post</a>, could help move this theory forward.</p>
<h2 id="discussion">Discussion</h2>
<p>Could we improve the performance of such simple embeddings even further?
One promising idea is to define better $n$-gram embeddings than the simple compositional embeddings defined in DisC.
An independent <a href="https://arxiv.org/abs/1703.02507">NAACL’18 paper</a> of Pagliardini, Gupta, & Jaggi proposes a text embedding similar to DisC in which unigram and bigram embeddings are trained specifically to be added together to form sentence embeddings, also achieving good results, albeit not as good as DisC.
(Of course, their training time is higher than ours.)
In our upcoming <a href="https://arxiv.org/abs/1805.05388">ACL’18 paper</a> with Yingyu Liang, Tengyu Ma, & Brandon Stewart we give a very simple and efficient method to induce embeddings for $n$-grams as well as other rare linguistic features that improves upon DisC and beats skipthought on several other benchmarks.
This will be described in a future blog post.</p>
<p>Sample code for constructing and evaluating DisC embeddings is <a href="https://github.com/NLPrinceton/text_embedding">available</a>, as well as <a href="https://github.com/NLPrinceton/sparse_recovery">solvers</a> for recreating the sparse recovery results for word embeddings.</p>
Mon, 25 Jun 2018 03:00:00 -0700
http://offconvex.github.io/2018/06/25/textembeddings/
http://offconvex.github.io/2018/06/25/textembeddings/Deep-learning-free Text and Sentence Embedding, Part 1<p>Word embeddings (see my old <a href="http://www.offconvex.org/2015/12/12/word-embeddings-1/">post1</a> and
<a href="http://www.offconvex.org/2016/02/14/word-embeddings-2/">post2</a>) capture the idea that one can express “meaning” of words using a vector, so that the cosine of the angle between the vectors captures semantic similarity. (“Cosine similarity” property.) Sentence embeddings and text embeddings try to achieve something similar: use a fixed-dimensional vector to represent a small piece of text, say a sentence or a small paragraph. The performance of such embeddings can be tested via the Sentence Textual Similarity (STS) datasets (see the <a href="http://ixa2.si.ehu.es/stswiki/index.php/Main_Page">wiki page</a>), which contain sentence pairs humanly-labeled with similarity ratings.</p>
<p style="text-align:center;">
<img src="/assets/textembeddingvectorslide.jpg" width="30%" alt="What are text embeddings." />
</p>
<p>A general hope behind computing text embeddings is that they can be learnt using a large <em>unlabeled</em> text corpus (similar to word embeddings) and then allow good performance on downstream classification tasks with few <em>labeled</em> examples. Thus the overall pipeline could look like this:</p>
<p style="text-align:center;">
<img src="/assets/textembeddingpipeline.jpg" width="80%" alt="How are text embeddings used in downstream classification task." />
</p>
<p>Computing such representations is a form of <a href="http://www.offconvex.org/2017/06/26/unsupervised1/">representation learning as well as unsupervised learning</a>. This post will be an introduction to <strong>extremely simple</strong> ways of computing sentence embeddings, which on many standard tasks, beat many state-of-the-art deep learning methods. This post is based upon <a href="https://openreview.net/pdf?id=SyK00v5xx">my ICLR’17 paper on SIF embeddings</a> with Yingyu Liang and Tengyu Ma.</p>
<h2 id="existing-methods">Existing methods</h2>
<p><a href="https://dl.acm.org/citation.cfm?id=2133826">Topic modeling</a> is a classic technique for unsupervised learning on text and it also yields a vector representation for a paragraph (or longer document), specifically, the vector of “topics” occuring in this document and their relative proportions. Unfortunately, topic modeling is not accurate at producing good representations at the sentence or short paragraph level, and furthermore there appears to be no variant of topic modeling that leads to the good cosine similarity property that we desire.</p>
<p><em>Recurrent neural net</em> is the default deep learning technique to train a <a href="https://www.tensorflow.org/tutorials/recurrent">language model</a>. It scans the text from left to right, maintaining a fixed-dimensional vector-representation of the text it has seen so far. It’s goal is to use this representation to predict the next word at each time step, and the training objective is to maximise log-likelihood of the data (or similar). Thus for example, a well-trained model when given a text fragment <em>“I went to the cafe and ordered a ….”</em> would assign high probability to <em>“coffee”, “croissant”</em> etc. and low probability to <em>“puppy”</em>. Myriad variations of such language models exist, many using biLSTMs which have some long-term memory and can scan the text forward and backwards. Lately biLSTMs have been replaced by convolutional architectures with attention mechanism; see for instance <a href="http://papers.nips.cc/paper/7181-attention-is-all-you-need.pdf">this paper</a>.</p>
<p>One obtains a text representation by peeking at the internal representation (i.e., node activations) at the top layer of this deep model. After all, when the model is scanning through text, its ability to predict the next word must imply that this internal representation implicitly captures a gist of all it has seen, reflecting rules of grammar, common-sense etc. (e.g., that you don’t order a puppy at a cafe). Some notable modern efforts along such lines are <a href="https://arxiv.org/abs/1506.01057">Hierarchichal Neural Autoencoder of Li et al.</a> as well as <a href="https://arxiv.org/abs/1502.06922">Palangi et al</a>, and <a href="https://arxiv.org/abs/1506.06726"><em>Skipthought</em> of Kiros et al.</a>.</p>
<p>As with all deep learning models, one wishes for interpretability: what information exactly did the machine choose to put into the text embedding? Besides <a href="https://people.csail.mit.edu/beenkim/papers/BeenK_FinaleDV_ICML2017_tutorial.pdf">the usual reasons for seeking interpretability</a>, in an NLP context it may help us leverage additional external resources such as <a href="https://wordnet.princeton.edu/">WordNet</a> in the task. Other motivations include
transfer learning/domain adaptation (to solve classification tasks for a small text corpus, leverage text embeddings trained on a large unrelated corpus).</p>
<h2 id="surprising-power-of-simple-linear-representations">Surprising power of simple linear representations</h2>
<p>In practice, many NLP applications rely on a simple sentence embedding: the average of the embeddings of the words in it. This makes some intuitive sense, because recall that the <a href="https://arxiv.org/pdf/1310.4546.pdf">Word2Vec paper</a> uses the following expression (in the their simpler CBOW word embedding)</p>
<script type="math/tex; mode=display">\Pr[w~|~w_1,w_2, w_3, w_4, w_5] \propto \exp(v_w \cdot (\frac{1}{5} \sum_i v_{w_i}). \qquad (1)</script>
<p>which suggests that the sense of a sequence of words is captured via simple average of word vectors.</p>
<p>While this simple average has only fair performance in capturing sentence similarity via cosine similarity, it can be quite powerful in downstream classification tasks (after passing through a single layer neural net) as shown in a
surprising paper of <a href="https://arxiv.org/abs/1511.08198">Wieting et al. ICLR’16</a>.</p>
<h2 id="better-linear-representation-sif-embeddings">Better linear representation: SIF embeddings</h2>
<p>My <a href="https://openreview.net/pdf?id=SyK00v5xx">ICLR’17 paper</a> with Yingyu Liang and Tengyu Ma improved such simple averaging using our <strong>SIF</strong> embeddings. They’re motivated by the empirical observation that word embeddings have various pecularities stemming from the training method, which tries to capture word cooccurence probabilities using vector inner product, and words sometimes occur out of context in documents. These anomalies cause the average of word vectors to have nontrivial components along semantically meaningless directions. SIF embeddings try to combat this in two ways, which I describe intuitively first, followed by more theoretical justification.</p>
<p><strong>Idea 1: Nonuniform weighting of words.</strong>
Conventional wisdom in information retrieval holds that “frequent words carry less signal.” Usually this is captured via <a href="https://en.wikipedia.org/wiki/Tf%E2%80%93idf">TF-IDF weighting</a>, which assigns weightings to words inversely proportional to their frequency. We introduce a new variant we call <em>Smoothed Inverse Frequency</em> (SIF) weighting,
which assigns to word $w$ a weighting $\alpha_w = a/(a+ p_w)$ where $p_w$ is the frequency of $w$ in the corpus and $a$ is a hyperparameter. Thus the embedding of a piece of text is $\sum_w \alpha_w v_w$ where the sum is over words in it.
(Aside: word frequencies can be estimated from any sufficiently large corpus; we find embedding quality to be not too dependent upon this.)</p>
<p>On a related note, we found that folklore understanding of word2vec, viz., expression (1), is <em>false.</em> A dig into the code reveals a resampling trick that is tantamount to a weighted average quite similar to our SIF weighting. (See Section 3.1 in our paper for a discussion.)</p>
<p><strong>Idea 2: Remove component from top singular direction.</strong>
The next idea is to modify the above weighted average by removing the component in a special direction, corresponding to the top singular direction set of weighted embeddings of a smallish sample of sentences from the domain (if doing domain adaptation, component is computed using sentences of the target domain). The paper notes that the direction corresponding to the top singular vector tends to contain information related to grammar and stop words, and removing the component in this subspace really cleans up the text embedding’s ability to express meaning.</p>
<h2 id="theoretical-justification">Theoretical justification</h2>
<p>A notable part of our paper is to give a theoretical justification for this weighting using a generative model for text similar to one used in our <a href="http://aclweb.org/anthology/Q16-1028">word embedding paper in TACL’16</a> as described in <a href="http://www.offconvex.org/2016/02/14/word-embeddings-2/">my old post</a>.
That model tries to give the causative relationship between word meanings and their cooccurence probabilities. It thinks of corpus generation as a dynamic process, where the $t$-th word is produced at step $t$. The model says that the process is driven by the random walk of a <em>discourse</em> vector $c_t \in \Re^d$. It is a unit vector whose direction in space represents <em>what is being talked about.</em>
Each word has a (time-invariant) latent vector $v_w \in \Re^d$ that captures its correlations with the discourse vector. We model this bias with a loglinear word production model:</p>
<script type="math/tex; mode=display">\Pr[w~\mbox{emitted at time $t$}~|~c_t] \propto \exp(c_t\cdot v_w). \qquad (2)</script>
<p>The discourse vector does a slow geometric random walk over the unit sphere in $\Re^d$. Thus $c_{t+1}$ is obtained by a small random displacement from $c_t$. Since expression (2) places much higher probability on words that are clustered around $c_t$, and $c_t$ moves slowly. If the discourse vector moves slowly, then we can assume a single discourse vector gave rise to the entire sentence or short paragraph. Thus given a sentence, a plausible vector representation of its “meaning” is a <em>max a posteriori</em> (MAP) estimate of the discourse vector that generated it.</p>
<p>Such models have been empirically studied for a while, but our paper gave a theoretical analysis, and showed that various subcases imply standard word embedding methods such as word2vec and GloVe. For example, it shows that MAP estimate of the discourse vector is the simple average of the embeddings of the preceding $k$ words – in other words, the average word vector!</p>
<p>This model is clearly simplistic and our ICLR’17 paper suggests two correction terms, intended to account for words occuring out of context, and to allow some common words (<em>“the”, “and”, “but”</em> etc.) appear often regardless of the discourse. We first introduce an additive term $\alpha p(w)$ in the log-linear model, where $p(w)$ is the unigram probability (in the entire corpus) of word and $\alpha$ is a scalar. This allows words to occur even if their vectors have very low inner products with $c_s$.
Secondly, we introduce a common discourse vector $c_0\in \Re^d$ which serves as a correction term for the most frequent discourse that is often related to syntax. It boosts the co-occurrence probability of words that have a high component along $c_0$.(One could make other correction terms, which are left to future work.) To put it another way, words that need to appear a lot out of context can do so by having a component along $c_0$, and the size of this component controls its probability of appearance out of context.</p>
<p>Concretely, given the discourse vector $c_s$ that produces sentence $s$, the probability of a word $w$ is emitted in the sentence $s$ is modeled as follows, where $\tilde{c}_{s} = \beta c_0 + (1-\beta) c_s, c_0 \perp c_s$,
$\alpha$ and $\beta$ are scalar hyperparameters:</p>
<script type="math/tex; mode=display">% <![CDATA[
\Pr[w \mid s] = \alpha p(w) + (1-\alpha) \frac{\exp(<\tilde{c}_{s}, v_w>)}{Z_{\tilde{c,s}}}, %]]></script>
<p>where</p>
<script type="math/tex; mode=display">% <![CDATA[
Z_{\tilde{c,s}} = \sum_{w} \exp(<\tilde{c}_{s}, v_w>) %]]></script>
<p>is the normalizing constant (the partition function). We see that the model allows a word $w$ unrelated to the discourse $c_s$ to be emitted for two reasons: a) by chance from the term $\alpha p(w)$; b) if $w$ is correlated with the common direction $c_0$.</p>
<p>The paper shows that the MAP estimate of the $c_s$ vector corresponds to the SIF embeddings described earlier, where the top singular vector used in their construction is an estimate of the $c_0$ vector in the model.</p>
<h2 id="empirical-performance">Empirical performance</h2>
<p>The performance of this embedding scheme appears in the figure below. Note that Wieting et al. had already shown that their method (which is semi-supervised, relying upon a large unannotated corpus and a small annotated corpus) beats many LSTM-based methods. So this table only compares to their work; see the papers for comparison with more past work.</p>
<p style="text-align:center;">
<img src="/assets/textembedexperiments.jpg" width="80%" alt="Performance of our embedding on downstream classification tasks" />
</p>
<p>For other performance results please see the paper.</p>
<h2 id="next-post">Next post</h2>
<p>In the next post, I will sketch improvements to the above embedding in two of our new papers. Special guest appearance: Compressed Sensing (aka Sparse Recovery).</p>
<p>The SIF embedding package is available <a href="https://github.com/PrincetonML/SIF">from our github page</a></p>
Sun, 17 Jun 2018 03:00:00 -0700
http://offconvex.github.io/2018/06/17/textembeddings/
http://offconvex.github.io/2018/06/17/textembeddings/Limitations of Encoder-Decoder GAN architectures<p>This is yet another post about <a href="http://www.offconvex.org/2017/03/15/GANs/">Generative Adversarial Nets (GANs)</a>, and based upon our new <a href="https://openreview.net/forum?id=BJehNfW0-">ICLR’18 paper</a> with Yi Zhang. A quick recap of the story so far. GANs are an unsupervised method in deep learning to learn interesting distributions (e.g., images of human faces), and also have a plethora of uses for image-to-image mappings in computer vision. Standard GANs training is motivated using this task of distribution learning, and is designed with the idea that given large enough deep nets and enough training examples, as well as accurate optimization, GANs will learn the full distribution.</p>
<p><a href="http://www.offconvex.org/2017/03/30/GANs2/">Sanjeev’s previous post</a> concerned <a href="https://arxiv.org/abs/1703.00573">his co-authored ICML’17 paper</a> which called this intuition into question when the deep nets have finite capacity. It shows that the training objective has near-equilibria where the discriminator is fooled —i.e., training objective is good—but the generator’s distributions has very small support, i.e. shows <em>mode collapse.</em> This is a failure of the model, and raises the question whether such bad equilibria are found in real-life training. A <a href="http://www.offconvex.org/2017/07/07/GANs3/">second post</a> showed empirical evidence that they do, using the birthday-paradox test.</p>
<p>The current post concerns our <a href="https://arxiv.org/abs/1711.02651">new result</a> (part of our upcoming <a href="https://openreview.net/forum?id=BJehNfW0-">ICLR paper</a>) which shows that bad equilibria exist also in more recent GAN architectures based on simultaneously learning an <em>encoder</em> and <em>decoder</em>. This should be surprising because many researchers believe that encoder-decoder architectures fix many issues with GANs, including mode collapse.</p>
<p>As we will see, encoder-decoder GANs seem very powerful. In particular, the proof of the previously mentioned <a href="http://www.offconvex.org/2017/03/30/GANs2/">negative result</a> utterly breaks down for this architecture. But, we then discovered a cute argument that shows encoder-decoder GANs can have poor solutions, featuring not only mode collapse but also encoders that map images to nonsense (more precisely Gaussian noise). This is the worst possible failure of the model one could imagine.</p>
<h2 id="encoder-decoder-architectures">Encoder-decoder architectures</h2>
<p>Encoders and decoders have long been around in machine learning in various forms – especially deep learning. Speaking loosely, underlying all of them are two basic assumptions: <br />
(1) Some form of the so-called <a href="https://mitpress.mit.edu/sites/default/files/titles/content/9780262033589_sch_0001.pdf"><em>manifold assumption</em></a> which asserts that high-dimensional data such as real-life images lie (roughly) on a low-dimensional manifold. (“Manifold” should be interpreted rather informally – sometimes this intuition applies only very approximately sometimes it’s meant in a “distributional” sense, etc.) <br />
(2) The low-dimensional structure is “meaningful”: if we think of an image $x$ as a high-dimensional vector and its “code” $z$ as its coordinates on the low-dimensional manifold, the code $z$ is thought of as a “high-level” descriptor of the image.</p>
<p>With the above two points in mind, an <em>encoder</em> maps the image to its code, and a <em>decoder</em> computes the reverse map. (We also discussed encoders and decoders in <a href="http://www.offconvex.org/2017/06/27/unsupervised1/">our earlier post on representation learning</a> in a more general setup.)</p>
<p style="text-align:center;">
<img src="/assets/BIGAN_manifold2.jpg" width="80%" alt="Manifold structure" />
</p>
<h2 id="encoder-decoder-gans">Encoder-Decoder GANs</h2>
<p>These were introduced by <a href="https://arxiv.org/abs/1606.00704">Dumoulin et al.(ALI)</a> and <a href="https://arxiv.org/abs/1605.09782">Donahue et al.(BiGAN)</a>. They involve two competitors: Player 1 involves a discriminator net $D$ that is given an input of the form (image, code) and it outputs a number in the interval $[0,1]$, which denotes its “satisfaction level” with this input. Player 2 trains a decoder net $G$ (also called <em>generator</em> in the GANs setting) and an encoder net $E$.</p>
<p>Recall that in the standard GAN, discriminator tries to distinguish real images from images generated by the generator $G$. Here
discriminator’s input is an image and its code. Specifically, Player 1 is trying to train its net to distinguish between the following two settings, and Player 2 is trying to make sure the two settings look indistinguishable to Player 1’s net.</p>
<p><script type="math/tex">\mbox{Setting 1: presented with}~(x, E(x))~\mbox{where $x$ is random real image}.</script>
<script type="math/tex">\mbox{Setting 2: presented with}~(G(z), z)~\mbox{where $z$ is random code}.</script></p>
<p>(Here it is assumed that a random code is a vector with i.i.d gaussian coordinates, though one could consider other distributions.)</p>
<p style="text-align:center;">
<img src="/assets/BIGAN_2settings_v2.jpg" width="80%" alt="Two settings which discriminator net has to distinguish between" />
</p>
<p>The hoped-for equilibrium obviously is one where generator and encoder are inverses of each other: $E(G(z)) \approx z$ and $G(E(x)) \approx x$, and the joint distributions $(z,G(z))$ and $(E(x), x)$ roughly match.
The underlying intuition is that if this happens, Player 1 must’ve produced a “meaningful” representation $E(x)$ for the images – and this should improve the quality of the generator as well.
Indeed, <a href="https://arxiv.org/abs/1606.00704">Dumoulin et al.(ALI)</a> provide some small-scale empirical examples on mixtures of Gaussians for which encoder-decoder architectures seem to ameliorate the problem of mode collapse.</p>
<p>The above papers prove that when the encoder/decoder/discriminator have infinite capacity, the desired solution is indeed an equilibrium. However, we’ll see that things are very different when capacities are finite.</p>
<h2 id="finite-capacity-discriminators-are-weak">Finite-capacity discriminators are weak</h2>
<p>Say a generator/encoder pair $(G,E)$ $\epsilon$-<em>fools</em> a decoder $D$ if</p>
<script type="math/tex; mode=display">|E_{x} D(x, E(x)) - E_{z} D(G(z), z)| \leq \epsilon</script>
<p>In other words, $D$ has roughly similar output in Settings 1 and 2.</p>
<p>Our theorem applies when the distribution consists of realistic images, as explained later. We show the following:</p>
<blockquote>
<p>(Informal theorem) If the discriminator $D$ has capacity (i.e. number of parameters) at most $p$, then there is an encoder $E$ of capacity $\ll p$ and generator $G$ of slightly larger capacity than $p$ such that $(G, E)$ can $\epsilon$-fool every such $D$. Furthermore, the generator exhibits mode collapse: its distribution is essentially supported on a bit more than $p$ images, and the encoder $E$ just outputs white noise (i.e. does not extract any “meaningful” representation) given an image.</p>
</blockquote>
<p>(Note that such a $(G, E)$ represents an $\epsilon$-approximate equilibrium, in the sense that player 1 cannot gain more than $\epsilon$ in the distinguishing probability by switching its discriminator. )</p>
<p>It is important that the encoder’s capacity is much less than $p$, and thus the theorem allows a discriminator that is able to simulate $E$ if it needed, and in particular verify for a random seed $z$ that $E(G(z)) \approx z$. The theorem says that even the ability to conduct such a verification cannot give it power to force encoder to produce meaningful codes. This is a counterintuitive aspect of the result. The main difficulty in the proof (which stumped us for a bit) was how to exhibit such an equilibrium where $E$ is a small net.</p>
<p>This is ensured by a simple assumption. We assume the image distribution is mildly “noised”: say, every 100th pixel is replaced by Gaussian noise. To a human, such an image would of course be indistinguishable from a real image. (NB: Our proof could be carried out via some other assumptions to the effect that images have an innate stochastic/noise component that is efficiently extractable by a small neural network. But let’s keep things clean.) When noise $\eta$ is thus added to an image $x$, we denote the resulting image as $x \odot \eta$.</p>
<p>Now the encoder will be rather trivial: given the noised image $x \odot \eta$, output $\eta$. Clearly, such an encoder does not in any sense capture “meaning” in the image. It is also implementable by a tiny single-layer net, as required by the theorem.</p>
<h3 id="construction-of-generator">Construction of generator</h3>
<p>As usual in the GAN literature, we will assume the discriminator is $L$-<a href="https://www.encyclopediaofmath.org/index.php/Lipschitz_constant">Lipschitz</a>. This can be a loose upperbound, since only $\log L$ enters quantitatively in the proof.</p>
<p>The generator $G(z)$ in the theorem statement memorizes a hash function that partitions the set of all seeds/codes $z$ into $m$ equal-sized blocks; it also memorizes a “pool” of $m := p \log^2(pL)/ \epsilon^2$ unnoised images $\tilde{x}_1, \tilde{x}_2, \dots, \tilde{x}_m$. When presented with a random seed $z$, the generator computes the block of the partition that $z$ lies in, and then produces the image $\tilde{x}_i \odot z$, where $i$ is the block $z$ belongs to. (See the Figure below.)</p>
<p style="text-align:center;">
<img src="/assets/BIGAN_construction_2.jpg" width="50%" alt="The bad generator construction" />
</p>
<p>Now we have to prove that such a memorizing generator exists that $\epsilon$-fools all discriminators of capacity $p$. This is shown by the <a href="https://en.wikipedia.org/wiki/Probabilistic_method">probabilistic method</a>: we describe a distribution over generators $G$ that works “in expectation”, and subsequently use concentration bounds to prove there exists at least one generator that does the job.</p>
<p>The distribution on $G$’s is straightforward: we select the pool of (unnoised) images
$\tilde{x}_1, \tilde{x}_2, .., \tilde{x}_m$ at random. Why is this distribution for $G$ sensible? Notice the following simple fact:</p>
<script type="math/tex; mode=display">E_{G} E_{z} D(G(z), z) = E_{\tilde{x}, z} D(\tilde{x} \odot z, z) = E_{x} D(x, E(x)) \hspace{2cm} (3)</script>
<p>In other words, the “expected” encoder correctly matches the expectation of $D(x, E(x))$, so that the discriminator is fooled “in expectation”.
This of course is not enough: we need some kind of concentration argument to show a particular $G$ works against <em>all possible discriminators</em>, which will ultimately use the fact that the discriminator $D$ has a small capacity and small Lipschitz constant. (Think covering number arguments in learning theory.)</p>
<p>Towards that, another useful observation: if $q$ is the uniform distribution over sets $T= {z_1, z_2,\dots, z_m}$, s.t. each $z_i$ is independently sampled from the conditional distribution inside the $i$-th block of the partition of the noise space, by the law of total expectation one can see that
<script type="math/tex">E_{z} D(G(z), z) = E_{T \sim q} \frac{1}{m} \sum_{i=1}^m D(G(z_i), z_i)</script>
The right hand side is an average of terms, each of which is a bounded function of mutually independent random variables – so, by e.g. McDiarmid’s inequality it concentrates around it’s expectation, which by (3) is exactly $E_{z} D(G(z), z)$.</p>
<p>To finish the argument off, we use the fact that due to Lipschitzness and the bound on the number of parameters, the “effective” number of distinct discriminators is small, so we can union bound over them. (Formally, this translates to an epsilon-net + union bound argument. This also gives rise to the value of $m$ used in the construction.)</p>
<h2 id="takeaway">Takeaway</h2>
<p>The result should be interpreted as saying that possibly the theoretical foundations of GANs need more work. The current way of thinking about them as distribution learners may not be the right way to formalize them. Furthermore, one has to take care about transfering notions invented for distribution learning, such as encoders and decoders, over into the GANs setting. Finally there is an empirical question whether any of the <a href="https://deephunt.in/the-gan-zoo-79597dc8c347">myriad GANS variations</a> can avoid mode collapse.</p>
Mon, 12 Mar 2018 03:00:00 -0700
http://offconvex.github.io/2018/03/12/bigan/
http://offconvex.github.io/2018/03/12/bigan/Can increasing depth serve to accelerate optimization?<p>“How does depth help?” is a fundamental question in the theory of deep learning. Conventional wisdom, backed by theoretical studies (e.g. <a href="http://proceedings.mlr.press/v49/eldan16.pdf">Eldan & Shamir 2016</a>; <a href="http://proceedings.mlr.press/v70/raghu17a/raghu17a.pdf">Raghu et al. 2017</a>; <a href="http://proceedings.mlr.press/v65/lee17a/lee17a.pdf">Lee et al. 2017</a>; <a href="http://proceedings.mlr.press/v49/cohen16.pdf">Cohen et al. 2016</a>; <a href="http://proceedings.mlr.press/v65/daniely17a/daniely17a.pdf">Daniely 2017</a>; <a href="https://openreview.net/pdf?id=B1J_rgWRW">Arora et al. 2018</a>), holds that adding layers increases expressive power. But often this expressive gain comes at a price –optimization is harder for deeper networks (viz., <a href="https://en.wikipedia.org/wiki/Vanishing_gradient_problem">vanishing/exploding gradients</a>). Recent works on “landscape characterization” implicitly adopt this worldview (e.g. <a href="https://papers.nips.cc/paper/6112-deep-learning-without-poor-local-minima.pdf">Kawaguchi 2016</a>; <a href="https://openreview.net/pdf?id=ryxB0Rtxx">Hardt & Ma 2017</a>; <a href="http://proceedings.mlr.press/v38/choromanska15.pdf">Choromanska et al. 2015</a>; <a href="http://openaccess.thecvf.com/content_cvpr_2017/papers/Haeffele_Global_Optimality_in_CVPR_2017_paper.pdf">Haeffele & Vidal 2017</a>; <a href="https://arxiv.org/pdf/1605.08361.pdf">Soudry & Carmon 2016</a>; <a href="https://arxiv.org/pdf/1712.08968.pdf">Safran & Shamir 2017</a>). They prove theorems about local minima and/or saddle points in the objective of a deep network, while implicitly assuming that the ideal landscape would be convex (single global minimum, no other critical point). My <a href="https://arxiv.org/pdf/1802.06509.pdf">new paper</a> with Sanjeev Arora and Elad Hazan makes the counterintuitive suggestion that sometimes, increasing depth can <em>accelerate</em> optimization.</p>
<p>Our work can also be seen as one more piece of evidence for a nascent belief that <em>overparameterization</em> of deep nets may be a good thing. By contrast, classical statistics discourages training a model with more parameters than necessary <a href="https://www.rasch.org/rmt/rmt222b.htm">as this can lead to overfitting</a>.</p>
<h2 id="ell_p-regression">$\ell_p$ Regression</h2>
<p>Let’s begin by considering a very simple learning problem - scalar linear regression with $\ell_p$ loss (our theory and experiments will apply to $p>2$):</p>
<script type="math/tex; mode=display">\min_{\mathbf{w}}~L(\mathbf{w}):=\frac{1}{p}\sum_{(\mathbf{x},y)\in{S}}(\mathbf{x}^\top\mathbf{w}-y)^p</script>
<p>$S$ here stands for a training set, consisting of pairs $(\mathbf{x},y)$ where $\mathbf{x}$ is a vector representing an instance and $y$ is a (numeric) scalar standing for its label; $\mathbf{w}$ is the parameter vector we wish to learn. Let’s convert the linear model to an extremely simple “depth-2 network”, by replacing the vector $\mathbf{w}$ with a vector $\mathbf{w_1}$ times a scalar $\omega_2$. Clearly, this is an overparameterization that does not change expressiveness, but yields the (non-convex) objective:</p>
<script type="math/tex; mode=display">\min_{\mathbf{w_1},\omega_2}~L(\mathbf{w_1},\omega_2):=\frac{1}{p}\sum_{(\mathbf{x},y)\in{S}}(\mathbf{x}^\top\mathbf{w_1}\omega_2-y)^p</script>
<p>We show in the paper, that if one applies gradient descent over $\mathbf{w_1}$ and $\omega_2$, with small learning rate and near-zero initialization (as customary in deep learning), the induced dynamics on the overall (<em>end-to-end</em>) model $\mathbf{w}=\mathbf{w_1}\omega_2$ can be written as follows:</p>
<script type="math/tex; mode=display">\mathbf{w}^{(t+1)}\leftarrow\mathbf{w}^{(t)}-\rho^{(t)}\nabla{L}(\mathbf{w}^{(t)})-\sum_{\tau=1}^{t-1}\mu^{(t,\tau)}\nabla{L}(\mathbf{w}^{(\tau)})</script>
<p>where $\rho^{(t)}$ and $\mu^{(t,\tau)}$ are appropriately defined (time-dependent) coefficients.
Thus the seemingly benign addition of a single multiplicative scalar turned plain gradient descent into a scheme that somehow has a memory of past gradients —the key feature of <a href="https://distill.pub/2017/momentum/">momentum</a> methods— as well as a time-varying learning rate. While theoretical analysis of the precise benefit of momentum methods is never easy, a simple experiment with $p=4$, on <a href="https://archive.ics.uci.edu/ml/index.php">UCI Machine Learning Repository</a>’s <a href="https://archive.ics.uci.edu/ml/datasets/gas+sensor+array+drift+dataset">“Gas Sensor Array Drift at Different Concentrations” dataset</a>, shows the following effect:</p>
<p style="text-align:center;">
<img src="/assets/acc_oprm_L4_exp.png" width="40%" alt="L4 regression experiment" />
</p>
<p>Not only did the overparameterization accelerate gradient descent, but it has done so more than two well-known, explicitly designed acceleration methods – <a href="http://www.jmlr.org/papers/volume12/duchi11a/duchi11a.pdf">AdaGrad</a> and <a href="https://arxiv.org/pdf/1212.5701.pdf">AdaDelta</a> (the former did not really provide a speedup in this experiment). We observed similar speedups in other settings as well.</p>
<p>What is happening here? Can non-convex objectives corresponding to deep networks be easier to optimize than convex ones?
Is this phenomenon common or is it limited to toy problems as above?
We take a first crack at addressing these questions…</p>
<h2 id="overparameterization-decoupling-optimization-from-expressiveness">Overparameterization: Decoupling Optimization from Expressiveness</h2>
<p>A general study of the effect of depth on optimization entails an inherent difficulty - deeper networks may seem to converge faster due to their superior expressiveness.
In other words, if optimization of a deep network progresses more rapidly than that of a shallow one, it may not be obvious whether this is a result of a true acceleration phenomenon, or simply a byproduct of the fact that the shallow model cannot reach the same loss as the deep one.
We resolve this conundrum by focusing on models whose representational capacity is oblivious to depth - <em>linear neural networks</em>, the subject of many recent studies.
With linear networks, adding layers does not alter expressiveness; it manifests itself only in the replacement of a matrix parameter by a product of matrices - an overparameterization.
Accordingly, if this leads to accelerated convergence, one can be certain that it is not an outcome of any phenomenon other than favorable properties of depth for optimization.</p>
<h2 id="implicit-dynamics-of-depth">Implicit Dynamics of Depth</h2>
<p>Suppose we are interested in learning a linear model parameterized by a matrix $W$, through minimization of some training loss $L(W)$.
Instead of working directly with $W$, we replace it by a depth $N$ linear neural network, i.e. we overparameterize it as $W=W_{N}W_{N-1}\cdots{W_1}$, with $W_j$ being weight matrices of individual layers.
In the paper we show that if one applies gradient descent over $W_{1}\ldots{W}_N$, with small learning rate $\eta$, and with the condition:</p>
<script type="math/tex; mode=display">W_{j+1}^\top W_{j+1} = W_j W_j^\top</script>
<p>satisfied at optimization commencement (note that this approximately holds with standard near-zero initialization), the dynamics induced on the overall end-to-end mapping $W$ can be written as follows:</p>
<script type="math/tex; mode=display">W^{(t+1)}\leftarrow{W}^{(t)}-\eta\sum_{j=1}^{N}\left[W^{(t)}(W^{(t)})^\top\right]^\frac{j-1}{N}\nabla{L}(W^{(t)})\left[(W^{(t)})^\top{W}^{(t)}\right]^\frac{N-j}{N}</script>
<p>We validate empirically that this analytically derived update rule (over classic linear model) indeed complies with deep network optimization, and take a series of steps to theoretically interpret it.
We find that the transformation applied to the gradient $\nabla{L}(W)$ (multiplication from the left by $[WW^\top]^\frac{j-1}{N}$, and from the right by $[W^\top{W}]^\frac{N-j}{N}$, followed by summation over $j$) is a particular preconditioning scheme, that promotes movement along directions already taken by optimization.
More concretely, the preconditioning can be seen as a combination of two elements:</p>
<ul>
<li>an adaptive learning rate that increases step sizes away from initialization; and</li>
<li>a “momentum-like” operation that stretches the gradient along the azimuth taken so far.</li>
</ul>
<p>An important point to make is that the update rule above, referred to hereafter as the <em>end-to-end update rule</em>, does not depend on widths of hidden layers in the linear neural network, only on its depth ($N$).
This implies that from an optimization perspective, overparameterizing using wide or narrow networks has the same effect - it is only the number of layers that matters.
Therefore, acceleration by depth need not be computationally demanding - a fact we clearly observe in our experiments (previous figure for example shows acceleration by orders of magnitude at the price of a single extra scalar parameter).</p>
<p style="text-align:center;">
<img src="/assets/acc_oprm_update_rule.png" width="60%" alt="End-to-end update rule" />
</p>
<h2 id="beyond-regularization">Beyond Regularization</h2>
<p>The end-to-end update rule defines an optimization scheme whose steps are a function of the gradient $\nabla{L}(W)$ and the parameter $W$.
As opposed to many acceleration methods (e.g. <a href="https://distill.pub/2017/momentum/">momentum</a> or <a href="https://arxiv.org/pdf/1412.6980.pdf">Adam</a>) that explicitly maintain auxiliary variables, this scheme is memoryless, and by definition born from gradient descent over something (overparameterized objective).
It is therefore natural to ask if we can represent the end-to-end update rule as gradient descent over some regularization of the loss $L(W)$, i.e. over some function of $W$.
We prove, somewhat surprisingly, that the answer is almost always negative - as long as the loss $L(W)$ does not have a critical point at $W=0$, the end-to-end update rule, i.e. the effect of overparameterization, cannot be attained via <em>any</em> regularizer.</p>
<h2 id="acceleration">Acceleration</h2>
<p>So far, we analyzed the effect of depth (in the form of overparameterization) on optimization by presenting an equivalent preconditioning scheme and discussing some of its properties.
We have not, however, provided any theoretical evidence in support of acceleration (faster convergence) resulting from this scheme.
Full characterization of the scenarios in which there is a speedup goes beyond the scope of our paper.
Nonetheless, we do analyze a simple $\ell_p$ regression problem, and find that whether or not increasing depth accelerates depends on the choice of $p$:
for $p=2$ (square loss) adding layers does not lead to a speedup (in accordance with previous findings by <a href="https://arxiv.org/pdf/1312.6120.pdf">Saxe et al. 2014</a>);
for $p>2$ it can, and this may be attributed to the preconditioning scheme’s ability to handle large plateaus in the objective landscape.
A number of experiments, with $p$ equal to 2 and 4, and depths ranging between 1 (classic linear model) and 8, support this conclusion.</p>
<h2 id="non-linear-experiment">Non-Linear Experiment</h2>
<p>As a final test, we evaluated the effect of overparameterization on optimization in a non-idealized (yet simple) deep learning setting - the <a href="https://github.com/tensorflow/models/tree/master/tutorials/image/mnist">convolutional network tutorial for MNIST built into TensorFlow</a>.
We introduced overparameterization by simply placing two matrices in succession instead of the matrix in each dense layer.
With an addition of roughly 15% in number of parameters, optimization accelerated by orders of magnitude:</p>
<p style="text-align:center;">
<img src="/assets/acc_oprm_cnn_exp.png" width="40%" alt="TensorFlow MNIST CNN experiment" />
</p>
<p>We note that similar experiments on other convolutional networks also gave rise to a speedup, but not nearly as prominent as the above.
Empirical characterization of conditions under which overparameterization accelerates optimization in non-linear settings is potentially an interesting direction for future research.</p>
<h2 id="conclusion">Conclusion</h2>
<p>Our work provides insight into benefits of depth in the form of overparameterization, from the perspective of optimization.
Many open questions and problems remain.
For example, is it possible to rigorously analyze the acceleration effect of the end-to-end update rule (analogously to, say, <a href="http://www.cis.pku.edu.cn/faculty/vision/zlin/1983-A%20Method%20of%20Solving%20a%20Convex%20Programming%20Problem%20with%20Convergence%20Rate%20O(k%5E(-2))_Nesterov.pdf">Nesterov 1983</a> or <a href="http://www.jmlr.org/papers/volume12/duchi11a/duchi11a.pdf">Duchi et al. 2011</a>)?
Treatment of non-linear deep networks is of course also of interest, as well as more extensive empirical evaluation.</p>
<p><a href="http://www.cohennadav.com/">Nadav Cohen</a></p>
Fri, 02 Mar 2018 05:00:00 -0800
http://offconvex.github.io/2018/03/02/acceleration-overparameterization/
http://offconvex.github.io/2018/03/02/acceleration-overparameterization/Proving generalization of deep nets via compression<p>This post is about <a href="https://arxiv.org/abs/1802.05296">my new paper with Rong Ge, Behnam Neyshabur, and Yi Zhang</a> which offers some new perspective into the generalization mystery for deep nets discussed in
<a href="http://www.offconvex.org/2017/12/08/generalization1/">my earlier post</a>. The new paper introduces an elementary compression-based framework for proving generalization bounds. It shows that deep nets are highly noise stable, and consequently, compressible. The framework also gives easy proofs (sketched below) of some papers that appeared in the past year.</p>
<p>Recall that the <strong>basic theorem</strong> of generalization theory says something like this: if training set had $m$ samples then the <em>generalization error</em> —defined as the difference between error on training data and test data (aka held out data)— is of the order of $\sqrt{N/m}$. Here $N$ is the number of <em>effective parameters</em> (or <em>complexity measure</em>) of the net; it is at most the actual number of trainable parameters but could be much less. (For ease of exposition this post will ignore nuisance factors like $\log N$ etc. which also appear in the these calculations.) The mystery is that networks with millions of parameters have low generalization error even when $m =50K$ (as in CIFAR10 dataset), which suggests that the number of true parameters is actually much less than $50K$. The papers <a href="https://arxiv.org/abs/1706.08498">Bartlett et al. NIPS’17</a> and <a href="https://openreview.net/forum?id=Skz_WfbCZ">Neyshabur et al. ICLR’18</a>
try to quantify the complexity measure using very interesting ideas like Pac-Bayes and Margin (which influenced our paper). But ultimately the quantitative estimates are fairly vacuous —orders of magnitude <em>more</em> than the number of <em>actual parameters.</em> By contrast our new estimates are several orders of magnitude better, and on the verge of being interesting. See the following bar graph on a log scale. (All bounds are listed ignoring “nuisance factors.” Number of trainable parameters is included only to indicate scale.)</p>
<p style="text-align:center;">
<img src="/assets/saddle_eff/acompare.png" width="75%" alt="comparison of bounds from various recent papers" />
</p>
<h2 id="the-compression-approach">The Compression Approach</h2>
<p>The compression approach takes a deep net $C$ with $N$ trainable parameters and tries to compress it to another one $\hat{C}$ that has (a) much fewer parameters $\hat{N}$ than $C$ and (b) has roughly the same training error as $C$.</p>
<p>Then the above basic theorem guarantees that so long as the number of training samples exceeds $\hat{N}$, then $\hat{C}$ <em>does</em> generalize well (even if $C$ doesn’t). An extension of this approach says that the same conclusions holds if we let the compression algorithm to depend upon an arbitrarily long <em>random string</em> provided this string is fixed in advance of seeing the training data. We call this <em>compression with respect to fixed string</em> and rely upon it.</p>
<p>Note that the above approach proves good generalization of the compressed $\hat{C}$, not the original $C$. (I suspect the ideas may extend to proving good generalization of the original $C$; the hurdles seem technical rather than inherent.) Something similar was true of earlier approaches using PAC-Bayes bounds, which also prove the generalization of some net related to $C$, not of $C$ itself. (Hence the tongue-in-cheek title of the classic reference <a href="http://www.cs.cmu.edu/~jcl/papers/nn_bound/not_bound.pdf">Langford-Caruana2002</a>.)</p>
<p>Of course, in practice deep nets are well-known to be compressible using a slew of ideas—by factors of 10x to 100x; see <a href="https://arxiv.org/abs/1710.09282">the recent survey</a>. However, usually such compression involves <em>retraining</em> the compressed net. Our paper doesn’t consider retraining the net (since it involves reasoning about the loss landscape) but followup work should look at this.</p>
<h2 id="flat-minima-and-noise-stability">Flat minima and Noise Stability</h2>
<p>Modern generalization results can be seen as proceeding via some formalization of a <em>flat minimum</em> of the loss landscape. This was suggested in 1990s as the source of good generalization <a href="http://www.bioinf.jku.at/publications/older/3304.pdf">Hochreiter and Schmidhuber 1995</a>. Recent empirical work of <a href="https://arxiv.org/abs/1609.04836">Keskar et al 2016</a> on modern deep architectures finds that flatness does correlate with better generalization, though the issue is complicated, as discussed in an upcoming post by Behnam Neyshabur.</p>
<p style="text-align:center;">
<img src="/assets/saddle_eff/aflatminima.png" width="65%" alt="Flat vs sharp minima" />
</p>
<p>Here’s the intuition why a flat minimum should generalize better, as originally articulated by <a href="http://www.cs.toronto.edu/~fritz/absps/colt93.pdf">Hinton and Camp 1993</a>. Crudely speaking, suppose a flat minimum is one that occupies “volume” $\tau$ in the landscape. (The flatter the minimum, the higher $\tau$ is.) Then the number of <em>distinct</em> flat minima in the landscape is at most $S =\text{total volume}/\tau$. Thus one can number the flat minima from $1$ to $S$, implying that a flat minimum can be represented using $\log S$ bits. The above-mentioned <em>basic theorem</em> implies that flat minima generalize if the number of training samples $m$ exceeds $\log S$.</p>
<p>PAC-Bayes approaches try to formalize the above intuition by defining a flat minimum as follows: it is a net $C$ such that adding appropriately-scaled gaussian noise to all its trainable parameters does not greatly affect the training error. This allows quantifying the “volume” above in terms of probability/measure (see
<a href="http://www.cs.princeton.edu/courses/archive/fall17/cos597A/lecnotes/generalize.pdf">my lecture notes</a> or <a href="https://arxiv.org/abs/1703.11008">Dziugaite-Roy</a>) and yields some explicit estimates on sample complexity. However, obtaining good quantitative estimates from this calculation has proved difficut, as seen in the bar graph earlier.</p>
<p>We formalize “flat minimum” using noise stability of a slightly different form. Roughly speaking, it says that if we inject appropriately scaled gaussian noise at the output of some layer, then this noise gets attenuated as it propagates up to higher layers. (Here “top” direction refers to the output of the net.) This is obviously related to notions like dropout, though it arises also in nets that are not trained with dropout. The following figure illustrates how noise injected at a certain layer of VGG19 (trained on CIFAR10) affects the higher layer. The y-axis denote the magnitude of the noise ($\ell_2$ norm) as a multiple of the vector being computed at the layer, and shows how a single noise vector quickly attenuates as it propagates up the layers.</p>
<p style="text-align:center;">
<img src="/assets/saddle_eff/attenuate.jpg" width="65%" alt="How noises attenuates as it travels up the layers of VGG." />
</p>
<p>Clearly, computation of the trained net is highly resistant to noise. (This has obvious implications for biological neural nets…)
Note that the training involved no explicit injection of noise (eg dropout). Of course, stochastic gradient descent <em>implicitly</em> adds noise to the gradient, and it would be nice to investigate more rigorously if the noise stability arises from this or from some other source.</p>
<h2 id="noise-stability-and-compressibility-of-single-layer">Noise stability and compressibility of single layer</h2>
<p>To understand why noise-stable nets are compressible, let’s first understand noise stability for a single layer in the net, where we ignore the nonlinearity. Then this layer is just a linear transformation, i.e., matrix $M$.</p>
<p style="text-align:center;">
<img src="/assets/saddle_eff/alinear.png" width="40%" alt="matrix M describing a single layer" />
</p>
<p>What does it mean that this matrix’s output is stable to noise? Suppose the vector at the previous layer is a unit vector $x$. This is the output of the lower layers on an actual sample, so $x$ can be thought of as the “signal” for the current layer. The matrix converts $x$ into $Mx$. If we inject a noise vector $\eta$ of unit norm at $x$ then the output must become $M(x +\eta)$. We say $M$ is noise stable for input $x$ if such noising affects the output very little, which implies the norm of $Mx$ is much higher than that of $M \eta$.
The former is at most $\sigma_{max}(M)$, the largest singular value of $M$. The latter is approximately
$(\sum_i \sigma_i(M)^2)^{1/2}/\sqrt{h}$ where $\sigma_i(M)$ is the $i$th singular value of $M$ and $h$ is dimension of $Mx$. The reason is that gaussian noise divides itself evenly across all directions, with variance in each direction $1/h$.
We conclude that:
<script type="math/tex">(\sigma_{max}(M))^2 \gg \frac{1}{h} \sum_i (\sigma_i(M)^2),</script></p>
<p>which implies that the matrix has an uneven distribution of singular values. Ratio of left side and right side is called the <a href="https://nickhar.wordpress.com/2012/02/29/lecture-15-low-rank-approximation-of-matrices/"><em>stable rank</em></a> and is at most the linear algebraic rank. Furthermore, the above analysis suggests that the “signal” $x$ is <em>correlated</em> with the singular directions corresponding to the higher singular values, which is at the root of the noise stability.</p>
<p>Our experiments on VGG and GoogleNet reveal that the higher layers of deep nets—where most of the net’s parameters reside—do indeed exhibit a highly uneven distribution of singular values, and that the signal aligns more with the higher singular directions. The figure below describes layer 10 in VGG19 trained on CIFAR10.</p>
<p style="text-align:center;">
<img src="/assets/saddle_eff/aspectrumlayer10.png" width="45%" alt="distribution of singular values of matrix at layer 10 of VGG19" />
</p>
<h2 id="compressing-multilayer-net">Compressing multilayer net</h2>
<p>The above analysis of noise stability in terms of singular values cannot hold across multiple layers of a deep net, because the mapping becomes nonlinear, thus lacking a notion of singular values. Noise stability is therefore formalized using the <a href="https://en.wikipedia.org/wiki/Jacobian_matrix_and_determinant">Jacobian</a> of this mapping, which is the matrix describing how the output reacts to tiny perturbations of the input. Noise stability says that this nonlinear mapping passes signal (i.e., the vector from previous layers) much more strongly than it does a noise vector.</p>
<p>Our compression algorithm applies a randomized transformation to the matrix of each layer (aside: note the use of randomness, which fits in our “compressing with fixed string” framework) that relies on the low stable rank condition at each layer. This compression introduces error in the layer’s output, but the vector describing this error is “gaussian-like” due to the use of randomness in the compression. Thus this error gets attenuated by higher layers.</p>
<p>Details can be found in the paper. All noise stability properties formalized there are later checked in the experiments section.</p>
<h2 id="simpler-proofs-of-existing-generalization-bounds">Simpler proofs of existing generalization bounds</h2>
<p>In the paper we also use our compression framework to give elementary (say, 1-page) proofs of the previous generalization bounds from the past year. For example, the paper of <a href="https://openreview.net/forum?id=Skz_WfbCZ">Neyshabur et al.</a> shows the following is an upper bound on the generalization error where $A_i$ is the matrix describing the $i$th layer.</p>
<p style="text-align:center;">
<img src="/assets/saddle_eff/aexpression1.png" width="50%" alt="Expression for effective number of parameters in Neyshabur et al" />
</p>
<p>Comparing to the <em>basic theorem</em>, we realize the numerator corresponds to the number of effective parameters. The second part of the expression is the sum of stable ranks of the layer matrices, and is a natural measure of complexity. The first part is product of spectral norms (= top singular value) of the layer matrices, which happens to be an upper bound on the Lipschitz constant of the entire network. (Lipschitz constant of a mapping $f$ in this context is a constant $L$ such that $f(x) \leq L c\dot |x|$.)
The reason this is the Lipschitz constant is that if an input $x$ is presented at the bottom of the net, then each successive layer can multiply its norm by at most the top singular value, and the ReLU nonlinearity can only decrease norm since its only action is to zero out some entries.</p>
<p>Having decoded the above expression, it is clear how to interpret it as an analysis of a (deterministic) compression of the net. Compress each layer by zero-ing out (in the <a href="https://en.wikipedia.org/wiki/Singular-value_decomposition">SVD</a>) singular values less than some threshold $t|A|$, which we hope turns it into a low rank matrix. (Recall that a matrix with rank $r$ can be expressed using $2nr$ parameters.) A simple computation shows that the number of remaining singular values is at most the stable rank divided by $t^2$. How do we set $t$? The truncation introduces error in the layer’s computation, which gets propagated through the higher layers and magnified at most by the Lipschitz constant. We want to make this propagated error small, which can be done by making $t$ inversely proportional to the Lipschitz constant. This leads to the above bound on the number of effective parameters.</p>
<p>This proof sketch also clarifies how our work improves upon the older works: they are also (implicitly) compressing the deep net, but their analysis of how much compression is possible is much more pessimistic because they assume the network transmits noise at peak efficiency given by the Lipschitz constant.</p>
<h2 id="extending-the-ideas-to-convolutional-nets">Extending the ideas to convolutional nets</h2>
<p>Convolutional nets could not be dealt with cleanly in the earlier papers. I must admit that handling convolution stumped us as too for a while. A layer in a convolutional net applies the same filter to all patches in that layer. This <em>weight sharing</em> means that the full layer matrix already has a fairly compact representation, and it seems challenging to compress this further. However, in nets like VGG and GoogleNet, the higher layers use rather large filter matrices (i.e., they use a large number of channels), and one could hope to compress these individual filter matrices.</p>
<p>Let’s discuss the two naive ideas. The first is to compress the filter independently in different patches. This unfortunately is not a compression at all, since each copy of the filter then comes with its own parameters. The second idea is to do a single compression of the filter and use the compressed copy in each patch. This messes up the error analysis because the errors introduced due to compression in the different copies are now correlated, whereas the analysis requires them to be more like gaussian.</p>
<p>The idea we end up using is to compress the filters using $k$-wise independence (an idea from <a href="https://en.wikipedia.org/wiki/K-independent_hashing">theory of hashing schemes</a>), where $k$ is roughly logarithmic in the number of training samples.</p>
<h2 id="concluding-thoughts">Concluding thoughts</h2>
<p>While generalization theory can seem merely academic at times —since in practice held-out data establishes generalizaton— I hope you see from the above account that understanding generalization can give some interesting insights into what is going on in deep net training. Insights about noise stability of trained deep nets have obvious interest for study of biological neural nets. (See also the classic <a href="http://fab.cba.mit.edu/classes/862.16/notes/computation/vonNeumann-1956.pdf">von Neumann ideas</a> on noise resilient computation.)</p>
<p>At the same time, I suspect that compressibility is only one part of the generalization mystery, and that we are still missing some big idea. I don’t see how to use the above ideas to demonstrate that the effective number of parameters in VGG19 is as low as $50k$, as seems to be the case. I suspect doing so will force us to understand the structure of the data (in this case, real-life images) which the above analysis mostly ignores. The only property of data used is that the deep net aligns itself better with data than with noise.</p>
Sat, 17 Feb 2018 08:00:00 -0800
http://offconvex.github.io/2018/02/17/generalization2/
http://offconvex.github.io/2018/02/17/generalization2/Generalization Theory and Deep Nets, An introduction<p>Deep learning holds many mysteries for theory, as we have discussed on this blog. Lately many ML theorists have become interested in the generalization mystery: why do trained deep nets perform well on previously unseen data, even though they have way more free parameters than the number of datapoints (the classic “overfitting” regime)? Zhang et al.’s paper <a href="https://arxiv.org/abs/1611.03530">Understanding Deep Learning requires Rethinking Generalization</a> played some role in bringing attention to this challenge. Their main experimental finding is that if you take a classic convnet architecture, say <a href="https://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf">Alexnet</a>, and train it on images with random labels, then you can still achieve very high accuracy on the training data. (Furthermore, usual regularization strategies, which are believed to promote better generalization, do not help much.) Needless to say, the trained net is subsequently unable to predict the (random) labels of still-unseen images, which means it doesn’t generalize. The paper notes that the ability to fit a classifier to data with random labels is also a traditional measure in machine learning called Rademacher complexity (which we will discuss shortly) and thus Rademacher complexity gives no meaningful bounds on sample complexity. I found this paper entertainingly written and recommend reading it, despite having given away the punchline. Congratulations to the authors for winning best paper at ICLR 2017.</p>
<p>But I would be remiss if I didn’t report that at the <a href="https://simons.berkeley.edu/programs/machinelearning2017">Simons Institute Semester on theoretical ML in spring 2017</a> generalization theory experts expressed unhappiness about this paper, and especially its title. They felt that similar issues had been extensively studied in context of simpler models such as kernel SVMs (which, to be fair, is clearly mentioned in the paper). It is trivial to design SVM architectures with high Rademacher complexity which nevertheless train and generalize well on real-life data. Furthermore, theory was developed to explain this generalization behavior (and also for related models like boosting). On a related note, several earlier papers of Behnam Neyshabur and coauthors (see <a href="https://arxiv.org/abs/1605.07154">this paper</a> and for a full account, <a href="https://arxiv.org/abs/1703.11008">Behnam’s thesis</a>)
had made points fairly similar to Zhang et al. pertaining to deep nets.</p>
<p>But regardless of such complaints, we should be happy about the attention brought by Zhang et al.’s paper to a core theory challenge. Indeed, the passionate discussants at the Simons semester themselves banded up in subgroups to address this challenge: these resulted in papers by <a href="https://arxiv.org/abs/1703.11008">Dzigaite and Roy</a>, then <a href="https://arxiv.org/abs/1706.08498">Bartlett, Foster, and Telgarsky</a> and finally <a href="https://arxiv.org/abs/1707.09564">Neyshabur, Bhojapalli, MacAallester, Srebro</a>. (The latter two were presented at NIPS’17 this week.)</p>
<p>Before surveying these results let me start by suggesting that some of the controversy over the title of Zhang et al.’s paper stems from some basic confusion about whether or not current generalization theory is prescriptive or merely descriptive. These confusions arise from the standard treatment of generalization theory in courses and textbooks, as I discovered while teaching the recent developments in <a href="http://www.cs.princeton.edu/courses/archive/fall17/cos597A/">my graduate seminar</a>.</p>
<h3 id="prescriptive-versus-descriptive-theory">Prescriptive versus descriptive theory</h3>
<p>To illustrate the difference, consider a patient who says to his doctor: “Doctor, I wake up often at night and am tired all day.”</p>
<blockquote>
<p>Doctor 1 (without any physical examination): “Oh, you have sleep disorder.”</p>
</blockquote>
<p>I call such a diagnosis <em>descriptive</em>, since it only attaches a label to the patient’s problem, without giving any insight into how to solve the problem. Contrast with:</p>
<blockquote>
<p>Doctor 2 (after careful physical examination): “A growth in your sinus is causing sleep apnea. Removing it will resolve your problems.”</p>
</blockquote>
<p>Such a diagnosis is <em>prescriptive.</em></p>
<h2 id="generalization-theory-descriptive-or-prescriptive">Generalization theory: descriptive or prescriptive?</h2>
<p>Generalization theory notions such as VC dimension, Rademacher complexity, and PAC-Bayes bound, consist of attaching a <em>descriptive label</em> to the basic phenomenon of lack of generalization. They are hard to compute for today’s complicated ML models, let alone to use as a guide in designing learning systems.</p>
<p>Recall what it means for a hypothesis/classifier $h$ to not generalize. Assume the training data consists of a sample $S = {(x_1, y_1), (x_2, y_2),\ldots, (x_m, y_m)}$ of $m$ examples from some distribution ${\mathcal D}$. A <em>loss function</em> $\ell$ describes how well hypothesis $h$ classifies a datapoint: the loss $\ell(h, (x, y))$ is high if the hypothesis didn’t come close to producing the label $y$ on $x$ and low if it came close. (To give an example, the <em>regression</em> loss is $(h(x) -y)^2$.) Now let us denote by $\Delta_S(h)$ the average loss on samplepoints in $S$, and by $\Delta_{\mathcal D}(h)$ the expected loss on samples from distribution ${\mathcal D}$.
Training <em>generalizes</em> if the hypothesis $h$ that minimises $\Delta_S(h)$ for a random sample $S$ also achieves very similarly low loss $\Delta_{\mathcal D}(h)$ on the full distribution. When this fails to happen, we have:</p>
<blockquote>
<p><strong>Lack of generalization:</strong> $\Delta_S(h) \ll \Delta_{\mathcal D}(h) \qquad (1). $</p>
</blockquote>
<p>In practice, lack of generalization is detected by taking a second sample
(“held out set”) $S_2$ of size $m$ from ${\mathcal D}$. By concentration bounds expected loss of $h$ on this second sample closely approximates $\Delta_{\mathcal D}(h)$, allowing us to conclude</p>
<script type="math/tex; mode=display">\Delta_S(h) - \Delta_{S_2}(h) \ll 0 \qquad (2).</script>
<h3 id="generalization-theory-descriptive-parts">Generalization Theory: Descriptive Parts</h3>
<p>Let’s discuss <strong>Rademacher complexity,</strong> which I will simplify a bit for this discussion. (See also <a href="http://www.cs.princeton.edu/courses/archive/fall17/cos597A/lecnotes/generalize.pdf">scribe notes of my lecture</a>.) For convenience assume in this discussion that labels and loss are $0,1$, and
assume that the badly generalizing $h$ predicts perfectly on the training sample $S$ and is completely wrong on the heldout set $S_2$, meaning</p>
<p>$\Delta_S(h) - \Delta_{S_2}(h) \approx - 1 \qquad (3)$</p>
<p>Rademacher complexity concerns the following thought experiment. Take a single sample of size $2m$ from $\mathcal{D}$, split it into two and call the first half $S$ and the second $S_2$. <em>Flip</em> the labels of points in $S_2$. Now try to find a classifier $C$ that best describes this new sample, meaning one that minimizes $\Delta_S(h) + 1- \Delta_{S_2}(h)$. This expression follows since flipping the label of a point turns good classification into bad and vice versa, and thus the loss function for $S_2$ is $1$ minus the old loss. We say the class of classifiers has high Rademacher complexity if with high probability this quantity is small, say close to $0$.</p>
<p>But a glance at (3) shows that it implies high Rademacher complexity: $S, S_2$ were random samples of size $m$ from $\mathcal{D}$, so their combined size is $2m$, and when generalization failed we succeeded in finding a hypothesis $h$ for which $\Delta_S(h) + 1- \Delta_{S_2}(h)$ is very small.</p>
<p>In other words, returning to our medical analogy, the doctor only had to hear “Generalization didn’t happen” to pipe up with: “Rademacher complexity is high.” This is why I call this result descriptive.</p>
<p>The <strong>VC dimension</strong> bound is similarly descriptive. VC dimension is defined to be at least $k +1$ if there exists a set of size $k$ such that the following is true. If we look at all possible classifiers in the class, and the sequence of labels each gives to the $k$ datapoints in the sample, then we can find all possible $2^{k}$ sequences of $0$’s and $1$’s.</p>
<p>If generalization does not happen as in (2) or (3) then this turns out to imply that VC dimension is at least around $\epsilon m$ for some $\epsilon >0$. The reason is that the $2m$ data points were split randomly into $S, S_2$, and there are $2^{2m}$ such splittings. When the generalization error is $\Omega(1)$ this can be shown to imply that we can achieve $2^{\Omega(m)}$ labelings of the $2m$ datapoints using all possible classifiers. Now the classic Sauer’s lemma (see any lecture notes on this topic, such as <a href="https://www.cs.princeton.edu/courses/archive/spring14/cos511/scribe_notes/0220.pdf">Schapire’s</a>) can be used to show that
VC dimension is at least $\epsilon m/\log m$ for some constant $\epsilon>0$.</p>
<p>Thus again, the doctor only has to hear “Generalization didn’t happen with sample size $m$” to pipe up with: “VC dimension is higher than $\Omega(m/log m)$.”</p>
<p>One can similarly show that PAC-Bayes bounds are also descriptive, as you can see in <a href="http://www.cs.princeton.edu/courses/archive/fall17/cos597A/lecnotes/generalize.pdf">scribe notes from my lecture</a>.</p>
<blockquote>
<p>Why do students get confused and think that such tools of generalization theory gives some powerful technique to guide design of machine learning algorithms?</p>
</blockquote>
<p>Answer: Probably because standard presentation in lecture notes and textbooks seems to pretend that we are computationally-omnipotent beings who can <em>compute</em> VC dimension and Rademacher complexity and thus arrive at meaningful bounds on sample sizes needed for training to generalize. While this may have been possible in the old days with simple classifiers, today we have
complicated classifiers with millions of variables, which furthermore are products of nonconvex optimization techniques like backpropagation.
The only way to actually lowerbound Rademacher complexity of such complicated learning architectures is to try training a classifier, and detect lack of generalization via a held-out set. Every practitioner in the world already does this (without realizing it), and kudos to Zhang et al. for highlighting that theory currently offers nothing better.</p>
<h2 id="toward-a-prescriptive-generalization-theory-the-new-papers">Toward a prescriptive generalization theory: the new papers</h2>
<p>In our medical analogy we saw that the doctor needs to at least do a physical examination to have a prescriptive diagnosis. The authors of the new papers intuitively grasp this point, and try to identify properties of real-life deep nets that may lead to better generalization. Such an analysis (related to “margin”) was done for simple 2-layer networks couple decades ago, and the challenge is to find analogs for multilayer networks. Both Bartlett et al. and Neyshabur et al. hone in on <a href="https://nickhar.wordpress.com/2012/02/29/lecture-15-low-rank-approximation-of-matrices/"><em>stable rank</em></a> of the weight matrices of the layers of the deep net. These can be seen as an instance of a “flat minimum” which has been discussed in <a href="http://www.bioinf.jku.at/publications/older/3304.pdf">neural nets literature</a> for many years. I will present my take on these results as well as some improvements in a future post. Note that these methods do not as yet give any nontrivial bounds on the number of datapoints needed for training the nets in question.</p>
<p><a href="https://arxiv.org/abs/1703.11008">Dziugaite and Roy</a> take a slightly different tack. They start with McAllester’s 1999 PAC-Bayes bound, which says that if the algorithm’s prior distribution on the hypotheses is $P$ then for every posterior distributions $Q$ (which could depend on the data) on the hypotheses the generalization error of the average classifier picked according to $Q$ is upper bounded as follows where $D()$ denotes <a href="https://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence">KL divergence</a>:</p>
<div style="text-align:center;">
<img style="width:600px;" src="http://www.cs.princeton.edu/courses/archive/fall17/cos597A/lecnotes/pacbayes.png" />
</div>
<p>This allows upperbounds on generalization error (specifically, upperbounds on number of samples that guarantee such an upperbound) by proceeding as in <a href="http://www.cs.cmu.edu/~jcl/papers/nn_bound/not_bound.pdf">Langford and Caruana’s old paper</a> where $P$ is a uniform gaussian, and $Q$ is a noised version of the trained deep net (whose generalization we are trying to explain). Specifically, if $w_{ij}$ is the weight of edge ${i, j}$ in the trained net, then $Q$ consists of adding a gaussian noise $\eta_{ij}$ to weight $w_{ij}$. Thus a random classifier according to $Q$ is nothing but a noised version of the trained net. Now we arrive at the crucial idea: Use nonconvex optimization to find a choice for the variance of $\eta_{ij}$ that balances two competing criteria: (a) the average classifier drawn from $Q$ has training error not much more than the original trained net (again, this is a quantification of the “flatness” of the minimum found by the optimization) and (b) the right hand side of the above expression is as small as possible. Assuming (a) and (b) can be suitably bounded, it follows that the average classifier from Q works reasonably well on unseen data. (Note that this method only proves generalization of a noised version of the trained classifier.)</p>
<p>Applying this method on simple fully-connected neural nets trained on MNIST dataset, they can prove that the method achieves error $17$ percent error on MNIST (whereas the <em>actual</em> error is much lower at 2-3 percent). Hence the title of their paper, which promises <em>nonvacuous generalization bounds.</em> What I find most interesting about this result is that it uses the power of nonconvex optimization (harnessed above to find a suitable noised distribution $Q$) to cast light on one of the metaquestions about nonconvex optimization, namely, why does deep learning not overfit!</p>
Fri, 08 Dec 2017 10:00:00 -0800
http://offconvex.github.io/2017/12/08/generalization1/
http://offconvex.github.io/2017/12/08/generalization1/How to Escape Saddle Points Efficiently<p>A core, emerging problem in nonconvex optimization involves the escape of saddle points. While recent research has shown that gradient descent (GD) generically escapes saddle points asymptotically (see <a href="http://www.offconvex.org/2016/03/22/saddlepoints/">Rong Ge’s</a> and <a href="http://www.offconvex.org/2016/03/24/saddles-again/">Ben Recht’s</a> blog posts), the critical open problem is one of <strong>efficiency</strong> — is GD able to move past saddle points quickly, or can it be slowed down significantly? How does the rate of escape scale with the ambient dimensionality? In this post, we describe <a href="https://arxiv.org/abs/1703.00887">our recent work with Rong Ge, Praneeth Netrapalli and Sham Kakade</a>, that provides the first provable <em>positive</em> answer to the efficiency question, showing that, rather surprisingly, GD augmented with suitable perturbations escapes saddle points efficiently; indeed, in terms of rate and dimension dependence it is almost as if the saddle points aren’t there!</p>
<h2 id="perturbing-gradient-descent">Perturbing Gradient Descent</h2>
<p>We are in the realm of classical gradient descent (GD) — given a function $f:\mathbb{R}^d \to \mathbb{R}$ we aim to minimize the function by moving in the direction of the negative gradient:</p>
<script type="math/tex; mode=display">x_{t+1} = x_t - \eta \nabla f(x_t),</script>
<p>where $x_t$ are the iterates and $\eta$ is the step size. GD is well understood theorietically in the case of convex optimization, but the general case of nonconvex optimization has been far less studied. We know that GD converges quickly to the neighborhood of stationary points (points where $\nabla f(x) = 0$) in the nonconvex setting, but these stationary points may be local minima or, unhelpfully, local maxima or saddle points.</p>
<p>Clearly GD will never move away from a stationary point if started there (even a local maximum); thus, to provide general guarantees, it is necessary to modify GD slightly to incorporate some degree of randomness. Two simple methods have been studied in the literature:</p>
<ol>
<li>
<p><strong>Intermittent Perturbations</strong>: <a href="http://arxiv.org/abs/1503.02101">Ge, Huang, Jin and Yuan 2015</a> considered adding occasional random perturbations to GD, and were able to provide the first <em>polynomial time</em> guarantee for GD to escape saddle points. (See also <a href="http://www.offconvex.org/2016/03/22/saddlepoints/">Rong Ge’s post</a> )</p>
</li>
<li>
<p><strong>Random Initialization</strong>: <a href="http://arxiv.org/abs/1602.04915">Lee et al. 2016</a> showed that with only random initialization, GD provably avoids saddle points asymptotically (i.e., as the number of steps goes to infinity). (see also <a href="http://www.offconvex.org/2016/03/24/saddles-again/">Ben Recht’s post</a>)</p>
</li>
</ol>
<p>Asymptotic — and even polynomial time —results are important for the general theory, but they stop short of explaining the success of gradient-based algorithms in practical nonconvex problems. And they fail to provide reassurance that runs of GD can be trusted — that we won’t find ourselves in a situation in which the learning curve flattens out for an indefinite amount of time, with the user having no way of knowing that the asymptotics have not yet kicked in. Lastly, they fail to provide reassurance that GD has the kind of favorable properties in high dimensions that it is known to have for convex problems.</p>
<p>One reasonable approach to this issue is to consider second-order (Hessian-based) algorithms. Although these algorithms are generally (far) more expensive per iteration than GD, and can be more complicated to implement, they do provide the kind of geometric information around saddle points that allows for efficient escape. Accordingly, a reasonable understanding of Hessian-based algorithms has emerged in the literature, and positive efficiency results have been obtained.</p>
<p><strong><em>Is GD also efficient? Or is the Hessian necessary for fast escape of saddle points?</em></strong></p>
<p>A negative result emerges to this first question if one considers the random initialization strategy discussed. Indeed, this approach is provably <em>inefficient</em> in general, taking exponential time to escape saddle points in the worst case (see “On the Necessity of Adding Perturbations” section).</p>
<p>Somewhat surprisingly, it turns out that we obtain a rather different — and <em>positive</em> — result if we consider the perturbation strategy. To be able to state this result, let us be clear on the algorithm that we analyze:</p>
<blockquote>
<p><strong>Perturbed gradient descent (PGD)</strong></p>
<ol>
<li><strong>for</strong> $~t = 1, 2, \ldots ~$ <strong>do</strong></li>
<li>$\quad\quad x_{t} \leftarrow x_{t-1} - \eta \nabla f (x_{t-1})$</li>
<li>$\quad\quad$ <strong>if</strong> $~$<em>perturbation condition holds</em>$~$ <strong>then</strong></li>
<li>$\quad\quad\quad\quad x_t \leftarrow x_t + \xi_t$</li>
</ol>
</blockquote>
<p>Here the perturbation $\xi_t$ is sampled uniformly from a ball centered at zero with a suitably small radius, and is added to the iterate when the gradient is suitably small. These particular choices are made for analytic convenience; we do not believe that uniform noise is necessary. nor do we believe it essential that noise be added only when the gradient is small.</p>
<h2 id="strict-saddle-and-second-order-stationary-points">Strict-Saddle and Second-order Stationary Points</h2>
<p>We define <em>saddle points</em> in this post to include both classical saddle points as well as local maxima. They are stationary points which are locally maximized along <em>at least one direction</em>. Saddle points and local minima can be categorized according to the minimum eigenvalue of Hessian:</p>
<script type="math/tex; mode=display">% <![CDATA[
\lambda_{\min}(\nabla^2 f(x)) \begin{cases}
> 0 \quad\quad \text{local minimum} \\
= 0 \quad\quad \text{local minimum or saddle point} \\
< 0 \quad\quad \text{saddle point}
\end{cases} %]]></script>
<p>We further call the saddle points in the last category, where $\lambda_{\min}(\nabla^2 f(x)) < 0$, <strong>strict saddle points</strong>.</p>
<p style="text-align:center;">
<img src="/assets/saddle_eff/strictsaddle.png" width="85%" alt="Strict and Non-strict Saddle Point" />
</p>
<p>While non-strict saddle points can be flat in the valley, strict saddle points require that there is <em>at least one direction</em> along which the curvature is strictly negative. The presence of such a direction gives a gradient-based algorithm the possibility of escaping the saddle point. In general, distinguishing local minima and non-strict saddle points is <em>NP-hard</em>; therefore, we — and previous authors — focus on escaping <em>strict</em> saddle points.</p>
<p>Formally, we make the following two standard assumptions regarding smoothness.</p>
<blockquote>
<p><strong>Assumption 1</strong>: $f$ is $\ell$-gradient-Lipschitz, i.e. <br />
$\quad\quad\quad\quad \forall x_1, x_2, |\nabla f(x_1) - \nabla f(x_2)| \le \ell |x_1 - x_2|$. <br />
$~$<br />
<strong>Assumption 2</strong>: $f$ is $\rho$-Hessian-Lipschitz, i.e. <br />
$\quad\quad\quad\quad \forall x_1, x_2$, $|\nabla^2 f(x_1) - \nabla^2 f(x_2)| \le \rho |x_1 - x_2|$.</p>
</blockquote>
<p>Similarly to classical theory, which studies convergence to a first-order stationary point, $\nabla f(x) = 0$, by bounding the number of iterations to find a <strong>$\epsilon$-first-order stationary point</strong>, $|\nabla f(x)| \le \epsilon$, we formulate the speed of escape of strict saddle points and the ensuing convergence to a second-order stationary point, $\nabla f(x) = 0, \lambda_{\min}(\nabla^2 f(x)) \ge 0$, with an $\epsilon$-version of the definition:</p>
<blockquote>
<p><strong>Definition</strong>: A point $x$ is an <strong>$\epsilon$-second-order stationary point</strong> if:<br />
$\quad\quad\quad\quad |\nabla f(x)|\le \epsilon$, and $\lambda_{\min}(\nabla^2 f(x)) \ge -\sqrt{\rho \epsilon}$.</p>
</blockquote>
<p>In this definition, $\rho$ is the Hessian Lipschitz constant introduced above. This scaling follows the convention of <a href="http://rd.springer.com/article/10.1007%2Fs10107-006-0706-8">Nesterov and Polyak 2006</a>.</p>
<h3 id="applications">Applications</h3>
<p>In a wide range of practical nonconvex problems it has been proved that <strong>all saddle points are strict</strong> — such problems include, but not are limited to, principal components analysis, canonical correlation analysis,
<a href="http://arxiv.org/abs/1503.02101">orthogonal tensor decomposition</a>,
<a href="http://arxiv.org/abs/1602.06664">phase retrieval</a>,
<a href="http://arxiv.org/abs/1504.06785">dictionary learning</a>,
<!-- matrix factorization, -->
<a href="http://arxiv.org/abs/1605.07221">matrix sensing</a>,
<a href="http://arxiv.org/abs/1605.07272">matrix completion</a>,
and <a href="http://arxiv.org/abs/1704.00708">other nonconvex low-rank problems</a>.</p>
<p>Furthermore, in all of these nonconvex problems, it also turns out that <strong>all local minima are global minima</strong>. Thus, in these cases, any general efficient algorithm for finding $\epsilon$-second-order stationary points immediately becomes an efficient algorithm for solving those nonconvex problem with global guarantees.</p>
<h2 id="escaping-saddle-point-with-negligible-overhead">Escaping Saddle Point with Negligible Overhead</h2>
<p>In the classical case of first-order stationary points, GD is known to have very favorable theoretical properties:</p>
<blockquote>
<p><strong>Theorem (<a href="http://rd.springer.com/book/10.1007%2F978-1-4419-8853-9">Nesterov 1998</a>)</strong>: If Assumption 1 holds, then GD, with $\eta = 1/\ell$, finds an $\epsilon$-<strong>first</strong>-order stationary point in $2\ell (f(x_0) - f^\star)/\epsilon^2$ iterations.</p>
</blockquote>
<p>In this theorem, $x_0$ is the initial point and $f^\star$ is the function value of the global minimum. The theorem says for that any gradient-Lipschitz function, a stationary point can be found by GD in $O(1/\epsilon^2)$ steps, with no explicit dependence on $d$. This is called “dimension-free optimization” in the literature; of course the cost of a gradient computation is $O(d)$, and thus the overall runtime of GD scales as $O(d)$. The linear scaling in $d$ is especially important for modern high-dimensional nonconvex problems such as deep learning.</p>
<p>We now wish to address the corresponding problem for second-order stationary points.
What is the best we can hope for? Can we also achieve</p>
<ol>
<li>A dimension-free number of iterations;</li>
<li>An $O(1/\epsilon^2)$ convergence rate;</li>
<li>The same dependence on $\ell$ and $(f(x_0) - f^\star)$ as in (Nesterov 1998)?</li>
</ol>
<p>Rather surprisingly, the answer is <em>Yes</em> to all three questions (up to small log factors).</p>
<blockquote>
<p><strong>Main Theorem</strong>: If Assumptions 1 and 2 hold, then PGD, with $\eta = O(1/\ell)$, finds an $\epsilon$-<strong>second</strong>-order stationary point in $\tilde{O}(\ell (f(x_0) - f^\star)/\epsilon^2)$ iterations with high probability.</p>
</blockquote>
<p>Here $\tilde{O}(\cdot)$ hides only logarithmic factors; indeed, the dimension dependence in our result is only $\log^4(d)$. The theorem thus asserts that a perturbed form of GD, under an additional Hessian-Lipschitz condition, <strong><em>converges to a second-order-stationary point in almost the same time required for GD to converge to a first-order-stationary point.</em></strong> In this sense, we claim that PGD can escape strict saddle points almost for free.</p>
<p>We turn to a discussion of some of the intuitions underlying these results.</p>
<h3 id="why-do-polylogd-iterations-suffice">Why do polylog(d) iterations suffice?</h3>
<p>Our strict-saddle assumption means that there is only, in the worst case, one direction in $d$ dimensions along which we can escape. A naive search for the descent direction intuitively should take at least $\text{poly}(d)$ iterations, so why should only $\text{polylog}(d)$ suffice?</p>
<p>Consider a simple case in which we assume that the function is quadratic in the neighborhood of the saddle point. That is, let the objective function be $f(x) = x^\top H x$, a saddle point at zero, with constant Hessian $H = \text{diag}(-1, 1, \cdots, 1)$. In this case, only the first direction is an escape direction (with negative eigenvalue $-1$).</p>
<p>It is straightforward to work out the general form of the iterates in this case:</p>
<script type="math/tex; mode=display">x_t = x_{t-1} - \eta \nabla f(x_{t-1}) = (I - \eta H)x_{t-1} = (I - \eta H)^t x_0.</script>
<p>Assume that we start at the saddle point at zero, then add a perturbation so that $x_0$ is sampled uniformly from a ball $\mathcal{B}_0(1)$ centered at zero with radius one.
The decrease in the function value can be expressed as:</p>
<script type="math/tex; mode=display">f(x_t) - f(0) = x_t^\top H x_t = x_0^\top (I - \eta H)^t H (I - \eta H)^t x_0.</script>
<p>Set the step size to be $1/2$, let $\lambda_i$ denote the $i$-th eigenvalue of the Hessian $H$ and let $\alpha_i = e_i^\top x_0$ denote the component in the $i$th direction of the initial point $x_0$. We have $\sum_{i=1}^d \alpha_i^2 = | x_0|^2 = 1$, thus:</p>
<script type="math/tex; mode=display">f(x_t) - f(0) = \sum_{i=1}^d \lambda_i (1-\eta\lambda_i)^{2t} \alpha_i^2 \le -1.5^{2t} \alpha_1^2 + 0.5^{2t}.</script>
<p>A simple probability argument shows that sampling uniformly in $\mathcal{B}_0(1)$ will result in at least a $\Omega(1/d)$ component in the first direction with high probability. That is, $\alpha^2_1 = \Omega(1/d)$. Substituting $\alpha_1$ in the above equation, we see that it takes at most $O(\log d)$ steps for the function value to decrease by a constant amount.</p>
<h3 id="pancake-shape-stuck-region-for-general-hessian">Pancake-shape stuck region for general Hessian</h3>
<p>We can conclude that for the case of a constant Hessian, only when the perturbation $x_0$ lands in the set $\{x | ~ |e_1^\top x|^2 \le O(1/d)\}$ $\cap \mathcal{B}_0 (1)$, can we take a very long time to escape the saddle point. We call this set the <strong>stuck region</strong>; in this case it is a flat disk. In general, when the Hessian is no longer constant, the stuck region becomes a non-flat pancake, depicted as a green object in the left graph. In general this region will not have an analytic expression.</p>
<p>Earlier attempts to analyze the dynamics around saddle points tried to the approximate stuck region by a flat set. This results in a requirement of an extremely small step size and a correspondingly very large runtime complexity. Our sharp rate depends on a key observation — <em>although we don’t know the shape of the stuck region, we know it is very thin</em>.</p>
<p style="text-align:center;">
<img src="/assets/saddle_eff/flow.png" width="85%" alt="Pancake" />
</p>
<p>In order to characterize the “thinness” of this pancake, we studied pairs of hypothetical perturbation points $w, u$ separated by $O(1/\sqrt{d})$ along an escaping direction. We claim that if we run GD starting at $w$ and $u$, at least one of the resulting trajectories will escape the saddle point very quickly. This implies that the thickness of the stuck region can be at most $O(1/\sqrt{d})$, so a random perturbation has very little chance to land in the stuck region.</p>
<h2 id="on-the-necessity-of-adding-perturbations">On the Necessity of Adding Perturbations</h2>
<p>We have discussed two possible ways to modify the standard gradient descent algorithm, the first by adding intermittent perturbations, and the second by relying on random initialization. Although the latter exhibits asymptotic convergence, it does not yield efficient convergence in general; in recent <a href="http://arxiv.org/abs/1705.10412">joint work with Simon Du, Jason Lee, Barnabas Poczos, and Aarti Singh</a>, we have shown that even with fairly natural random initialization schemes and non-pathological functions, <strong>GD with only random initialization can be significantly slowed by saddle points, taking exponential time to escape. The behavior of PGD is strikingingly different — it can generically escape saddle points in polynomial time.</strong></p>
<p>To establish this result, we considered random initializations from a very general class including Gaussians and uniform distributions over the hypercube, and we constructed a smooth objective function that satisfies both Assumptions 1 and 2. This function is constructed such that, even with random initialization, with high probability both GD and PGD have to travel sequentially in the vicinity of $d$ strict saddle points before reaching a local minimum. All strict saddle points have only one direction of escape. (See the left graph for the case of $d=2$).</p>
<p><img src="/assets/saddle_eff/necesperturbation.png" alt="NecessityPerturbation" /></p>
<p>When GD travels in the vicinity of a sequence of saddle points, it can get closer and closer to the later saddle points, and thereby take longer and longer to escape. Indeed, the time to escape the $i$th saddle point scales as $e^{i}$. On the other hand, PGD is always able to escape any saddle point in a small number of steps independent of the history. This phenomenon is confirmed by our experiments; see, for example, an experiment with $d=10$ in the right graph.</p>
<h2 id="conclusion">Conclusion</h2>
<p>In this post, we have shown that a perturbed form of gradient descent can converge to a second-order-stationary point at almost the same rate as standard gradient descent converges to a first-order-stationary point. This implies that Hessian information is not necessary for to escape saddle points efficiently, and helps to explain why basic gradient-based algorithms such as GD (and SGD) work surprisingly well in the nonconvex setting. This new line of sharp convergence results can be directly applied to nonconvex problem such as matrix sensing/completion to establish efficient global convergence rates.</p>
<p>There are of course still many open problems in general nonconvex optimization. To name a few: will adding momentum improve the convergence rate to a second-order stationary point? What type of local minima are tractable and are there useful structural assumptions that we can impose on local minima so as to avoid local minima efficiently? We are making slow but steady progress on nonconvex optimization, and there is the hope that at some point we will transition from “black art” to “science”.</p>
Wed, 19 Jul 2017 03:00:00 -0700
http://offconvex.github.io/2017/07/19/saddle-efficiency/
http://offconvex.github.io/2017/07/19/saddle-efficiency/