Off the convex pathAlgorithms off the convex path.
http://offconvex.github.io/
The search for biologically plausible neural computation: A similarity-based approach<p>This is the second post in a series reviewing recent progress in designing artificial neural networks (NNs) that resemble natural NNs not just superficially, but on a deeper, algorithmic level. In addition to serving as models of natural NNs, such networks can serve as general-purpose machine learning algorithms. Respecting biological constraints, viewed naively as a handicap in developing competitive general-purpose machine learning algorithms, can instead facilitate the development of artificial NNs by restricting the search space of possible algorithms.</p>
<p>In <a href="http://www.offconvex.org/2016/11/03/MityaNN1/">the previous post</a>, we focused on the constraints that must be met for an unsupervised algorithm to be biologically plausible. For an algorithm to be implementable as a NN, it must be formulated in the online setting. In the corresponding NN, synaptic weight updates must be local, i.e. depend on the activity of only two neurons that the synapse connects. Then, we demonstrated that deriving NNs for dimensionality reduction in a conventional way - by minimizing the reconstruction error - results in multi-neuron networks with biologically implausible non-local learning rules.</p>
<p>In this post, we propose a different objective function which we term similarity matching. From this objective function, we derive an online algorithm implementable by a NN with local learning rules. Then, we introduce other similarity-based algorithms which include more biological features such as different neuron classes and nonlinear activation functions. Finally, we review similarity-matching algorithms with state-of-the-art performance.</p>
<h2 id="similarity-matching-objective-function">Similarity-matching objective function</h2>
<p>We start by stating an objective function that will be used to derive NNs for linear dimensionality reduction. Let ${\bf x}_t\in \mathbb{R}^n$, $t = 1,\ldots T$, be a set of data points (inputs) and ${\bf y}_t\in \mathbb{R}^k$, $t = 1,\ldots T$, ($k < n$) be their learned representation (outputs). The similarity of a pair of inputs, ${\bf x}$<sub><small>$t$</small></sub> and ${\bf x}$<sub><small>$t’$</small></sub>, can be defined as their dot-product, ${\bf x}$<sub><small>$t$</small></sub>${}^\top {\bf x}$<sub><small>$t’$</small></sub>. Analogously, the similarity of a pair of outputs is ${\bf y}$<sub><small>$t$</small></sub>${}^\top {\bf y}$<sub><small>$t’$</small></sub>. Similarity matching, as its name suggests, learns a representation where the similarity between each pair of outputs matches that of the corresponding inputs:</p>
<script type="math/tex; mode=display">\min_{ {\bf y}_1,\ldots,{\bf y}_T} \frac{1}{T^2} \sum_{t=1}^T \sum_{t'=1}^T \left({\bf x}_t^\top {\bf x}_{t'} - {\bf y}_t^\top {\bf y}_{t'} \right)^2. \qquad\qquad (2.1)</script>
<p>This offline objective function, <a href="https://en.wikipedia.org/wiki/Multidimensional_scaling">previously employed for multidimensional scaling</a>, is optimized by the projections of inputs onto the principal subspace of their covariance, i.e. performing PCA up to an orthogonal rotation. Moreover, <a href="http://papers.nips.cc/paper/6048-matrix-completion-has-no-spurious-local-minimum">(2.1) has no local minima other than the principal subspace solution</a>.</p>
<p>The similarity-matching objective (2.1) may seem like a strange choice for deriving an online algorithm implementable by a NN. In the online setting, inputs are streamed to the algorithm sequentially and each output must be computed before the next input arrives. Yet, in (2.1), pairs of inputs and outputs from different time points interact with each other. In addition, whereas ${\bf x}_t$ and ${\bf y}_t$ could be interpreted as inputs and outputs to a network, unlike in <a href="http://www.offconvex.org/2016/11/03/MityaNN1/">the reconstruction approach (1.4)</a>, synaptic weights do not appear explicitly in (2.1).</p>
<h2 id="variable-substitution-trick">Variable substitution trick</h2>
<p>Both of the above concerns can be resolved by a <a href="https://www.researchgate.net/publication/315570715_Why_Do_Similarity_Matching_Objectives_Lead_to_HebbianAnti-Hebbian_Networks">simple math trick akin to completing the square</a>. We first focus on the cross-term in (2.1), which we call similarity alignment. By re-ordering the variables and introducing a new variable, ${\bf W} \in \mathbb{R}^{k\times n}$, we obtain:</p>
<script type="math/tex; mode=display">- \frac{1}{T^2}\sum_{t=1}^T \sum_{t'=1}^T {\bf y}_{t}^\top {\bf y}_{t'} {\bf x}_t^\top {\bf x}_{t'}= - \frac{1}{T^2}\sum_{t=1}^T {\bf y}_{t}^\top \left[ \sum_{t'=1}^T {\bf y}_{t'} {\bf x}_{t'}^\top \right ] {\bf x}_t</script>
<script type="math/tex; mode=display">\qquad \qquad \qquad \qquad\qquad \qquad \qquad \quad = \min_{ {\bf W} \in \mathbb{R}^{k\times n}} -\frac{2}{T} \sum_{t=1}^T{\bf y}_t^\top {\bf W} {\bf x}_t + {\rm Tr } {\bf W}^\top {\bf W}.\; (2.2)</script>
<p>To prove the second identity, find optimal ${\bf W}$ by taking a derivative of the expression on the right with respect to ${\bf W}$ and setting it to zero, and then substitute the optimal ${\bf W}$ back into the expression. Similarly, for the quartic ${\bf y}_t$ term in (2.1):</p>
<script type="math/tex; mode=display">\frac{1}{T^2}\sum_{t=1}^T \sum_{t'=1}^T {\bf y}_{t}^\top {\bf y}_{t'} {\bf y}_t^\top {\bf y}_{t'}= \frac{1}{T^2}\sum_{t=1}^T {\bf y}_{t}^\top \left[ \sum_{t'=1}^T {\bf y}_{t'} {\bf y}_{t'}^\top \right ] {\bf y}_t</script>
<script type="math/tex; mode=display">\qquad \qquad \qquad \qquad\qquad \qquad \qquad \quad = \max_{ {\bf M} \in \mathbb{R}^{k\times k}} \frac{2}{T} \sum_{t=1}^T{\bf y}_t^\top {\bf M} {\bf y}_t - {\rm Tr } {\bf M}^\top {\bf M}.\quad(2.3)</script>
<p>By substituting (2.2) and (2.3) into (2.1) we get:</p>
<script type="math/tex; mode=display">\min_{ {\bf W}\in \mathbb{R}^{k\times n}}\max_{ {\bf M}\in \mathbb{R}^{k\times k}} \frac{1}{T} \sum_{t=1}^T \left[2 {\rm Tr}\left({\bf W}^\top{\bf W}\right) - {\rm Tr}\left({\bf M}^\top{\bf M}\right) + \min_{ {\bf y}_t\in \mathbb{R}^{k\times 1}} l_t({\bf W},{\bf M},{\bf y}_t)\right],\; (2.4)</script>
<p>where</p>
<script type="math/tex; mode=display">l_t({\bf W},{\bf M},{\bf y}_t)=-4{\bf x}_t^\top{\bf W}{\bf y}_t + 2{\bf y}_t^\top{\bf M}{\bf y}_t.\qquad\qquad (2.5)</script>
<p>In the resulting objective function, (2.4),(2.5), optimal outputs at different time steps can be computed independently, making the problem amenable to an online algorithm. The price paid for this simplification is the appearance of the minimax optimization problem in variables, ${\bf W}$ and ${\bf M}$. Minimization over ${\bf W}$ aligns output channels with the greatest variance directions of the input and maximization over ${\bf M}$ diversifies the output channels. The competition between the two in a gradient descent/ascent algorithm results in the principal subspace projection which is <a href="https://www.researchgate.net/publication/315570715_Why_Do_Similarity_Matching_Objectives_Lead_to_HebbianAnti-Hebbian_Networks">the only stable fixed point of the corresponding dynamics</a>.</p>
<h2 id="online-algorithm-and-neural-network">Online algorithm and neural network</h2>
<p>Now, we are ready to derive an algorithm for optimizing (2.1) online. First, by minimizing (2.5) with respect to ${\bf y}_t$ while keeping ${\bf W}$ and ${\bf M}$ fixed we get the dynamics for the output variables :</p>
<script type="math/tex; mode=display">\dot{\bf y}_t={\bf W} {\bf x}_t-{\bf M} {\bf y}_t.\qquad\qquad (2.6)</script>
<p>To find ${\bf y}_t$ after the presentation of the corresponding input, ${\bf x}_t$, (2.6) is iterated until convergence.</p>
<p>After the convergence of ${\bf y}_t$ we update ${\bf W}$ and ${\bf M}$ by gradient descent of (2.2) and gradient ascent of (2.3) respectively:</p>
<script type="math/tex; mode=display">W_{ij} \leftarrow W_{ij} + \eta \left(y_ix_j-W_{ij}\right), \qquad M_{ij} \leftarrow M_{ij} + \eta \left(y_iy_j-M_{ij}\right). \qquad (2.7)</script>
<p>Algorithm (2.6),(2.7), first derived <a href="https://www.researchgate.net/publication/273003026_A_HebbianAnti-Hebbian_Neural_Network_for_Linear_Subspace_Learning_A_Derivation_from_Multidimensional_Scaling_of_Streaming_Data">here</a>, can be naturally implemented by a biologically plausible NN, Figure 1. Here, activity (firing rate) of the upstream neurons corresponds to input variables. Output variables are computed by the dynamics of activity (2.6) in a single layer of neurons. Variables ${\bf W}$ and ${\bf M}$ are represented by the weights of synapses in feedforward and lateral connections respectively. The learning rules (2.7) are local, i.e. the weight update, $\Delta W_{ij}$, for the synapse between $j^{\rm th}$ input neuron and $i^{\rm th}$ output neuron depends only on the activities, $x_j$, of $j^{\rm th}$ input neuron and, $y_i$, of $i^{\rm th}$ output neuron, and the synaptic weight. In neuroscience, learning rules (2.7) for ${\bf W}$ and ${\bf M}$ are called Hebbian and anti-Hebbian respectively.</p>
<p><img src="https://drive.google.com/uc?export=view&id=18_7ApBztw00VwUn-Y81kOC4s8XTuXVVN" alt="" /></p>
<p><em>Figure 1: A Hebbian/Anti-Hebbian network derived from similarity matching.</em></p>
<p>To summarize, starting with the similarity-matching objective, we derived a Hebbian/anti-Hebbian NN for dimensionality reduction. The minimax objective can be viewed as a zero-sum game played by the weights of feedforward and lateral connections. This demonstrates that synapses with local updates can still collectively work together to optimize a global objective. A similar, although not identical, NN was proposed by Foldiak heuristically. The advantage of our normative approach is that the offline solution is known. Although no proof of convergence exists in the online setting, algorithm (2.6),(2.7) performs well in practice.</p>
<h2 id="other-similarity-based-objectives-and-linear-networks">Other similarity-based objectives and linear networks</h2>
<p>We used the same framework to derive NNs for other computational tasks and incorporating more biological features. As the algorithm (2.6),(2.7) and the NN in Figure 1 were derived from the similarity-matching objective (2.1), they project data onto the principal subspace but do not necessarily recover principal components <em>per se</em>. To derive PCA algorithms we modified the objective function (2.1), <a href="https://arxiv.org/abs/1511.09468">here</a> and <a href="https://arxiv.org/abs/1810.06966">here</a>, to encourage orthogonality of ${\bf W}$. Such algorithms are implemented by NNs of the same architecture as in Figure 1 but with slightly different learning rules.</p>
<p>Although the similarity-matching NN in Figure 1 relies on biologically plausible local learning rules, it lacks biological realism in several other ways. For example, computing output requires recurrent activity that must settle faster than the time scale of the input variation, which is unlikely in biology. To respect this biological constraint, <a href="https://arxiv.org/abs/1810.06966">we modified</a> the dimensionality reduction algorithm to avoid recurrency.</p>
<p>Another non-biological feature of the NN in Figure 1 is that the output neurons compete with each other by communicating via lateral connections. In biology, such interactions are not direct but mediated by interneurons. To reflect these observations, we modified the objective function by introducing a whitening constraint:</p>
<script type="math/tex; mode=display">\min_{ {\bf y}_1,\ldots,{\bf y}_T} - \frac{1}{T^2}\sum_{t=1}^T \sum_{t'=1}^T {\bf y}_{t}^\top {\bf y}_{t'} {\bf x}_t^\top {\bf x}_{t'}, \qquad {\rm s.t.} \quad \frac 1T \sum_{t=1}^T {\bf y}_t {\bf y}_t^\top = {\bf I}_k, \qquad (2.8)</script>
<p>where ${\bf I}$<sub><small>$k$</small></sub> is the $k$-by-$k$ identity matrix. Then, by representing the whitening constraint using Lagrange relaxation, <a href="http://papers.nips.cc/paper/5885-a-normative-theory-of-adaptive-dimensionality-reduction-in-neural-networks">we derived NNs</a> where interneurons appear naturally - their activity is modeled by the Lagrange multipliers, ${\bf z} _t^\top {\bf z} _{t’}$ (Figure 2):</p>
<script type="math/tex; mode=display">\min_{ {\bf y}_1,\ldots,{\bf y}_T} \max_{ {\bf z}_1,\ldots,{\bf z}_T } - \frac{1}{T^2}\sum_{t=1}^T \sum_{t'=1}^T {\bf y}_{t}^\top {\bf y}_{t'} {\bf x}_t^\top {\bf x}_{t'} + \frac{1}{T^2}\sum_{t=1}^T \sum_{t'=1}^T {\bf z}_{t}^\top {\bf z}_{t'} \left({\bf y}_{t}^\top {\bf y}_{t'} -\delta_{t,t'}\right). \; (2.9)</script>
<p>Notice how (2.9) contains the ${\bf y}$-${\bf z}$ similarity-alignment term similar to (2.2). We can now derive learning rules for the ${\bf y}$-${\bf z}$ connections using the variable substitution trick, leading to the network in Figure 2. For details of this and other NN derivations, see <a href="http://papers.nips.cc/paper/5885-a-normative-theory-of-adaptive-dimensionality-reduction-in-neural-networks">here</a>.</p>
<p><img src="https://drive.google.com/uc?export=view&id=1tYmjxDN2SUZY-8--uV_qGTb_gEIg0O8C" alt="" /></p>
<p><em>Figure 2: A biologically-plausible NN for whitening inputs, derived from a constrained similarity-alignment cost function.</em></p>
<h2 id="nonnegative-similarity-matching-objective-and-a-nonlinear-network">Nonnegative similarity-matching objective and a nonlinear network</h2>
<p>So far, we considered similarity-based NNs with linear neurons. However, biological neurons are not linear and many interesting computations require nonlinearity. A resolution to this discrepancy was suggested by the observation that the output of biological neurons is nonnegative (firing rate cannot be below zero). Hence, we modified the optimization problem by requiring that the output of the similarity-matching cost function (2.1) is nonnegative:</p>
<script type="math/tex; mode=display">\min_{ {\bf y}_1,\ldots,{\bf y}_T \geq 0} \frac{1}{T^2} \sum_{t=1}^T \sum_{t'=1}^T \left({\bf x}_t^\top {\bf x}_{t'} - {\bf y}_t^\top {\bf y}_{t'} \right)^2. \qquad\qquad (2.10)</script>
<p>Solutions of the optimization problem (2.10) are very different from PCA: They can <a href="https://arxiv.org/abs/1503.00680">cluster well-segregated data and extract sparse features from data</a>. Understanding the nature of these solutions will be the topic of the next post. For now, we note that (2.10) can be solved by the same online algorithm as (2.1) except that the output variables are projected onto the nonnegative domain. Such algorithm maps onto the same network as Figure 1 but with rectifying neurons (ReLUs), Figure 3A.</p>
<p><img src="https://drive.google.com/uc?export=view&id=1QahRt7mzirInGnmuSJh-4R4b88q9gyRq" alt="" /></p>
<p><em>Figure 3: A) A nonlinear Hebbian/Anti-Hebbian network derived from nonnegative similarity matching. B) Stacked network for NICA. NSM - nonnegative similarity-matching.</em></p>
<p>Another problem solved by similarity-based networks is <a href="https://www.researchgate.net/publication/317300210_Blind_Nonnegative_Source_Separation_Using_Biological_Neural_Networks">the nonnegative independent component analysis</a> (NICA) which can be used for blind source separation. The problem is to recover independent and nonnegative sources from observing only their linear mixture. <a href="https://www.researchgate.net/publication/3342750_Conditions_for_nonnegative_independent_component_analysis">Plumbley showed</a> that NICA can be solved in two steps, Figure 4. First, whiten the data to obtain an orthogonal rotation of the sources. Second, find an orthogonal rotation of the whitened sources that yields a nonnegative output, Figure 4. The first step can be implemented by the whitening network in Figure 2. The second step can be implemented by the nonnegative similarity-matching network, Figure 3A, because an orthogonal rotation does not affect dot-product similarities. Therefore, <a href="https://www.researchgate.net/publication/317300210_Blind_Nonnegative_Source_Separation_Using_Biological_Neural_Networks">NICA is solved by stacking the whitening and the nonnegative similarity-matching networks</a>, Figure 3B.</p>
<p><img src="https://drive.google.com/uc?export=view&id=16uJTV1QQpwcZcueXLCqNJieKzyWk0ZYd" alt="" /></p>
<p><em>Figure 4: Illustration of the nonnegative independent component analysis algorithm. Two source channels (left) are linearly transformed to a two-dimensional mixture, which are the inputs to the algorithm (middle). Whitening (right) yields an orthogonal rotation of the sources. Sources are then recovered by solving the nonnegative similarity-matching problem. Green and red plus signs track two source vectors across mixing and whitening stages.</em></p>
<h2 id="similarity-based-algorithms-as-general-purpose-tools">Similarity-based algorithms as general-purpose tools</h2>
<p>While the derivation of the similarity-matching algorithm was motivated by constraints imposed by biology, the resulting algorithm performs well on large-scale data. <a href="https://arxiv.org/abs/1808.02083">A recent paper</a> introduced an efficient modification of the similarity-matching algorithm and demonstrated its competitiveness with the state-of-the-art principal subspace projection algorithms in both processing speed and convergence rate. A package with implementations of these algorithms is <a href="https://github.com/flatironinstitute/online_psp">here</a> and <a href="https://github.com/flatironinstitute/online_psp_matlab">here</a>.</p>
<p>In this blog post, we introduced linear and non-linear similarity-matching NNs that can serve as models of biological NNs and as general-purpose machine-learning tools. In the next post, we will discuss the nature of the solutions to nonnegative similarity-based networks.</p>
Mon, 03 Dec 2018 15:30:00 -0800
http://offconvex.github.io/2018/12/03/MityaNN2/
http://offconvex.github.io/2018/12/03/MityaNN2/Understanding optimization in deep learning by analyzing trajectories of gradient descent<p>Neural network optimization is fundamentally non-convex, and yet simple gradient-based algorithms seem to consistently solve such problems.
This phenomenon is one of the central pillars of deep learning, and forms a mystery many of us theorists are trying to unravel.
In this post I’ll survey some recent attempts to tackle this problem, finishing off with a discussion on my <a href="https://arxiv.org/pdf/1810.02281.pdf">new paper with Sanjeev Arora, Noah Golowich and Wei Hu</a>, which for the case of gradient descent over deep linear neural networks, provides a guarantee for convergence to global minimum at a linear rate.</p>
<h2 id="landscape-approach-and-its-limitations">Landscape Approach and Its Limitations</h2>
<p>Many papers on optimization in deep learning implicitly assume that a rigorous understanding will follow from establishing geometric properties of the loss <em>landscape</em>, and in particular, of <em>critical points</em> (points where the gradient vanishes).
For example, through an analogy with the spherical spin-glass model from condensed matter physics, <a href="http://proceedings.mlr.press/v38/choromanska15.pdf">Choromanska et al. 2015</a> argued for what has become a colloquial conjecture in deep learning:</p>
<blockquote>
<p><strong>Landscape Conjecture:</strong>
In neural network optimization problems, suboptimal critical points are very likely to have negative eigenvalues to their Hessian.
In other words, there are almost <em>no poor local minima</em>, and nearly all <em>saddle points are strict</em>.</p>
</blockquote>
<p>Strong forms of this conjecture were proven for loss landscapes of various simple problems involving <strong>shallow</strong> (two layer) models, e.g. <a href="https://papers.nips.cc/paper/6271-global-optimality-of-local-search-for-low-rank-matrix-recovery.pdf">matrix sensing</a>, <a href="https://papers.nips.cc/paper/6048-matrix-completion-has-no-spurious-local-minimum.pdf">matrix completion</a>, <a href="http://proceedings.mlr.press/v40/Ge15.pdf">orthogonal tensor decomposition</a>, <a href="https://arxiv.org/pdf/1602.06664.pdf">phase retrieval</a>, and <a href="http://proceedings.mlr.press/v80/du18a/du18a.pdf">neural networks with quadratic activation</a>.
There was also work on establishing convergence of gradient descent to global minimum when the Landscape Conjecture holds, as described in the excellent posts on this blog by <a href="http://www.offconvex.org/2016/03/22/saddlepoints/">Rong Ge</a>, <a href="http://www.offconvex.org/2016/03/24/saddles-again/">Ben Recht</a> and <a href="http://www.offconvex.org/2017/07/19/saddle-efficiency/">Chi Jin and Michael Jordan</a>.
They describe how gradient descent can arrive at a second order local minimum (critical point whose Hessian is positive semidefinite) by escaping all strict saddle points, and how this process is efficient given that perturbations are added to the algorithm.
Note that under the Landscape Conjecture, i.e. when there are no poor local minima and non-strict saddles, second order local minima are also global minima.</p>
<p style="text-align:center;">
<img src="/assets/optimization-beyond-landscape-points.png" width="100%" alt="Local minima and saddle points" />
</p>
<p>However, it has become clear that the landscape approach (and the Landscape Conjecture) cannot be applied as is to <strong>deep</strong> (three or more layer) networks, for several reasons.
First, deep networks typically induce non-strict saddles (e.g. at the point where all weights are zero, see <a href="https://papers.nips.cc/paper/6112-deep-learning-without-poor-local-minima.pdf">Kawaguchi 2016</a>).
Second, a landscape perspective largely ignores algorithmic aspects that empirically are known to greatly affect convergence with deep networks — for example the <a href="http://proceedings.mlr.press/v28/sutskever13.html">type of initialization</a>, or <a href="http://proceedings.mlr.press/v37/ioffe15.pdf">batch normalization</a>.
Finally, as I argued in my <a href="http://www.offconvex.org/2018/03/02/acceleration-overparameterization/">previous blog post</a>, based upon <a href="http://proceedings.mlr.press/v80/arora18a/arora18a.pdf">work with Sanjeev Arora and Elad Hazan</a>, adding (redundant) linear layers to a classic linear model can sometimes accelerate gradient-based optimization, without any gain in expressiveness, and despite introducing non-convexity to a formerly convex problem.
Any landscape analysis that relies on properties of critical points alone will have difficulty explaining this phenomenon, as through such lens, nothing is easier to optimize than a convex objective with a single critical point which is the global minimum.</p>
<h2 id="a-way-out">A Way Out?</h2>
<p>The limitations of the landscape approach for analyzing optimization in deep learning suggest that it may be abstracting away too many important details.
Perhaps a more relevant question than “is the landscape graceful?” is “what is the behavior of specific optimizer <strong>trajectories</strong> emanating from specific initializations?”.</p>
<p style="text-align:center;">
<img src="/assets/optimization-beyond-landscape-trajectories.png" width="66%" alt="Different trajectories lead to qualitatively different results" />
</p>
<p>While the trajectory-based approach is seemingly much more burdensome than landscape analyses, it is already leading to notable progress.
Several recent papers (e.g. <a href="http://proceedings.mlr.press/v70/brutzkus17a/brutzkus17a.pdf">Brutzkus and Globerson 2017</a>; <a href="https://papers.nips.cc/paper/6662-convergence-analysis-of-two-layer-neural-networks-with-relu-activation.pdf">Li and Yuan 2017</a>; <a href="http://proceedings.mlr.press/v70/zhong17a/zhong17a.pdf">Zhong et al. 2017</a>; <a href="http://proceedings.mlr.press/v70/tian17a/tian17a.pdf">Tian 2017</a>; <a href="https://openreview.net/pdf?id=rJ33wwxRb">Brutzkus et al. 2018</a>; <a href="http://proceedings.mlr.press/v75/li18a/li18a.pdf">Li et al. 2018</a>; <a href="https://arxiv.org/pdf/1806.00900.pdf">Du et al. 2018</a>; <a href="http://romaincouillet.hebfree.org/docs/conf/nips_GDD.pdf">Liao et al. 2018</a>) have adopted this strategy, successfully analyzing different types of shallow models.
Moreover, trajectory-based analyses are beginning to set foot beyond the realm of the landscape approach — for the case of linear neural networks, they have successfully established convergence of gradient descent to global minimum under <strong>arbitrary depth</strong>.</p>
<h2 id="trajectory-based-analyses-for-deep-linear-neural-networks">Trajectory-Based Analyses for Deep Linear Neural Networks</h2>
<p>Linear neural networks are fully-connected neural networks with linear (no) activation.
Specifically, a depth $N$ linear network with input dimension $d_0$, output dimension $d_N$, and hidden dimensions $d_1,d_2,\ldots,d_{N-1}$, is a linear mapping from $\mathbb{R}^{d_0}$ to $\mathbb{R}^{d_N}$ parameterized by $x \mapsto W_N W_{N-1} \cdots W_1 x$, where $W_j \in \mathbb{R}^{d_j \times d_{j-1}}$ is regarded as the weight matrix of layer $j$.
Though trivial from a representational perspective, linear neural networks are, somewhat surprisingly, complex in terms of optimization — they lead to non-convex training problems with multiple minima and saddle points.
Being viewed as a theoretical surrogate for optimization in deep learning, the application of gradient-based algorithms to linear neural networks is receiving significant attention these days.</p>
<p>To my knowledge, <a href="https://arxiv.org/pdf/1312.6120.pdf">Saxe et al. 2014</a> were the first to carry out a trajectory-based analysis for deep (three or more layer) linear networks, treating gradient flow (gradient descent with infinitesimally small learning rate) minimizing $\ell_2$ loss over whitened data.
Though a very significant contribution, this analysis did not formally establish convergence to global minimum, nor treat the aspect of computational complexity (number of iterations required to converge).
The recent work of <a href="http://proceedings.mlr.press/v80/bartlett18a.html">Bartlett et al. 2018</a> makes progress towards addressing these gaps, by applying a trajectory-based analysis to gradient descent for the special case of linear residual networks, i.e. linear networks with uniform width across all layers ($d_0=d_1=\cdots=d_N$) and identity initialization ($W_j=I$, $\forall j$).
Considering different data-label distributions (which boil down to what they refer to as “targets”), Bartlett et al. demonstrate cases where gradient descent provably converges to global minimum at a linear rate — loss is less than $\epsilon>0$ from optimum after $\mathcal{O}(\log\frac{1}{\epsilon})$ iterations — as well as situations where it fails to converge.</p>
<p>In a <a href="https://arxiv.org/pdf/1810.02281.pdf">new paper with Sanjeev Arora, Noah Golowich and Wei Hu</a>, we take an additional step forward in virtue of the trajectory-based approach.
Specifically, we analyze trajectories of gradient descent for any linear neural network that does not include “bottleneck layers”, i.e. whose hidden dimensions are no smaller than the minimum between the input and output dimensions ($d_j \geq \min\{d_0,d_N\}$, $\forall j$), and prove convergence to global minimum, at a linear rate, provided that initialization meets the following two conditions:
<em>(i)</em> <em>approximate balancedness</em> — $W_{j+1}^\top W_{j+1} \approx W_j W_j^\top$, $\forall j$;
and <em>(ii)</em> <em>deficiency margin</em> — initial loss is smaller than the loss of any rank deficient solution.
We show that both conditions are necessary, in the sense that violating any one of them may lead to a trajectory that fails to converge.
Approximate balancedness at initialization is trivially met in the special case of linear residual networks, and also holds for the customary setting of initialization via small random perturbations centered at zero.
The latter also leads to deficiency margin with positive probability.
For the case $d_N=1$, i.e. scalar regression, we provide a random initialization scheme under which both conditions are met, and thus convergence to global minimum at linear rate takes place, with constant probability.</p>
<p>Key to our analysis is the observation that if weights are initialized to be approximately balanced, they will remain that way throughout the iterations of gradient descent.
In other words, trajectories taken by the optimizer adhere to a special characterization:</p>
<blockquote>
<p><strong>Trajectory Characterization:</strong> <br /></p>
<p><script type="math/tex">W_{j+1}^\top(t) W_{j+1}(t) \approx W_j(t) W_j^\top(t), \quad j=1,\ldots,N-1, \quad t=0,1,\ldots</script>,</p>
</blockquote>
<p>which means that throughout the entire timeline, all layers have (approximately) the same set of singular values, and the left singular vectors of each layer (approximately) coincide with the right singular vectors of the layer that follows.
We show that this regularity implies steady progress for gradient descent, thereby demonstrating that even in cases where the loss landscape is complex as a whole (includes many non-strict saddle points), it may be particularly well-behaved around the specific trajectories taken by the optimizer.</p>
<h2 id="conclusion">Conclusion</h2>
<p>Tackling the question of optimization in deep learning through the landscape approach, i.e. by analyzing the geometry of the objective independently of the algorithm used for training, is conceptually appealing.
However this strategy suffers from inherent limitations, predominantly as it requires the entire objective to be graceful, which seems to be too strict of a demand.
The alternative approach of taking into account the optimizer and its initialization, and focusing on the landscape only along the resulting trajectories, is gaining more and more traction.
While landscape analyses have thus far been limited to shallow (two layer) models only, the trajectory-based approach has recently treated arbitrarily deep models, proving convergence of gradient descent to global minimum at a linear rate.
Much work however remains to be done, as this success covered only linear neural networks.
I expect the trajectory-based approach to be key in developing our formal understanding of gradient-based optimization for deep non-linear networks as well.</p>
<p><a href="http://www.cohennadav.com/">Nadav Cohen</a></p>
Wed, 07 Nov 2018 04:00:00 -0800
http://offconvex.github.io/2018/11/07/optimization-beyond-landscape/
http://offconvex.github.io/2018/11/07/optimization-beyond-landscape/Simple and efficient semantic embeddings for rare words, n-grams, and language features<p>Distributional methods for capturing meaning, such as word embeddings, often require observing many examples of words in context. But most humans can infer a reasonable meaning from very few or even a single occurrence. For instance, if we read “Porgies live in shallow temperate marine waters,” we have a good idea that a <em>porgy</em> is a fish. Since language corpora often have a long tail of “rare words,” it is an interesting problem to imbue NLP algorithms with this capability. This is especially important for n-grams (i.e., ordered n-tuples of words, like “ice cream”), many of which occur rarely in the corpus.</p>
<p>Here we describe a simple but principled approach called <em>à la carte</em> embeddings, described in our <a href="http://aclweb.org/anthology/P18-1002">ACL’18 paper</a> with Yingyu Liang, Tengyu Ma, and Brandon Stewart. It also easily extends to learning embeddings of arbitrary language features such as word-senses and $n$-grams. The paper also combines these with our recent <a href="http://www.offconvex.org/2018/06/25/textembeddings/">deep-learning-free text embeddings</a> to get simple deep-learning free text embeddings with even better performance on downstream classification tasks, quite competitive with deep learning approaches.</p>
<h2 id="inducing-word-embedding-from-their-contexts-a-surprising-linear-relationship">Inducing word embedding from their contexts: a surprising linear relationship</h2>
<p>Suppose a single occurrence of a word $w$ is surrounded by a sequence $c$ of words. What is a reasonable guess for the word embedding $v_w$ of $w$? For convenience, we will let $u_w^c$ denote the average of the word embeddings of words in $c$. Anybody who knows the word2vec method may reasonably guess the following.</p>
<blockquote>
<p><strong>Guess 1:</strong> Up to scaling, $u_w^c$ is a good estimate for $v_w$.</p>
</blockquote>
<p>Unfortunately, this totally fails. Even taking thousands of occurrences of $w$, the average of such estimates stays far from the ground truth embedding $v_w$. The following discovery should therefore be surprising (read below for a theoretical justification):</p>
<blockquote>
<p><strong>Theorem 1</strong> (From <a href="https://transacl.org/ojs/index.php/tacl/article/view/1346">this TACL’18 paper</a>): There is a single matrix $A$ (depending only upon the text corpus) such that $A u_w^c$ is a good estimate for $v_w$.</p>
</blockquote>
<p>Note that the best such $A$ can be found via linear regression by minimizing the average $|Au_w^c -v_w|^2$ over occurrences of frequent words $w$, for which we already have word embeddings.</p>
<p>Once such an $A$ has been learnt from frequent words, the induction of embeddings for new words works very well. As we receive more and more occurrences of $w$ the average of $Au_w^c$ over all sentences containing $w$ has cosine similarity $>0.9$ with the true word embedding $v_w$ (this holds for GloVe as well as word2vec).</p>
<p>Thus the learnt $A$ gives a way to induce embeddings for new words from a few or even a single occurrence. We call this the <em>à la carte</em> embedding of $w$, because we don’t need to pay the <em>prix fixe</em> of re-running GloVe or word2vec on the entire corpus each time a new word is needed.</p>
<h3 id="testing-embeddings-for-rare-words">Testing embeddings for rare words</h3>
<p>Using Stanford’s <a href="https://nlp.stanford.edu/~lmthang/morphoNLM/">Rare Words</a> dataset we created the
<a href="http://nlp.cs.princeton.edu/CRW/"><em>Contextual Rare Words</em></a> dataset where, along with word pairs and human-rated scores, we also provide contexts (i.e., few usages) for the rare words.</p>
<p>We compare the performance of our method with alternatives such as <a href="http://www.offconvex.org/2018/06/17/textembeddings/">top singular component removal and frequency down-weighting</a> and find that <em>à la carte</em> embedding consistently outperforms other methods and requires far fewer contexts to match their best performance.
Below we plot the increase in Spearman correlation with human ratings as the tested algorithms are given more samples of the words in context. We see that given only 8 occurences of the word, the <em>a la carte</em> method outperforms other baselines that’re given 128 occurences.</p>
<p style="text-align:center;">
<img src="/assets/ALCcrwplot.svg" width="60%" />
</p>
<p>Now we turn to the task mentioned in the opening para of this post. <a href="http://aclweb.org/anthology/D17-1030">Herbelot and Baroni</a> constructed a “nonce” dataset consisting of single-word concepts and their Wikipedia definitions, to test algorithms that “simulate the process by which a competent speaker encounters a new word in known contexts.” They tested various methods, including a modified version of word2vec.
As we show in the table below, <em>à la carte</em> embedding outperforms all their methods in terms of the average rank of the target vector’s similarity with the constructed vector. The true word embedding is among the closest 165 or so word vectors to our embedding.
(Note that the vocabulary size exceeds 200K, so this is considered a strong performance.)</p>
<p style="text-align:center;">
<img src="/assets/ALCnonce.svg" width="50%" />
</p>
<h2 id="a-theory-of-induced-embeddings-for-general-features">A theory of induced embeddings for general features</h2>
<p>Why should the matrix $A$ mentioned above exist in the first place?
Sanjeev, Yingyu, and Tengyu’s <a href="https://transacl.org/ojs/index.php/tacl/article/view/1346">TACL’18</a> paper together with Yuanzhi Li and Andrej Risteski gives a justification via a latent-variable model of corpus generation that is a modification of their earlier model described in <a href="https://transacl.org/ojs/index.php/tacl/article/view/742">TACL’16</a> (see also this <a href="http://www.offconvex.org/2016/02/14/word-embeddings-2/">blog post</a>) The basic idea is to consider a random walk over an ellipsoid instead of the unit square.
Under this modification of the rand-walk model, whose approximate MLE objective is similar to that of GloVe, their first theorem shows the following:</p>
<script type="math/tex; mode=display">\exists~A\in\mathbb{R}^{d\times d}\textrm{ s.t. }v_w=A\mathbb{E} \left[\frac{1}{n}\sum\limits_{w'\in c}v_{w'}\bigg|w\in c\right]=A\mathbb{E}v_w^\textrm{avg}~\forall~w</script>
<p>where the expectation is taken over possible contexts $c$.</p>
<p>This result also explains the linear algebraic structure of the embeddings of polysemous words (words having multiple possible meanings, such as <em>tie</em>) discussed in an earlier <a href="http://www.offconvex.org/2016/07/10/embeddingspolysemy/">post</a>.
Assuming for simplicity that $tie$ only has two meanings (<em>clothing</em> and <em>game</em>), it is easy to see that its word embedding is a linear transformation of the sum of the average context vectors of its two senses:</p>
<script type="math/tex; mode=display">v_w=A\mathbb{E}v_w^\textrm{avg}=A\mathbb{E}\left[v_\textrm{clothing}^\textrm{avg}+v_\textrm{game}^\textrm{avg}\right]=A\mathbb{E}v_\textrm{clothing}^\textrm{avg}+A\mathbb{E}v_\textrm{game}^\textrm{avg}</script>
<p>The above also shows that we can get a reasonable estimate for the vector of the sense <em>clothing</em>, and, by extension many other features of interest, by setting $v_\textrm{clothing}=A\mathbb{E}v_\textrm{clothing}^\textrm{avg}$.
Note that this linear method also subsumes other context representations, such as removing the <a href="http://www.offconvex.org/2018/06/17/textembeddings/">top singular component or down-weighting frequent directions</a>.</p>
<h3 id="n-gram-embeddings">$n$-gram embeddings</h3>
<p>While the theory suggests existence of a linear transform between word embeddings and their context embeddings, one could also use this linear transform to induce embeddings for other kinds of linguistic features in context.
We test this hypothesis by inducing embeddings for $n$-grams by using contexts from a large text corpus and word embeddings trained on the same corpus.
A qualitative evaluation of the $n$-gram embeddings is done by finding the closest words to it in terms of cosine similarity between the embeddings.
As evident from the below figure, <em>à la carte</em> bigram embeddings capture the meaning of the phrase better than some other compositional and learned bigram embeddings.</p>
<p style="text-align:center;">
<img src="/assets/ALCngram_quality.png" width="65%" />
</p>
<h3 id="sentence-embeddings">Sentence embeddings</h3>
<p>We also use these $n$-gram embeddings to construct sentence embeddings, similarly to <a href="http://www.offconvex.org/2018/06/25/textembeddings/">DisC embeddings</a>, to evaluate on classification tasks.
A sentence is embedded as the concatenation of sums of embeddings for $n$-gram in the sentence for use in downstream classification tasks.
Using this simple approach we can match the performance of other linear and LSTM representations, even obtaining state-of-the-art results on some of them. Note that Logeswaran and Lee is a contemporary paper that uses deep nets.</p>
<p style="text-align:center;">
<img src="/assets/ALCngram_clf.svg" width="80%" />
</p>
<h2 id="discussion">Discussion</h2>
<p>Our <em>à la carte</em> method is simple, almost elementary, and yet gives results competitive with many other feature embedding methods and also beats them in many cases.
Can one do zero-shot learning of word embeddings, i.e. inducing embeddings for a words/features without any context?
Character level methods such as <a href="https://fasttext.cc/">fastText</a> can do this and it is a good problem to incorporate character level information into the <em>à la carte</em> approach (the few things we tried didn’t work so far).</p>
<p>The <em>à la carte</em> code is <a href="https://github.com/NLPrinceton/ALaCarte">available here</a>, allowing you to re-create the results described.</p>
Tue, 18 Sep 2018 02:00:00 -0700
http://offconvex.github.io/2018/09/18/alacarte/
http://offconvex.github.io/2018/09/18/alacarte/When Recurrent Models Don't Need to be Recurrent<p>In the last few years, deep learning practitioners have proposed a litany of
different sequence models. Although recurrent neural networks were once the
tool of choice, now models like the autoregressive
<a href="https://deepmind.com/blog/wavenet-generative-model-raw-audio/">Wavenet</a> or the
<a href="https://ai.googleblog.com/2017/08/transformer-novel-neural-network.html">Transformer</a>
are replacing RNNs on a diverse set of tasks. In this post, we explore the
trade-offs between recurrent and feed-forward models. Feed-forward models can
offer improvements in training stability and speed, while recurrent models are
strictly more expressive. Intriguingly, this added expressivity does not seem to
boost the performance of recurrent models. Several groups have shown
feed-forward networks can match the results of the best recurrent models on
benchmark sequence tasks. This phenomenon raises an interesting question for
theoretical investigation:</p>
<blockquote>
<p>When and why can feed-forward networks replace recurrent neural networks
without a loss in performance?</p>
</blockquote>
<p>We discuss several proposed answers to this question and highlight our
<a href="https://arxiv.org/abs/1805.10369">recent work</a> that offers an explanation in
terms of a fundamental stability property.</p>
<h1 id="a-tale-of-two-sequence-models">A Tale of Two Sequence Models</h1>
<h2 id="recurrent-neural-networks">Recurrent Neural Networks</h2>
<p>The many variants of recurrent models all have a similar form. The model
maintains a state $h_t$ that summarizes the past sequence of inputs. At each
time step $t$, the state is updated according to the equation
[
h_{t+1} = \phi(h_t, x_t),
]
where $x_t$ is the input at time $t$, $\phi$ is a differentiable map, and $h_0$
is an initial state. In a vanilla recurrent neural network, the model is
parameterized by matrices $W$ and $U$, and the state is updated according to
[
h_{t+1} = \tanh(Wh_t + Ux_t).
]
In practice, the <a href="http://colah.github.io/posts/2015-08-Understanding-LSTMs/">Long Short-Term Memory
(LSTM)</a> network is
more frequently used. In either case, to make predictions, the state is passed
to a function $f$, and the model predicts $y_t = f(h_t)$. Since the state $h_t$
is a function of all of the past inputs $x_0, \dots, x_t$, the prediction $y_t$
depends on the entire history $x_0, \dots, x_t$ as well.</p>
<p>A recurrent model can also be represented graphically.</p>
<p style="text-align:center;">
<img src="/assets/approx_recurrent/recurrent_net.png" width="500px" height="250px" />
</p>
<p>Recurrent models are fit to data using backpropagation. However, backpropagating
gradients from time step $T$ to time step $0$ often requires infeasibly large
amounts of memory, so essentially every implementation of a recurrent model
<em>truncates</em> the model and only backpropagates gradient $k$ times steps.</p>
<figure>
<p style="text-align:center;">
<img src="/assets/approx_recurrent/truncated_backprop.png" />
</p>
<figcaption>
<small>
Source: <a href="https://r2rt.com/styles-of-truncated-backpropagation.html">
https://r2rt.com/styles-of-truncated-backpropagation.html </a>
</small>
</figcaption>
</figure>
<p>In this setup, the predictions of the recurrent model still depend on the entire
history $x_0, \dots, x_T$. However, it’s not clear how this training procedure
affects the model’s ability to learn long-term patterns, particularly those that
require more than $k$ steps.</p>
<h2 id="autoregressive-feed-forward-models">Autoregressive, Feed-Forward Models</h2>
<p>Instead of making predictions from a state that depends on the entire history,
an autoregressive model directly predicts $y_t$ using only the $k$ most recent
inputs, $x_{t-k+1}, \dots, x_{t}$. This corresponds to a strong <em>conditional
independence</em> assumption. In particular, a feed-forward model assumes the target
only depends on the $k$ most recent inputs. Google’s
<a href="https://arxiv.org/abs/1609.03499">WaveNet</a> nicely illustrates this general
principle.</p>
<figure>
<p style="text-align:center;">
<img src="https://storage.googleapis.com/deepmind-live-cms/documents/BlogPost-Fig2-Anim-160908-r01.gif" />
</p>
<figcaption>
<small>
Source: <a href="https://deepmind.com/blog/wavenet-generative-model-raw-audio/">
https://deepmind.com/blog/wavenet-generative-model-raw-audio/</a>
</small>
</figcaption>
</figure>
<p>In contrast to an RNN, the limited context of a feed-forward model means that it
cannot capture patterns that extend more than $k$ steps. However, using
techniques like dilated-convolutions, one can make $k$ quite large.</p>
<h1 id="why-care-about-feed-forward-models">Why Care About Feed-Forward Models?</h1>
<p>At the outset, recurrent models appear to be a strictly more flexible and
expressive model class than feed-forward models. After all, feed-forward
networks make a strong conditional independence assumption that recurrent models
don’t make. Even if feed-forward models are less expressive, there are still
several reasons one might prefer a feed-forward network.</p>
<ul>
<li><strong>Parallelization</strong>: Convolutional feed-forward models are easier to <a href="https://arxiv.org/abs/1705.03122">parallelize
at training time</a>.
There’s no hidden state to update and maintain, and
therefore no sequential dependencies between outputs. This allows very
efficient implementations of training on modern hardware.</li>
<li><strong>Trainability</strong>: Training deep convolutional neural networks is the
bread-and-butter of deep learning. Whereas recurrent models are often more
finicky and difficult to <a href="https://arxiv.org/abs/1211.5063">optimize</a>,
significant effort has gone into designing architectures and software to
efficiently and reliably train deep feed-forward networks.</li>
<li><strong>Inference Speed</strong>: In some cases, feed-forward models can be significantly
more light-weight and perform <a href="https://arxiv.org/abs/1211.5063">inference faster than similar recurrent
systems</a>. In other cases,
particularly for long sequences, autoregressive inference is a large
bottleneck and requires <a href="https://arxiv.org/abs/1702.07825">significant engineering
work</a> or <a href="https://arxiv.org/abs/1711.10433">significant
cleverness</a> to overcome.</li>
</ul>
<h1 id="feed-forward-models-can-outperform-recurrent-models">Feed-Forward Models Can Outperform Recurrent Models</h1>
<p>Although it appears trainability and parallelization for feed-forward models
comes at the price of reduced accuracy, there have been several recent examples
showing that feed-forward networks can actually achieve the same accuracies as
their recurrent counterparts on benchmark tasks.</p>
<ul>
<li>
<p><strong>Language Modeling.</strong>
In language modeling, the goal is to predict the next word in a document given
all of the previous words. Feed-forward models make predictions using only the
$k$ most recent words, whereas recurrent models can potentially use the entire
document. The <a href="https://arxiv.org/abs/1612.08083">Gated-Convolutional Language
Model</a> is a feed-forward autoregressive models
that is competitive with <a href="https://arxiv.org/abs/1602.02410">large LSTM baseline
models</a>. Despite using a truncation length of
$k=25$, the model outperforms a large LSTM on the
<a href="https://einstein.ai/research/the-wikitext-long-term-dependency-language-modeling-dataset">Wikitext-103</a>
benchmark, which is designed to reward models that capture long-term
dependencies. On the <a href="http://www.statmt.org/lm-benchmark/">Billion Word
Benchmark</a>, the model is slightly worse
than the largest LSTM, but is faster to train and uses fewer resources.</p>
</li>
<li>
<p><strong>Machine Translation.</strong>
The goal in machine translation is to map sequences of English words to
sequences of, say, French words. Feed-forward models make translations using
only $k$ words of the sentence, whereas recurrent models can leverage the entire
sentence. Within the deep learning world, variants of the LSTM-based <a href="https://arxiv.org/abs/1409.0473">Sequence
to Sequence with Attention</a> model, particularly
<a href="https://arxiv.org/abs/1609.08144">Google Neural Machine Translation</a>, were
superseded first by a fully <a href="https://arxiv.org/abs/1705.03122">convolutional sequence to
sequence</a> model and then by the
<a href="https://arxiv.org/abs/1706.03762">Transformer</a>.<sup id="fnref:1"><a href="#fn:1" class="footnote">1</a></sup></p>
</li>
</ul>
<figure>
<p style="text-align:center;">
<img src="https://raw.githubusercontent.com/facebookresearch/fairseq/master/fairseq.gif" />
</p>
<figcaption>
<small>
Source: <a href="https://github.com/facebookresearch/fairseq/blob/master/fairseq.gif">
https://github.com/facebookresearch/fairseq/blob/master/fairseq.gif </a>
</small>
</figcaption>
</figure>
<ul>
<li>
<p><strong>Speech Synthesis.</strong>
In speech synthesis, one seeks to generate a realistic human speech signal.
Feed-forward models are limited to the past $k$ samples, whereas recurrent
models can use the entire history. Upon publication, the feed-forward,
autoregressive <a href="https://arxiv.org/abs/1609.03499">WaveNet</a> was a substantial
improvement over LSTM-RNN parametric models.</p>
</li>
<li>
<p><strong>Everthing Else.</strong>
Recently <a href="https://arxiv.org/abs/1803.01271">Bai et al.</a> proposed a generic
feed-forward model leveraging dilated convolutions and showed it outperforms
recurrent baselines on tasks ranging from synthetic copying tasks to music
generation.</p>
</li>
</ul>
<h1 id="how-can-feed-forward-models-outperform-recurrent-ones">How Can Feed-Forward Models Outperform Recurrent Ones?</h1>
<p>In the examples above, feed-forward networks achieve results on par with or
better than recurrent networks. This is perplexing since recurrent models
seem to be more powerful a priori. One explanation for this phenomenon is
given by <a href="https://arxiv.org/abs/1612.08083">Dauphin et al.</a>:</p>
<blockquote>
<p>The unlimited context offered by recurrent models is not strictly necessary
for language modeling.</p>
</blockquote>
<p>In other words, it’s possible you don’t need a large amount of context to do
well on the prediction task on average. <a href="https://arxiv.org/abs/1612.02526">Recent theoretical
work</a> offers some evidence in favor of this view.</p>
<p>Another explanation is given by <a href="https://arxiv.org/abs/1803.01271">Bai et al.</a>:</p>
<blockquote>
<p>The “infinite memory” advantage of RNNs is largely absent in practice.</p>
</blockquote>
<p>As Bai et al. report, even in experiments explicitly requiring long-term
context, RNN variants were unable to learn long sequences. On the Billion Word
Benchmark, an <a href="https://arxiv.org/abs/1703.10724">intriguing Google Technical
Report</a> suggests an LSTM $n$-gram model with
$n=13$ words of memory is as good as an LSTM with arbitrary context.</p>
<p>This evidence leads us to conjecture: <strong>Recurrent models <em>trained in practice</em>
are effectively feed-forward.</strong> This could happen either because truncated
backpropagation time cannot learn patterns significantly longer than $k$ steps,
or, more provocatively, because models <em>trainable by gradient descent</em> cannot
have long-term memory.</p>
<p>In <a href="https://arxiv.org/abs/1805.10369">our recent paper</a>, we study the gap
between recurrent and feed-forward models trained using gradient descent. We
show if the recurrent model is <em>stable</em> (meaning the gradients can not explode),
then the model can be well-approximated by a feed-forward network for the
purposes of both <em>inference and training.</em> In other words, we show feed-forward
and stable recurrent models trained by gradient descent are <em>equivalent</em> in the
sense of making identical predictions at test-time. Of course, not all models
trained in practice are stable. We also give empirical evidence the stability
condition can be imposed on certain recurrent models without loss in
performance.</p>
<h1 id="conclusion">Conclusion</h1>
<p>Despite some initial attempts, there is still much to do to understand
why feed-forward models are competitive with recurrent ones and
shed light onto the trade-offs between sequence models. How much memory is
really needed to perform well on common sequence benchmarks? What are the
expressivity trade-offs between truncated RNNs (which can be considered
feed-forward) and the convolutional models that are in popular use? Why can
feed-forward networks perform as well as unstable RNNs in practice?</p>
<p>Answering these questions is a step towards building a theory that can both
explain the strengths and limitations of our current methods and give guidance
about how to choose between different classes of models in concrete settings.</p>
<div class="footnotes">
<ol>
<li id="fn:1">
<p>The Transformer isn’t strictly a feed-forward model in the style described above (since it doesn’t make the $k$ step conditional independence assumption), but is not really a recurrent model because it doesn’t maintain a hidden state. <a href="#fnref:1" class="reversefootnote">↩</a></p>
</li>
</ol>
</div>
Fri, 27 Jul 2018 01:00:00 -0700
http://offconvex.github.io/2018/07/27/approximating-recurrent/
http://offconvex.github.io/2018/07/27/approximating-recurrent/Deep-learning-free Text and Sentence Embedding, Part 2<p>This post continues <a href="http://www.offconvex.org/2018/06/17/textembeddings/">Sanjeev’s post</a> and describes further attempts to construct elementary and interpretable text embeddings.
The previous post described the <a href="https://openreview.net/pdf?id=SyK00v5xx">the SIF embedding</a>, which uses a simple weighted combination of word embeddings combined with some mild “denoising” based upon singular vectors, yet outperforms many deep learning based methods, including <a href="https://arxiv.org/pdf/1506.06726.pdf">Skipthought</a>, on certain downstream NLP tasks such as sentence semantic similarity and entailment.
See also this <a href="http://nlp.town/blog/sentence-similarity/">independent study by Yves Peirsman</a>.</p>
<p>However, SIF embeddings embeddings ignore word order (similar to classic <em>Bag of Words</em> models in NLP), which leads to unexciting performance on many other downstream classification tasks.
(Even the denoising via SVD, which is crucial in similarity tasks, can sometimes reduces performance on other tasks.)
Can we design a text embedding with the simplicity and transparency of SIF while also incorporating word order information?
Our <a href="https://openreview.net/pdf?id=B1e5ef-C-">ICLR’18 paper</a> with Kiran Vodrahalli does this, and achieves strong empirical performance and also some surprising theoretical guarantees stemming from the <a href="https://en.wikipedia.org/wiki/Compressed_sensing">theory of compressed sensing</a>.
It is competitive with all pre-2018 LSTM-based methods on standard tasks.
Even better, it is much faster to compute, since it uses pretrained (GloVe) word vectors and simple linear algebra.</p>
<p style="text-align:center;">
<img src="/assets/unsupervised_pipeline.png" width="50%" alt="Pipeline" />
</p>
<h2 id="incorporating-local-word-order-n-gram-embeddings">Incorporating local word order: $n$-gram embeddings</h2>
<p><em>Bigrams</em> are ordered word-pairs that appear in the sentence, and $n$-grams are ordered $n$-tuples.
A document with $k$ words has $k-1$ bigrams and $k-n+1$ $n$-grams.
The <em>Bag of n-gram (BonG) representation</em> of a document refers to a long vector whose each entry is indexed by all possible $n$-grams, and contains the number of times the corresponding $n$-gram appears in the document.
Linear classifiers trained on BonG representations are a <a href="https://www.aclweb.org/anthology/P12-2018">surprisingly strong baseline for document classification tasks</a>.
While $n$-grams don’t directly encode long-range dependencies in text, one hopes that a fair bit of such information is implicitly present.</p>
<p>A trivial idea for incorporating $n$-grams into SIF embeddings would be to treat $n$-grams like words, and compute word embeddings for them using either GloVe and word2vec.
This runs into the difficulty that the number of distinct $n$-grams in the corpus gets very large even for $n=2$ (let alone $n=3$), making it almost impossible to solve word2vec or GloVe.
Thus one gravitates towards a more <em>compositional</em> approach.</p>
<blockquote>
<p><strong>Compositional $n$-gram embedding:</strong> Represent $n$-gram $g=(w_1,\dots,w_n)$ as the element-wise product $v_g=v_{w_1}\odot\cdots\odot v_{w_n}$ of the embeddings of its constituent words.</p>
</blockquote>
<p>Note that due to the element-wise multiplication we actually represent unordered $n$-gram information, not ordered $n$-grams (the performance for order-preserving methods is about the same).
Now we are ready to define our <em>Distributed Co-occurrence (DisC) embeddings</em>.</p>
<blockquote>
<p>The <strong>DisC embedding</strong> of a piece of text is just a concatenation for $(v_1, v_2, \ldots)$ where $v_n$ is the sum of the $n$-gram embeddings of all $n$-grams in the document (for $n=1$ this is just the sum of word embeddings).</p>
</blockquote>
<p>Note that DisC embeddings leverage classic Bag-of-n-Gram information as well as the power of word embeddings.
For instance, the sentences <em>“Loved this movie!”</em> and <em>“I enjoyed the film.”</em> share no $n$-gram information for any $n$, but their DisC embeddings are fairly similar.
Thus if the first example comes with a label, it gives the learner some idea of how to classify the second.
This can be useful especially in settings with few labeled examples; e.g. DisC outperform BonG on the Stanford Sentiment Treebank (SST) task, which has only 6,000 labeled examples.
DisC embeddings also beat SIF and a standard LSTM-based method, Skipthoughts.
On the much larger IMDB testbed, BonG still reigns at top (although DisC is not too far behind).</p>
<div style="text-align:center;">
<img src="/assets/clfperf_sst_imdb.png" width="70%" alt="Performance on SST and IMDB" />
</div>
<p>Skip-thoughts does match or beat our DisC embeddings on some other classification tasks, but that’s still not too shabby an outcome for such a simple method. (By contrast, LSTM methods can take days or weeks of training, and are quite slow to evaluate at test time on a new piece of text.)</p>
<div style="text-align:center;">
<img src="/assets/sentenceembedtable.jpg" width="70%" alt="Performance on various classification tasks" />
</div>
<h2 id="some-theoretical-analysis-via-compressed-sensing">Some theoretical analysis via compressed sensing</h2>
<p>A linear SIF-like embedding represents a document with Bag-of-Words vector $x$ as
<script type="math/tex">\sum_w \alpha_w x_w v_w,</script>
where $v_w$ is the embedding of word $w$ and $\alpha_w$ is a scaling term.
In other words, it represents document $x$ as $A x$ where $A$ is the matrix with as many columns as the number of words in the language, and the column corresponding to word $w$ is $\alpha_w A$.
Note that $x$ has many zero coordinates corresponding to words that don’t occur in the document; in other words it’s a <em>sparse</em> vector.</p>
<p>The starting point of our DisC work was the realization that perhaps the reason SIF-like embeddings work reasonably well is that they <em>preserve</em> the Bag-of-words information, in the sense that it may be possible to <em>easily recover</em> $x$ from $A$.
This is not an outlandish conjecture at all, because <a href="https://en.wikipedia.org/wiki/Compressed_sensing"><em>compressed sensing</em></a> does exactly this when $x$ is suitably sparse and matrix $A$ has some nice properties such as RIP or incoherence.
A classic example is when $A$ is a random matrix, which in our case corresponds to using random vectors as word embeddings.
Thus one could try to use random word embeddings instead of GloVe vectors in the construction and see what happens!
Indeed, we find that so long as we raise the dimension of the word embeddings, then text embeddings using random vectors do indeed converge to the performance of BonG representations.</p>
<p>This is a surprising result, as compressed sensing does not imply this per se, since the ability to reconstruct the BoW vector from its compressed version doesn’t directly imply that the compressed version gives the same performance as BoW on linear classification tasks.
However, a result of <a href="https://pdfs.semanticscholar.org/627c/14fe9097d459b8fd47e8a901694198be9d5d.pdf">Calderbank, Jafarpour, & Schapire</a> shows that the compressed sensing condition that implies optimal recovery also implies good performance on linear classification under compression. Intuitively, this happens because of two facts.</p>
<p><script type="math/tex">\mbox{1) Optimum linear classifier $c^*$ is convex combination of datapoints.} \quad c^* = \sum_{i}\alpha_i x_i.</script>
<script type="math/tex">% <![CDATA[
\mbox{(2) RIP condition implies} <Ax, Ax'> \approx <x, x'>~\mbox{for $k$-sparse}~x, x'. %]]></script></p>
<p>Furthermore, by extending these ideas to the $n$-gram case, we show that our DisC embeddings computed using random word vectors, which can be seen as a linear compression of the BonG representation, can do as well as the original BonG representation on linear classification tasks. To do this we prove that the “sensing” matrix $A$ corresponding to DisC embeddings satisfy the <em>Restricted Isometry Property (RIP)</em> introduced in the seminal paper of <a href="https://statweb.stanford.edu/~candes/papers/DecodingLP.pdf">Candes & Tao</a>. The theorem relies upon <a href="http://www.cis.pku.edu.cn/faculty/vision/zlin/A%20Mathematical%20Introduction%20to%20Compressive%20Sensing.pdf">compressed sensing results for bounded orthonormal systems</a> and says that then the performance of DisC embeddings on linear classification tasks approaches that of BonG vectors as we increase the dimension.
Please see our paper for details of the proof.</p>
<p>It is worth noting that our idea of composing objects (words) represented by random vectors to embed structures ($n$-grams/documents) is closely related to ideas in neuroscience and neural coding proposed by <a href="http://www2.fiit.stuba.sk/~kvasnicka/CognitiveScience/6.prednaska/plate.ieee95.pdf">Tony Plate</a> and <a href="http://www.rctn.org/vs265/kanerva09-hyperdimensional.pdf">Pentti Kanerva</a>.
They also were interested in how these objects and structures could be recovered from the representations;
we take the further step of relating the recoverability to performance on a downstream linear classification task.
Text classification over compressed BonG vectors has been proposed before by <a href="https://papers.nips.cc/paper/4932-compressive-feature-learning.pdf">Paskov, West, Mitchell, & Hastie</a>, albeit with a more complicated compression that does not achieve a low-dimensional representation (dimension >100,000) due to the use of classical lossless algorithms rather than linear projection.
Our work ties together these ideas of composition and compression into a simple text representation method with provable guarantees.</p>
<h2 id="a-surprising-lower-bound-on-the-power-of-lstm-based-text-representations">A surprising lower bound on the power of LSTM-based text representations</h2>
<p>The above result also leads to a new theorem about deep learning: <em>text embeddings computed using low-memory LSTMs can do at least as well as BonG representations on downstream classification tasks</em>.
At first glance this result may seem uninteresting: surely it’s no surprise that the field’s latest and greatest method is at least as powerful as its oldest?
But in practice, most papers on LSTM-based text embeddings make it a point to compare to performance of BonG baseline, and <em>often are unable to improve upon that baseline</em>!
Thus empirically this new theorem had not been clear at all! (One reason could be that our theory requires the random embeddings to be somewhat higher dimensional than the LSTM work had considered.)</p>
<p>The new theorem follows from considering an LSTM that uses random vectors as word embeddings and computes the DisC embedding in one pass over the text. (For details see our appendix.)</p>
<p>We empirically tested the effect of dimensionality by measuring performance of DisC on IMDb sentiment classification.
As our theory predicts, the accuracy of DisC using random word embeddings converges to that of BonGs as dimensionality increases. (In the figure below “Rademacher vectors” are those with entries drawn randomly from $\pm1$.) Interestingly we also find that DisC using pretrained word embeddings like GloVe reaches BonG performance at much smaller dimensions, an unsurprising but important point that we will discuss next.</p>
<div style="text-align:center;">
<img src="/assets/imdbperf_uni_bi.png" width="60%" />
</div>
<h2 id="unexplained-mystery-higher-performance-of-pretrained-word-embeddings">Unexplained mystery: higher performance of pretrained word embeddings</h2>
<p>While compressed sensing theory is a good starting point for understanding the power of linear text embeddings, it leaves some mysteries.
Using pre-trained embeddings (such as GloVe) in DisC gives higher performance than random embeddings, both in recovering the BonG information out of the text embedding, as well as in downstream tasks. However, pre-trained embeddings do not satisfy some of the nice properties assumed in compressed sensing theory such as RIP or incoherence, since those properties forbid pairs of words having similar embeddings.</p>
<p>Even though the matrix of embeddings does not satisfy these classical compressed sensing properties, we find that using Basis Pursuit, a sparse recovery approach related to LASSO with provable guarantees for RIP matrices, we can recover Bag-of-Words information better using GloVe-based text embeddings than from embeddings using random word vectors (measuring success via the $F_1$-score of the recovered words — higher is better).</p>
<div style="text-align:center;">
<img src="/assets/recovery.png" width="60%" />
</div>
<p>Note that random embeddings are better than pretrained embeddings at recovering words from random word salad (the right-hand image).
This suggests that pretrained embeddings are specialized — thanks to their training on a text corpus — to do well only on real text rather than a random collection of words.
It would be nice to give a mathematical explanation for this phenomenon.
We suspect that this should be possible using a result of <a href="http://www.pnas.org/content/pnas/102/27/9446.full.pdf">Donoho & Tanner</a>, which we use to show that words in a document can be recovered from the sum of word vectors if and only if there is a hyperplane containing the vectors for words in the document with the vectors for all other words on one side of it.
Since co-occurring words will have similar embeddings, that should make it easier to find such a hyperplane separating words in a document from the rest of the words and hence would ensure good recovery.</p>
<p>However, even if this could be made more rigorous, it would only imply sparse recovery, not good performance on classification tasks.
Perhaps assuming a generative model for text, like the RandWalk model discussed in an <a href="https://www.offconvex.org/2016/02/14/word-embeddings-2/">earlier post</a>, could help move this theory forward.</p>
<h2 id="discussion">Discussion</h2>
<p>Could we improve the performance of such simple embeddings even further?
One promising idea is to define better $n$-gram embeddings than the simple compositional embeddings defined in DisC.
An independent <a href="https://arxiv.org/abs/1703.02507">NAACL’18 paper</a> of Pagliardini, Gupta, & Jaggi proposes a text embedding similar to DisC in which unigram and bigram embeddings are trained specifically to be added together to form sentence embeddings, also achieving good results, albeit not as good as DisC.
(Of course, their training time is higher than ours.)
In our upcoming <a href="https://arxiv.org/abs/1805.05388">ACL’18 paper</a> with Yingyu Liang, Tengyu Ma, & Brandon Stewart we give a very simple and efficient method to induce embeddings for $n$-grams as well as other rare linguistic features that improves upon DisC and beats skipthought on several other benchmarks.
This will be described in a future blog post.</p>
<p>Sample code for constructing and evaluating DisC embeddings is <a href="https://github.com/NLPrinceton/text_embedding">available</a>, as well as <a href="https://github.com/NLPrinceton/sparse_recovery">solvers</a> for recreating the sparse recovery results for word embeddings.</p>
Mon, 25 Jun 2018 03:00:00 -0700
http://offconvex.github.io/2018/06/25/textembeddings/
http://offconvex.github.io/2018/06/25/textembeddings/Deep-learning-free Text and Sentence Embedding, Part 1<p>Word embeddings (see my old <a href="http://www.offconvex.org/2015/12/12/word-embeddings-1/">post1</a> and
<a href="http://www.offconvex.org/2016/02/14/word-embeddings-2/">post2</a>) capture the idea that one can express “meaning” of words using a vector, so that the cosine of the angle between the vectors captures semantic similarity. (“Cosine similarity” property.) Sentence embeddings and text embeddings try to achieve something similar: use a fixed-dimensional vector to represent a small piece of text, say a sentence or a small paragraph. The performance of such embeddings can be tested via the Sentence Textual Similarity (STS) datasets (see the <a href="http://ixa2.si.ehu.es/stswiki/index.php/Main_Page">wiki page</a>), which contain sentence pairs humanly-labeled with similarity ratings.</p>
<p style="text-align:center;">
<img src="/assets/textembeddingvectorslide.jpg" width="30%" alt="What are text embeddings." />
</p>
<p>A general hope behind computing text embeddings is that they can be learnt using a large <em>unlabeled</em> text corpus (similar to word embeddings) and then allow good performance on downstream classification tasks with few <em>labeled</em> examples. Thus the overall pipeline could look like this:</p>
<p style="text-align:center;">
<img src="/assets/textembeddingpipeline.jpg" width="80%" alt="How are text embeddings used in downstream classification task." />
</p>
<p>Computing such representations is a form of <a href="http://www.offconvex.org/2017/06/26/unsupervised1/">representation learning as well as unsupervised learning</a>. This post will be an introduction to <strong>extremely simple</strong> ways of computing sentence embeddings, which on many standard tasks, beat many state-of-the-art deep learning methods. This post is based upon <a href="https://openreview.net/pdf?id=SyK00v5xx">my ICLR’17 paper on SIF embeddings</a> with Yingyu Liang and Tengyu Ma.</p>
<h2 id="existing-methods">Existing methods</h2>
<p><a href="https://dl.acm.org/citation.cfm?id=2133826">Topic modeling</a> is a classic technique for unsupervised learning on text and it also yields a vector representation for a paragraph (or longer document), specifically, the vector of “topics” occuring in this document and their relative proportions. Unfortunately, topic modeling is not accurate at producing good representations at the sentence or short paragraph level, and furthermore there appears to be no variant of topic modeling that leads to the good cosine similarity property that we desire.</p>
<p><em>Recurrent neural net</em> is the default deep learning technique to train a <a href="https://www.tensorflow.org/tutorials/recurrent">language model</a>. It scans the text from left to right, maintaining a fixed-dimensional vector-representation of the text it has seen so far. It’s goal is to use this representation to predict the next word at each time step, and the training objective is to maximise log-likelihood of the data (or similar). Thus for example, a well-trained model when given a text fragment <em>“I went to the cafe and ordered a ….”</em> would assign high probability to <em>“coffee”, “croissant”</em> etc. and low probability to <em>“puppy”</em>. Myriad variations of such language models exist, many using biLSTMs which have some long-term memory and can scan the text forward and backwards. Lately biLSTMs have been replaced by convolutional architectures with attention mechanism; see for instance <a href="http://papers.nips.cc/paper/7181-attention-is-all-you-need.pdf">this paper</a>.</p>
<p>One obtains a text representation by peeking at the internal representation (i.e., node activations) at the top layer of this deep model. After all, when the model is scanning through text, its ability to predict the next word must imply that this internal representation implicitly captures a gist of all it has seen, reflecting rules of grammar, common-sense etc. (e.g., that you don’t order a puppy at a cafe). Some notable modern efforts along such lines are <a href="https://arxiv.org/abs/1506.01057">Hierarchichal Neural Autoencoder of Li et al.</a> as well as <a href="https://arxiv.org/abs/1502.06922">Palangi et al</a>, and <a href="https://arxiv.org/abs/1506.06726"><em>Skipthought</em> of Kiros et al.</a>.</p>
<p>As with all deep learning models, one wishes for interpretability: what information exactly did the machine choose to put into the text embedding? Besides <a href="https://people.csail.mit.edu/beenkim/papers/BeenK_FinaleDV_ICML2017_tutorial.pdf">the usual reasons for seeking interpretability</a>, in an NLP context it may help us leverage additional external resources such as <a href="https://wordnet.princeton.edu/">WordNet</a> in the task. Other motivations include
transfer learning/domain adaptation (to solve classification tasks for a small text corpus, leverage text embeddings trained on a large unrelated corpus).</p>
<h2 id="surprising-power-of-simple-linear-representations">Surprising power of simple linear representations</h2>
<p>In practice, many NLP applications rely on a simple sentence embedding: the average of the embeddings of the words in it. This makes some intuitive sense, because recall that the <a href="https://arxiv.org/pdf/1310.4546.pdf">Word2Vec paper</a> uses the following expression (in the their simpler CBOW word embedding)</p>
<script type="math/tex; mode=display">\Pr[w~|~w_1,w_2, w_3, w_4, w_5] \propto \exp(v_w \cdot (\frac{1}{5} \sum_i v_{w_i}). \qquad (1)</script>
<p>which suggests that the sense of a sequence of words is captured via simple average of word vectors.</p>
<p>While this simple average has only fair performance in capturing sentence similarity via cosine similarity, it can be quite powerful in downstream classification tasks (after passing through a single layer neural net) as shown in a
surprising paper of <a href="https://arxiv.org/abs/1511.08198">Wieting et al. ICLR’16</a>.</p>
<h2 id="better-linear-representation-sif-embeddings">Better linear representation: SIF embeddings</h2>
<p>My <a href="https://openreview.net/pdf?id=SyK00v5xx">ICLR’17 paper</a> with Yingyu Liang and Tengyu Ma improved such simple averaging using our <strong>SIF</strong> embeddings. They’re motivated by the empirical observation that word embeddings have various pecularities stemming from the training method, which tries to capture word cooccurence probabilities using vector inner product, and words sometimes occur out of context in documents. These anomalies cause the average of word vectors to have nontrivial components along semantically meaningless directions. SIF embeddings try to combat this in two ways, which I describe intuitively first, followed by more theoretical justification.</p>
<p><strong>Idea 1: Nonuniform weighting of words.</strong>
Conventional wisdom in information retrieval holds that “frequent words carry less signal.” Usually this is captured via <a href="https://en.wikipedia.org/wiki/Tf%E2%80%93idf">TF-IDF weighting</a>, which assigns weightings to words inversely proportional to their frequency. We introduce a new variant we call <em>Smoothed Inverse Frequency</em> (SIF) weighting,
which assigns to word $w$ a weighting $\alpha_w = a/(a+ p_w)$ where $p_w$ is the frequency of $w$ in the corpus and $a$ is a hyperparameter. Thus the embedding of a piece of text is $\sum_w \alpha_w v_w$ where the sum is over words in it.
(Aside: word frequencies can be estimated from any sufficiently large corpus; we find embedding quality to be not too dependent upon this.)</p>
<p>On a related note, we found that folklore understanding of word2vec, viz., expression (1), is <em>false.</em> A dig into the code reveals a resampling trick that is tantamount to a weighted average quite similar to our SIF weighting. (See Section 3.1 in our paper for a discussion.)</p>
<p><strong>Idea 2: Remove component from top singular direction.</strong>
The next idea is to modify the above weighted average by removing the component in a special direction, corresponding to the top singular direction set of weighted embeddings of a smallish sample of sentences from the domain (if doing domain adaptation, component is computed using sentences of the target domain). The paper notes that the direction corresponding to the top singular vector tends to contain information related to grammar and stop words, and removing the component in this subspace really cleans up the text embedding’s ability to express meaning.</p>
<h2 id="theoretical-justification">Theoretical justification</h2>
<p>A notable part of our paper is to give a theoretical justification for this weighting using a generative model for text similar to one used in our <a href="http://aclweb.org/anthology/Q16-1028">word embedding paper in TACL’16</a> as described in <a href="http://www.offconvex.org/2016/02/14/word-embeddings-2/">my old post</a>.
That model tries to give the causative relationship between word meanings and their cooccurence probabilities. It thinks of corpus generation as a dynamic process, where the $t$-th word is produced at step $t$. The model says that the process is driven by the random walk of a <em>discourse</em> vector $c_t \in \Re^d$. It is a unit vector whose direction in space represents <em>what is being talked about.</em>
Each word has a (time-invariant) latent vector $v_w \in \Re^d$ that captures its correlations with the discourse vector. We model this bias with a loglinear word production model:</p>
<script type="math/tex; mode=display">\Pr[w~\mbox{emitted at time $t$}~|~c_t] \propto \exp(c_t\cdot v_w). \qquad (2)</script>
<p>The discourse vector does a slow geometric random walk over the unit sphere in $\Re^d$. Thus $c_{t+1}$ is obtained by a small random displacement from $c_t$. Since expression (2) places much higher probability on words that are clustered around $c_t$, and $c_t$ moves slowly. If the discourse vector moves slowly, then we can assume a single discourse vector gave rise to the entire sentence or short paragraph. Thus given a sentence, a plausible vector representation of its “meaning” is a <em>max a posteriori</em> (MAP) estimate of the discourse vector that generated it.</p>
<p>Such models have been empirically studied for a while, but our paper gave a theoretical analysis, and showed that various subcases imply standard word embedding methods such as word2vec and GloVe. For example, it shows that MAP estimate of the discourse vector is the simple average of the embeddings of the preceding $k$ words – in other words, the average word vector!</p>
<p>This model is clearly simplistic and our ICLR’17 paper suggests two correction terms, intended to account for words occuring out of context, and to allow some common words (<em>“the”, “and”, “but”</em> etc.) appear often regardless of the discourse. We first introduce an additive term $\alpha p(w)$ in the log-linear model, where $p(w)$ is the unigram probability (in the entire corpus) of word and $\alpha$ is a scalar. This allows words to occur even if their vectors have very low inner products with $c_s$.
Secondly, we introduce a common discourse vector $c_0\in \Re^d$ which serves as a correction term for the most frequent discourse that is often related to syntax. It boosts the co-occurrence probability of words that have a high component along $c_0$.(One could make other correction terms, which are left to future work.) To put it another way, words that need to appear a lot out of context can do so by having a component along $c_0$, and the size of this component controls its probability of appearance out of context.</p>
<p>Concretely, given the discourse vector $c_s$ that produces sentence $s$, the probability of a word $w$ is emitted in the sentence $s$ is modeled as follows, where $\tilde{c}_{s} = \beta c_0 + (1-\beta) c_s, c_0 \perp c_s$,
$\alpha$ and $\beta$ are scalar hyperparameters:</p>
<script type="math/tex; mode=display">% <![CDATA[
\Pr[w \mid s] = \alpha p(w) + (1-\alpha) \frac{\exp(<\tilde{c}_{s}, v_w>)}{Z_{\tilde{c,s}}}, %]]></script>
<p>where</p>
<script type="math/tex; mode=display">% <![CDATA[
Z_{\tilde{c,s}} = \sum_{w} \exp(<\tilde{c}_{s}, v_w>) %]]></script>
<p>is the normalizing constant (the partition function). We see that the model allows a word $w$ unrelated to the discourse $c_s$ to be emitted for two reasons: a) by chance from the term $\alpha p(w)$; b) if $w$ is correlated with the common direction $c_0$.</p>
<p>The paper shows that the MAP estimate of the $c_s$ vector corresponds to the SIF embeddings described earlier, where the top singular vector used in their construction is an estimate of the $c_0$ vector in the model.</p>
<h2 id="empirical-performance">Empirical performance</h2>
<p>The performance of this embedding scheme appears in the figure below. Note that Wieting et al. had already shown that their method (which is semi-supervised, relying upon a large unannotated corpus and a small annotated corpus) beats many LSTM-based methods. So this table only compares to their work; see the papers for comparison with more past work.</p>
<p style="text-align:center;">
<img src="/assets/textembedexperiments.jpg" width="80%" alt="Performance of our embedding on downstream classification tasks" />
</p>
<p>For other performance results please see the paper.</p>
<h2 id="next-post">Next post</h2>
<p>In the next post, I will sketch improvements to the above embedding in two of our new papers. Special guest appearance: Compressed Sensing (aka Sparse Recovery).</p>
<p>The SIF embedding package is available <a href="https://github.com/PrincetonML/SIF">from our github page</a></p>
Sun, 17 Jun 2018 03:00:00 -0700
http://offconvex.github.io/2018/06/17/textembeddings/
http://offconvex.github.io/2018/06/17/textembeddings/Limitations of Encoder-Decoder GAN architectures<p>This is yet another post about <a href="http://www.offconvex.org/2017/03/15/GANs/">Generative Adversarial Nets (GANs)</a>, and based upon our new <a href="https://openreview.net/forum?id=BJehNfW0-">ICLR’18 paper</a> with Yi Zhang. A quick recap of the story so far. GANs are an unsupervised method in deep learning to learn interesting distributions (e.g., images of human faces), and also have a plethora of uses for image-to-image mappings in computer vision. Standard GANs training is motivated using this task of distribution learning, and is designed with the idea that given large enough deep nets and enough training examples, as well as accurate optimization, GANs will learn the full distribution.</p>
<p><a href="http://www.offconvex.org/2017/03/30/GANs2/">Sanjeev’s previous post</a> concerned <a href="https://arxiv.org/abs/1703.00573">his co-authored ICML’17 paper</a> which called this intuition into question when the deep nets have finite capacity. It shows that the training objective has near-equilibria where the discriminator is fooled —i.e., training objective is good—but the generator’s distributions has very small support, i.e. shows <em>mode collapse.</em> This is a failure of the model, and raises the question whether such bad equilibria are found in real-life training. A <a href="http://www.offconvex.org/2017/07/07/GANs3/">second post</a> showed empirical evidence that they do, using the birthday-paradox test.</p>
<p>The current post concerns our <a href="https://arxiv.org/abs/1711.02651">new result</a> (part of our upcoming <a href="https://openreview.net/forum?id=BJehNfW0-">ICLR paper</a>) which shows that bad equilibria exist also in more recent GAN architectures based on simultaneously learning an <em>encoder</em> and <em>decoder</em>. This should be surprising because many researchers believe that encoder-decoder architectures fix many issues with GANs, including mode collapse.</p>
<p>As we will see, encoder-decoder GANs seem very powerful. In particular, the proof of the previously mentioned <a href="http://www.offconvex.org/2017/03/30/GANs2/">negative result</a> utterly breaks down for this architecture. But, we then discovered a cute argument that shows encoder-decoder GANs can have poor solutions, featuring not only mode collapse but also encoders that map images to nonsense (more precisely Gaussian noise). This is the worst possible failure of the model one could imagine.</p>
<h2 id="encoder-decoder-architectures">Encoder-decoder architectures</h2>
<p>Encoders and decoders have long been around in machine learning in various forms – especially deep learning. Speaking loosely, underlying all of them are two basic assumptions: <br />
(1) Some form of the so-called <a href="https://mitpress.mit.edu/sites/default/files/titles/content/9780262033589_sch_0001.pdf"><em>manifold assumption</em></a> which asserts that high-dimensional data such as real-life images lie (roughly) on a low-dimensional manifold. (“Manifold” should be interpreted rather informally – sometimes this intuition applies only very approximately sometimes it’s meant in a “distributional” sense, etc.) <br />
(2) The low-dimensional structure is “meaningful”: if we think of an image $x$ as a high-dimensional vector and its “code” $z$ as its coordinates on the low-dimensional manifold, the code $z$ is thought of as a “high-level” descriptor of the image.</p>
<p>With the above two points in mind, an <em>encoder</em> maps the image to its code, and a <em>decoder</em> computes the reverse map. (We also discussed encoders and decoders in <a href="http://www.offconvex.org/2017/06/27/unsupervised1/">our earlier post on representation learning</a> in a more general setup.)</p>
<p style="text-align:center;">
<img src="/assets/BIGAN_manifold2.jpg" width="80%" alt="Manifold structure" />
</p>
<h2 id="encoder-decoder-gans">Encoder-Decoder GANs</h2>
<p>These were introduced by <a href="https://arxiv.org/abs/1606.00704">Dumoulin et al.(ALI)</a> and <a href="https://arxiv.org/abs/1605.09782">Donahue et al.(BiGAN)</a>. They involve two competitors: Player 1 involves a discriminator net $D$ that is given an input of the form (image, code) and it outputs a number in the interval $[0,1]$, which denotes its “satisfaction level” with this input. Player 2 trains a decoder net $G$ (also called <em>generator</em> in the GANs setting) and an encoder net $E$.</p>
<p>Recall that in the standard GAN, discriminator tries to distinguish real images from images generated by the generator $G$. Here
discriminator’s input is an image and its code. Specifically, Player 1 is trying to train its net to distinguish between the following two settings, and Player 2 is trying to make sure the two settings look indistinguishable to Player 1’s net.</p>
<p><script type="math/tex">\mbox{Setting 1: presented with}~(x, E(x))~\mbox{where $x$ is random real image}.</script>
<script type="math/tex">\mbox{Setting 2: presented with}~(G(z), z)~\mbox{where $z$ is random code}.</script></p>
<p>(Here it is assumed that a random code is a vector with i.i.d gaussian coordinates, though one could consider other distributions.)</p>
<p style="text-align:center;">
<img src="/assets/BIGAN_2settings_v2.jpg" width="80%" alt="Two settings which discriminator net has to distinguish between" />
</p>
<p>The hoped-for equilibrium obviously is one where generator and encoder are inverses of each other: $E(G(z)) \approx z$ and $G(E(x)) \approx x$, and the joint distributions $(z,G(z))$ and $(E(x), x)$ roughly match.
The underlying intuition is that if this happens, Player 1 must’ve produced a “meaningful” representation $E(x)$ for the images – and this should improve the quality of the generator as well.
Indeed, <a href="https://arxiv.org/abs/1606.00704">Dumoulin et al.(ALI)</a> provide some small-scale empirical examples on mixtures of Gaussians for which encoder-decoder architectures seem to ameliorate the problem of mode collapse.</p>
<p>The above papers prove that when the encoder/decoder/discriminator have infinite capacity, the desired solution is indeed an equilibrium. However, we’ll see that things are very different when capacities are finite.</p>
<h2 id="finite-capacity-discriminators-are-weak">Finite-capacity discriminators are weak</h2>
<p>Say a generator/encoder pair $(G,E)$ $\epsilon$-<em>fools</em> a decoder $D$ if</p>
<script type="math/tex; mode=display">|E_{x} D(x, E(x)) - E_{z} D(G(z), z)| \leq \epsilon</script>
<p>In other words, $D$ has roughly similar output in Settings 1 and 2.</p>
<p>Our theorem applies when the distribution consists of realistic images, as explained later. We show the following:</p>
<blockquote>
<p>(Informal theorem) If the discriminator $D$ has capacity (i.e. number of parameters) at most $p$, then there is an encoder $E$ of capacity $\ll p$ and generator $G$ of slightly larger capacity than $p$ such that $(G, E)$ can $\epsilon$-fool every such $D$. Furthermore, the generator exhibits mode collapse: its distribution is essentially supported on a bit more than $p$ images, and the encoder $E$ just outputs white noise (i.e. does not extract any “meaningful” representation) given an image.</p>
</blockquote>
<p>(Note that such a $(G, E)$ represents an $\epsilon$-approximate equilibrium, in the sense that player 1 cannot gain more than $\epsilon$ in the distinguishing probability by switching its discriminator. )</p>
<p>It is important that the encoder’s capacity is much less than $p$, and thus the theorem allows a discriminator that is able to simulate $E$ if it needed, and in particular verify for a random seed $z$ that $E(G(z)) \approx z$. The theorem says that even the ability to conduct such a verification cannot give it power to force encoder to produce meaningful codes. This is a counterintuitive aspect of the result. The main difficulty in the proof (which stumped us for a bit) was how to exhibit such an equilibrium where $E$ is a small net.</p>
<p>This is ensured by a simple assumption. We assume the image distribution is mildly “noised”: say, every 100th pixel is replaced by Gaussian noise. To a human, such an image would of course be indistinguishable from a real image. (NB: Our proof could be carried out via some other assumptions to the effect that images have an innate stochastic/noise component that is efficiently extractable by a small neural network. But let’s keep things clean.) When noise $\eta$ is thus added to an image $x$, we denote the resulting image as $x \odot \eta$.</p>
<p>Now the encoder will be rather trivial: given the noised image $x \odot \eta$, output $\eta$. Clearly, such an encoder does not in any sense capture “meaning” in the image. It is also implementable by a tiny single-layer net, as required by the theorem.</p>
<h3 id="construction-of-generator">Construction of generator</h3>
<p>As usual in the GAN literature, we will assume the discriminator is $L$-<a href="https://www.encyclopediaofmath.org/index.php/Lipschitz_constant">Lipschitz</a>. This can be a loose upperbound, since only $\log L$ enters quantitatively in the proof.</p>
<p>The generator $G(z)$ in the theorem statement memorizes a hash function that partitions the set of all seeds/codes $z$ into $m$ equal-sized blocks; it also memorizes a “pool” of $m := p \log^2(pL)/ \epsilon^2$ unnoised images $\tilde{x}_1, \tilde{x}_2, \dots, \tilde{x}_m$. When presented with a random seed $z$, the generator computes the block of the partition that $z$ lies in, and then produces the image $\tilde{x}_i \odot z$, where $i$ is the block $z$ belongs to. (See the Figure below.)</p>
<p style="text-align:center;">
<img src="/assets/BIGAN_construction_2.jpg" width="50%" alt="The bad generator construction" />
</p>
<p>Now we have to prove that such a memorizing generator exists that $\epsilon$-fools all discriminators of capacity $p$. This is shown by the <a href="https://en.wikipedia.org/wiki/Probabilistic_method">probabilistic method</a>: we describe a distribution over generators $G$ that works “in expectation”, and subsequently use concentration bounds to prove there exists at least one generator that does the job.</p>
<p>The distribution on $G$’s is straightforward: we select the pool of (unnoised) images
$\tilde{x}_1, \tilde{x}_2, .., \tilde{x}_m$ at random. Why is this distribution for $G$ sensible? Notice the following simple fact:</p>
<script type="math/tex; mode=display">E_{G} E_{z} D(G(z), z) = E_{\tilde{x}, z} D(\tilde{x} \odot z, z) = E_{x} D(x, E(x)) \hspace{2cm} (3)</script>
<p>In other words, the “expected” encoder correctly matches the expectation of $D(x, E(x))$, so that the discriminator is fooled “in expectation”.
This of course is not enough: we need some kind of concentration argument to show a particular $G$ works against <em>all possible discriminators</em>, which will ultimately use the fact that the discriminator $D$ has a small capacity and small Lipschitz constant. (Think covering number arguments in learning theory.)</p>
<p>Towards that, another useful observation: if $q$ is the uniform distribution over sets $T= {z_1, z_2,\dots, z_m}$, s.t. each $z_i$ is independently sampled from the conditional distribution inside the $i$-th block of the partition of the noise space, by the law of total expectation one can see that
<script type="math/tex">E_{z} D(G(z), z) = E_{T \sim q} \frac{1}{m} \sum_{i=1}^m D(G(z_i), z_i)</script>
The right hand side is an average of terms, each of which is a bounded function of mutually independent random variables – so, by e.g. McDiarmid’s inequality it concentrates around it’s expectation, which by (3) is exactly $E_{z} D(G(z), z)$.</p>
<p>To finish the argument off, we use the fact that due to Lipschitzness and the bound on the number of parameters, the “effective” number of distinct discriminators is small, so we can union bound over them. (Formally, this translates to an epsilon-net + union bound argument. This also gives rise to the value of $m$ used in the construction.)</p>
<h2 id="takeaway">Takeaway</h2>
<p>The result should be interpreted as saying that possibly the theoretical foundations of GANs need more work. The current way of thinking about them as distribution learners may not be the right way to formalize them. Furthermore, one has to take care about transfering notions invented for distribution learning, such as encoders and decoders, over into the GANs setting. Finally there is an empirical question whether any of the <a href="https://deephunt.in/the-gan-zoo-79597dc8c347">myriad GANS variations</a> can avoid mode collapse.</p>
Mon, 12 Mar 2018 03:00:00 -0700
http://offconvex.github.io/2018/03/12/bigan/
http://offconvex.github.io/2018/03/12/bigan/Can increasing depth serve to accelerate optimization?<p>“How does depth help?” is a fundamental question in the theory of deep learning. Conventional wisdom, backed by theoretical studies (e.g. <a href="http://proceedings.mlr.press/v49/eldan16.pdf">Eldan & Shamir 2016</a>; <a href="http://proceedings.mlr.press/v70/raghu17a/raghu17a.pdf">Raghu et al. 2017</a>; <a href="http://proceedings.mlr.press/v65/lee17a/lee17a.pdf">Lee et al. 2017</a>; <a href="http://proceedings.mlr.press/v49/cohen16.pdf">Cohen et al. 2016</a>; <a href="http://proceedings.mlr.press/v65/daniely17a/daniely17a.pdf">Daniely 2017</a>; <a href="https://openreview.net/pdf?id=B1J_rgWRW">Arora et al. 2018</a>), holds that adding layers increases expressive power. But often this expressive gain comes at a price –optimization is harder for deeper networks (viz., <a href="https://en.wikipedia.org/wiki/Vanishing_gradient_problem">vanishing/exploding gradients</a>). Recent works on “landscape characterization” implicitly adopt this worldview (e.g. <a href="https://papers.nips.cc/paper/6112-deep-learning-without-poor-local-minima.pdf">Kawaguchi 2016</a>; <a href="https://openreview.net/pdf?id=ryxB0Rtxx">Hardt & Ma 2017</a>; <a href="http://proceedings.mlr.press/v38/choromanska15.pdf">Choromanska et al. 2015</a>; <a href="http://openaccess.thecvf.com/content_cvpr_2017/papers/Haeffele_Global_Optimality_in_CVPR_2017_paper.pdf">Haeffele & Vidal 2017</a>; <a href="https://arxiv.org/pdf/1605.08361.pdf">Soudry & Carmon 2016</a>; <a href="https://arxiv.org/pdf/1712.08968.pdf">Safran & Shamir 2017</a>). They prove theorems about local minima and/or saddle points in the objective of a deep network, while implicitly assuming that the ideal landscape would be convex (single global minimum, no other critical point). My <a href="https://arxiv.org/pdf/1802.06509.pdf">new paper</a> with Sanjeev Arora and Elad Hazan makes the counterintuitive suggestion that sometimes, increasing depth can <em>accelerate</em> optimization.</p>
<p>Our work can also be seen as one more piece of evidence for a nascent belief that <em>overparameterization</em> of deep nets may be a good thing. By contrast, classical statistics discourages training a model with more parameters than necessary <a href="https://www.rasch.org/rmt/rmt222b.htm">as this can lead to overfitting</a>.</p>
<h2 id="ell_p-regression">$\ell_p$ Regression</h2>
<p>Let’s begin by considering a very simple learning problem - scalar linear regression with $\ell_p$ loss (our theory and experiments will apply to $p>2$):</p>
<script type="math/tex; mode=display">\min_{\mathbf{w}}~L(\mathbf{w}):=\frac{1}{p}\sum_{(\mathbf{x},y)\in{S}}(\mathbf{x}^\top\mathbf{w}-y)^p</script>
<p>$S$ here stands for a training set, consisting of pairs $(\mathbf{x},y)$ where $\mathbf{x}$ is a vector representing an instance and $y$ is a (numeric) scalar standing for its label; $\mathbf{w}$ is the parameter vector we wish to learn. Let’s convert the linear model to an extremely simple “depth-2 network”, by replacing the vector $\mathbf{w}$ with a vector $\mathbf{w_1}$ times a scalar $\omega_2$. Clearly, this is an overparameterization that does not change expressiveness, but yields the (non-convex) objective:</p>
<script type="math/tex; mode=display">\min_{\mathbf{w_1},\omega_2}~L(\mathbf{w_1},\omega_2):=\frac{1}{p}\sum_{(\mathbf{x},y)\in{S}}(\mathbf{x}^\top\mathbf{w_1}\omega_2-y)^p</script>
<p>We show in the paper, that if one applies gradient descent over $\mathbf{w_1}$ and $\omega_2$, with small learning rate and near-zero initialization (as customary in deep learning), the induced dynamics on the overall (<em>end-to-end</em>) model $\mathbf{w}=\mathbf{w_1}\omega_2$ can be written as follows:</p>
<script type="math/tex; mode=display">\mathbf{w}^{(t+1)}\leftarrow\mathbf{w}^{(t)}-\rho^{(t)}\nabla{L}(\mathbf{w}^{(t)})-\sum_{\tau=1}^{t-1}\mu^{(t,\tau)}\nabla{L}(\mathbf{w}^{(\tau)})</script>
<p>where $\rho^{(t)}$ and $\mu^{(t,\tau)}$ are appropriately defined (time-dependent) coefficients.
Thus the seemingly benign addition of a single multiplicative scalar turned plain gradient descent into a scheme that somehow has a memory of past gradients —the key feature of <a href="https://distill.pub/2017/momentum/">momentum</a> methods— as well as a time-varying learning rate. While theoretical analysis of the precise benefit of momentum methods is never easy, a simple experiment with $p=4$, on <a href="https://archive.ics.uci.edu/ml/index.php">UCI Machine Learning Repository</a>’s <a href="https://archive.ics.uci.edu/ml/datasets/gas+sensor+array+drift+dataset">“Gas Sensor Array Drift at Different Concentrations” dataset</a>, shows the following effect:</p>
<p style="text-align:center;">
<img src="/assets/acc_oprm_L4_exp.png" width="40%" alt="L4 regression experiment" />
</p>
<p>Not only did the overparameterization accelerate gradient descent, but it has done so more than two well-known, explicitly designed acceleration methods – <a href="http://www.jmlr.org/papers/volume12/duchi11a/duchi11a.pdf">AdaGrad</a> and <a href="https://arxiv.org/pdf/1212.5701.pdf">AdaDelta</a> (the former did not really provide a speedup in this experiment). We observed similar speedups in other settings as well.</p>
<p>What is happening here? Can non-convex objectives corresponding to deep networks be easier to optimize than convex ones?
Is this phenomenon common or is it limited to toy problems as above?
We take a first crack at addressing these questions…</p>
<h2 id="overparameterization-decoupling-optimization-from-expressiveness">Overparameterization: Decoupling Optimization from Expressiveness</h2>
<p>A general study of the effect of depth on optimization entails an inherent difficulty - deeper networks may seem to converge faster due to their superior expressiveness.
In other words, if optimization of a deep network progresses more rapidly than that of a shallow one, it may not be obvious whether this is a result of a true acceleration phenomenon, or simply a byproduct of the fact that the shallow model cannot reach the same loss as the deep one.
We resolve this conundrum by focusing on models whose representational capacity is oblivious to depth - <em>linear neural networks</em>, the subject of many recent studies.
With linear networks, adding layers does not alter expressiveness; it manifests itself only in the replacement of a matrix parameter by a product of matrices - an overparameterization.
Accordingly, if this leads to accelerated convergence, one can be certain that it is not an outcome of any phenomenon other than favorable properties of depth for optimization.</p>
<h2 id="implicit-dynamics-of-depth">Implicit Dynamics of Depth</h2>
<p>Suppose we are interested in learning a linear model parameterized by a matrix $W$, through minimization of some training loss $L(W)$.
Instead of working directly with $W$, we replace it by a depth $N$ linear neural network, i.e. we overparameterize it as $W=W_{N}W_{N-1}\cdots{W_1}$, with $W_j$ being weight matrices of individual layers.
In the paper we show that if one applies gradient descent over $W_{1}\ldots{W}_N$, with small learning rate $\eta$, and with the condition:</p>
<script type="math/tex; mode=display">W_{j+1}^\top W_{j+1} = W_j W_j^\top</script>
<p>satisfied at optimization commencement (note that this approximately holds with standard near-zero initialization), the dynamics induced on the overall end-to-end mapping $W$ can be written as follows:</p>
<script type="math/tex; mode=display">W^{(t+1)}\leftarrow{W}^{(t)}-\eta\sum_{j=1}^{N}\left[W^{(t)}(W^{(t)})^\top\right]^\frac{j-1}{N}\nabla{L}(W^{(t)})\left[(W^{(t)})^\top{W}^{(t)}\right]^\frac{N-j}{N}</script>
<p>We validate empirically that this analytically derived update rule (over classic linear model) indeed complies with deep network optimization, and take a series of steps to theoretically interpret it.
We find that the transformation applied to the gradient $\nabla{L}(W)$ (multiplication from the left by $[WW^\top]^\frac{j-1}{N}$, and from the right by $[W^\top{W}]^\frac{N-j}{N}$, followed by summation over $j$) is a particular preconditioning scheme, that promotes movement along directions already taken by optimization.
More concretely, the preconditioning can be seen as a combination of two elements:</p>
<ul>
<li>an adaptive learning rate that increases step sizes away from initialization; and</li>
<li>a “momentum-like” operation that stretches the gradient along the azimuth taken so far.</li>
</ul>
<p>An important point to make is that the update rule above, referred to hereafter as the <em>end-to-end update rule</em>, does not depend on widths of hidden layers in the linear neural network, only on its depth ($N$).
This implies that from an optimization perspective, overparameterizing using wide or narrow networks has the same effect - it is only the number of layers that matters.
Therefore, acceleration by depth need not be computationally demanding - a fact we clearly observe in our experiments (previous figure for example shows acceleration by orders of magnitude at the price of a single extra scalar parameter).</p>
<p style="text-align:center;">
<img src="/assets/acc_oprm_update_rule.png" width="60%" alt="End-to-end update rule" />
</p>
<h2 id="beyond-regularization">Beyond Regularization</h2>
<p>The end-to-end update rule defines an optimization scheme whose steps are a function of the gradient $\nabla{L}(W)$ and the parameter $W$.
As opposed to many acceleration methods (e.g. <a href="https://distill.pub/2017/momentum/">momentum</a> or <a href="https://arxiv.org/pdf/1412.6980.pdf">Adam</a>) that explicitly maintain auxiliary variables, this scheme is memoryless, and by definition born from gradient descent over something (overparameterized objective).
It is therefore natural to ask if we can represent the end-to-end update rule as gradient descent over some regularization of the loss $L(W)$, i.e. over some function of $W$.
We prove, somewhat surprisingly, that the answer is almost always negative - as long as the loss $L(W)$ does not have a critical point at $W=0$, the end-to-end update rule, i.e. the effect of overparameterization, cannot be attained via <em>any</em> regularizer.</p>
<h2 id="acceleration">Acceleration</h2>
<p>So far, we analyzed the effect of depth (in the form of overparameterization) on optimization by presenting an equivalent preconditioning scheme and discussing some of its properties.
We have not, however, provided any theoretical evidence in support of acceleration (faster convergence) resulting from this scheme.
Full characterization of the scenarios in which there is a speedup goes beyond the scope of our paper.
Nonetheless, we do analyze a simple $\ell_p$ regression problem, and find that whether or not increasing depth accelerates depends on the choice of $p$:
for $p=2$ (square loss) adding layers does not lead to a speedup (in accordance with previous findings by <a href="https://arxiv.org/pdf/1312.6120.pdf">Saxe et al. 2014</a>);
for $p>2$ it can, and this may be attributed to the preconditioning scheme’s ability to handle large plateaus in the objective landscape.
A number of experiments, with $p$ equal to 2 and 4, and depths ranging between 1 (classic linear model) and 8, support this conclusion.</p>
<h2 id="non-linear-experiment">Non-Linear Experiment</h2>
<p>As a final test, we evaluated the effect of overparameterization on optimization in a non-idealized (yet simple) deep learning setting - the <a href="https://github.com/tensorflow/models/tree/master/tutorials/image/mnist">convolutional network tutorial for MNIST built into TensorFlow</a>.
We introduced overparameterization by simply placing two matrices in succession instead of the matrix in each dense layer.
With an addition of roughly 15% in number of parameters, optimization accelerated by orders of magnitude:</p>
<p style="text-align:center;">
<img src="/assets/acc_oprm_cnn_exp.png" width="40%" alt="TensorFlow MNIST CNN experiment" />
</p>
<p>We note that similar experiments on other convolutional networks also gave rise to a speedup, but not nearly as prominent as the above.
Empirical characterization of conditions under which overparameterization accelerates optimization in non-linear settings is potentially an interesting direction for future research.</p>
<h2 id="conclusion">Conclusion</h2>
<p>Our work provides insight into benefits of depth in the form of overparameterization, from the perspective of optimization.
Many open questions and problems remain.
For example, is it possible to rigorously analyze the acceleration effect of the end-to-end update rule (analogously to, say, <a href="http://www.cis.pku.edu.cn/faculty/vision/zlin/1983-A%20Method%20of%20Solving%20a%20Convex%20Programming%20Problem%20with%20Convergence%20Rate%20O(k%5E(-2))_Nesterov.pdf">Nesterov 1983</a> or <a href="http://www.jmlr.org/papers/volume12/duchi11a/duchi11a.pdf">Duchi et al. 2011</a>)?
Treatment of non-linear deep networks is of course also of interest, as well as more extensive empirical evaluation.</p>
<p><a href="http://www.cohennadav.com/">Nadav Cohen</a></p>
Fri, 02 Mar 2018 05:00:00 -0800
http://offconvex.github.io/2018/03/02/acceleration-overparameterization/
http://offconvex.github.io/2018/03/02/acceleration-overparameterization/Proving generalization of deep nets via compression<p>This post is about <a href="https://arxiv.org/abs/1802.05296">my new paper with Rong Ge, Behnam Neyshabur, and Yi Zhang</a> which offers some new perspective into the generalization mystery for deep nets discussed in
<a href="http://www.offconvex.org/2017/12/08/generalization1/">my earlier post</a>. The new paper introduces an elementary compression-based framework for proving generalization bounds. It shows that deep nets are highly noise stable, and consequently, compressible. The framework also gives easy proofs (sketched below) of some papers that appeared in the past year.</p>
<p>Recall that the <strong>basic theorem</strong> of generalization theory says something like this: if training set had $m$ samples then the <em>generalization error</em> —defined as the difference between error on training data and test data (aka held out data)— is of the order of $\sqrt{N/m}$. Here $N$ is the number of <em>effective parameters</em> (or <em>complexity measure</em>) of the net; it is at most the actual number of trainable parameters but could be much less. (For ease of exposition this post will ignore nuisance factors like $\log N$ etc. which also appear in the these calculations.) The mystery is that networks with millions of parameters have low generalization error even when $m =50K$ (as in CIFAR10 dataset), which suggests that the number of true parameters is actually much less than $50K$. The papers <a href="https://arxiv.org/abs/1706.08498">Bartlett et al. NIPS’17</a> and <a href="https://openreview.net/forum?id=Skz_WfbCZ">Neyshabur et al. ICLR’18</a>
try to quantify the complexity measure using very interesting ideas like Pac-Bayes and Margin (which influenced our paper). But ultimately the quantitative estimates are fairly vacuous —orders of magnitude <em>more</em> than the number of <em>actual parameters.</em> By contrast our new estimates are several orders of magnitude better, and on the verge of being interesting. See the following bar graph on a log scale. (All bounds are listed ignoring “nuisance factors.” Number of trainable parameters is included only to indicate scale.)</p>
<p style="text-align:center;">
<img src="/assets/saddle_eff/acompare.png" width="75%" alt="comparison of bounds from various recent papers" />
</p>
<h2 id="the-compression-approach">The Compression Approach</h2>
<p>The compression approach takes a deep net $C$ with $N$ trainable parameters and tries to compress it to another one $\hat{C}$ that has (a) much fewer parameters $\hat{N}$ than $C$ and (b) has roughly the same training error as $C$.</p>
<p>Then the above basic theorem guarantees that so long as the number of training samples exceeds $\hat{N}$, then $\hat{C}$ <em>does</em> generalize well (even if $C$ doesn’t). An extension of this approach says that the same conclusions holds if we let the compression algorithm to depend upon an arbitrarily long <em>random string</em> provided this string is fixed in advance of seeing the training data. We call this <em>compression with respect to fixed string</em> and rely upon it.</p>
<p>Note that the above approach proves good generalization of the compressed $\hat{C}$, not the original $C$. (I suspect the ideas may extend to proving good generalization of the original $C$; the hurdles seem technical rather than inherent.) Something similar was true of earlier approaches using PAC-Bayes bounds, which also prove the generalization of some net related to $C$, not of $C$ itself. (Hence the tongue-in-cheek title of the classic reference <a href="http://www.cs.cmu.edu/~jcl/papers/nn_bound/not_bound.pdf">Langford-Caruana2002</a>.)</p>
<p>Of course, in practice deep nets are well-known to be compressible using a slew of ideas—by factors of 10x to 100x; see <a href="https://arxiv.org/abs/1710.09282">the recent survey</a>. However, usually such compression involves <em>retraining</em> the compressed net. Our paper doesn’t consider retraining the net (since it involves reasoning about the loss landscape) but followup work should look at this.</p>
<h2 id="flat-minima-and-noise-stability">Flat minima and Noise Stability</h2>
<p>Modern generalization results can be seen as proceeding via some formalization of a <em>flat minimum</em> of the loss landscape. This was suggested in 1990s as the source of good generalization <a href="http://www.bioinf.jku.at/publications/older/3304.pdf">Hochreiter and Schmidhuber 1995</a>. Recent empirical work of <a href="https://arxiv.org/abs/1609.04836">Keskar et al 2016</a> on modern deep architectures finds that flatness does correlate with better generalization, though the issue is complicated, as discussed in an upcoming post by Behnam Neyshabur.</p>
<p style="text-align:center;">
<img src="/assets/saddle_eff/aflatminima.png" width="65%" alt="Flat vs sharp minima" />
</p>
<p>Here’s the intuition why a flat minimum should generalize better, as originally articulated by <a href="http://www.cs.toronto.edu/~fritz/absps/colt93.pdf">Hinton and Camp 1993</a>. Crudely speaking, suppose a flat minimum is one that occupies “volume” $\tau$ in the landscape. (The flatter the minimum, the higher $\tau$ is.) Then the number of <em>distinct</em> flat minima in the landscape is at most $S =\text{total volume}/\tau$. Thus one can number the flat minima from $1$ to $S$, implying that a flat minimum can be represented using $\log S$ bits. The above-mentioned <em>basic theorem</em> implies that flat minima generalize if the number of training samples $m$ exceeds $\log S$.</p>
<p>PAC-Bayes approaches try to formalize the above intuition by defining a flat minimum as follows: it is a net $C$ such that adding appropriately-scaled gaussian noise to all its trainable parameters does not greatly affect the training error. This allows quantifying the “volume” above in terms of probability/measure (see
<a href="http://www.cs.princeton.edu/courses/archive/fall17/cos597A/lecnotes/generalize.pdf">my lecture notes</a> or <a href="https://arxiv.org/abs/1703.11008">Dziugaite-Roy</a>) and yields some explicit estimates on sample complexity. However, obtaining good quantitative estimates from this calculation has proved difficut, as seen in the bar graph earlier.</p>
<p>We formalize “flat minimum” using noise stability of a slightly different form. Roughly speaking, it says that if we inject appropriately scaled gaussian noise at the output of some layer, then this noise gets attenuated as it propagates up to higher layers. (Here “top” direction refers to the output of the net.) This is obviously related to notions like dropout, though it arises also in nets that are not trained with dropout. The following figure illustrates how noise injected at a certain layer of VGG19 (trained on CIFAR10) affects the higher layer. The y-axis denote the magnitude of the noise ($\ell_2$ norm) as a multiple of the vector being computed at the layer, and shows how a single noise vector quickly attenuates as it propagates up the layers.</p>
<p style="text-align:center;">
<img src="/assets/saddle_eff/attenuate.jpg" width="65%" alt="How noises attenuates as it travels up the layers of VGG." />
</p>
<p>Clearly, computation of the trained net is highly resistant to noise. (This has obvious implications for biological neural nets…)
Note that the training involved no explicit injection of noise (eg dropout). Of course, stochastic gradient descent <em>implicitly</em> adds noise to the gradient, and it would be nice to investigate more rigorously if the noise stability arises from this or from some other source.</p>
<h2 id="noise-stability-and-compressibility-of-single-layer">Noise stability and compressibility of single layer</h2>
<p>To understand why noise-stable nets are compressible, let’s first understand noise stability for a single layer in the net, where we ignore the nonlinearity. Then this layer is just a linear transformation, i.e., matrix $M$.</p>
<p style="text-align:center;">
<img src="/assets/saddle_eff/alinear.png" width="40%" alt="matrix M describing a single layer" />
</p>
<p>What does it mean that this matrix’s output is stable to noise? Suppose the vector at the previous layer is a unit vector $x$. This is the output of the lower layers on an actual sample, so $x$ can be thought of as the “signal” for the current layer. The matrix converts $x$ into $Mx$. If we inject a noise vector $\eta$ of unit norm at $x$ then the output must become $M(x +\eta)$. We say $M$ is noise stable for input $x$ if such noising affects the output very little, which implies the norm of $Mx$ is much higher than that of $M \eta$.
The former is at most $\sigma_{max}(M)$, the largest singular value of $M$. The latter is approximately
$(\sum_i \sigma_i(M)^2)^{1/2}/\sqrt{h}$ where $\sigma_i(M)$ is the $i$th singular value of $M$ and $h$ is dimension of $Mx$. The reason is that gaussian noise divides itself evenly across all directions, with variance in each direction $1/h$.
We conclude that:
<script type="math/tex">(\sigma_{max}(M))^2 \gg \frac{1}{h} \sum_i (\sigma_i(M)^2),</script></p>
<p>which implies that the matrix has an uneven distribution of singular values. Ratio of left side and right side is called the <a href="https://nickhar.wordpress.com/2012/02/29/lecture-15-low-rank-approximation-of-matrices/"><em>stable rank</em></a> and is at most the linear algebraic rank. Furthermore, the above analysis suggests that the “signal” $x$ is <em>correlated</em> with the singular directions corresponding to the higher singular values, which is at the root of the noise stability.</p>
<p>Our experiments on VGG and GoogleNet reveal that the higher layers of deep nets—where most of the net’s parameters reside—do indeed exhibit a highly uneven distribution of singular values, and that the signal aligns more with the higher singular directions. The figure below describes layer 10 in VGG19 trained on CIFAR10.</p>
<p style="text-align:center;">
<img src="/assets/saddle_eff/aspectrumlayer10.png" width="45%" alt="distribution of singular values of matrix at layer 10 of VGG19" />
</p>
<h2 id="compressing-multilayer-net">Compressing multilayer net</h2>
<p>The above analysis of noise stability in terms of singular values cannot hold across multiple layers of a deep net, because the mapping becomes nonlinear, thus lacking a notion of singular values. Noise stability is therefore formalized using the <a href="https://en.wikipedia.org/wiki/Jacobian_matrix_and_determinant">Jacobian</a> of this mapping, which is the matrix describing how the output reacts to tiny perturbations of the input. Noise stability says that this nonlinear mapping passes signal (i.e., the vector from previous layers) much more strongly than it does a noise vector.</p>
<p>Our compression algorithm applies a randomized transformation to the matrix of each layer (aside: note the use of randomness, which fits in our “compressing with fixed string” framework) that relies on the low stable rank condition at each layer. This compression introduces error in the layer’s output, but the vector describing this error is “gaussian-like” due to the use of randomness in the compression. Thus this error gets attenuated by higher layers.</p>
<p>Details can be found in the paper. All noise stability properties formalized there are later checked in the experiments section.</p>
<h2 id="simpler-proofs-of-existing-generalization-bounds">Simpler proofs of existing generalization bounds</h2>
<p>In the paper we also use our compression framework to give elementary (say, 1-page) proofs of the previous generalization bounds from the past year. For example, the paper of <a href="https://openreview.net/forum?id=Skz_WfbCZ">Neyshabur et al.</a> shows the following is an upper bound on the generalization error where $A_i$ is the matrix describing the $i$th layer.</p>
<p style="text-align:center;">
<img src="/assets/saddle_eff/aexpression1.png" width="50%" alt="Expression for effective number of parameters in Neyshabur et al" />
</p>
<p>Comparing to the <em>basic theorem</em>, we realize the numerator corresponds to the number of effective parameters. The second part of the expression is the sum of stable ranks of the layer matrices, and is a natural measure of complexity. The first part is product of spectral norms (= top singular value) of the layer matrices, which happens to be an upper bound on the Lipschitz constant of the entire network. (Lipschitz constant of a mapping $f$ in this context is a constant $L$ such that $f(x) \leq L c\dot |x|$.)
The reason this is the Lipschitz constant is that if an input $x$ is presented at the bottom of the net, then each successive layer can multiply its norm by at most the top singular value, and the ReLU nonlinearity can only decrease norm since its only action is to zero out some entries.</p>
<p>Having decoded the above expression, it is clear how to interpret it as an analysis of a (deterministic) compression of the net. Compress each layer by zero-ing out (in the <a href="https://en.wikipedia.org/wiki/Singular-value_decomposition">SVD</a>) singular values less than some threshold $t|A|$, which we hope turns it into a low rank matrix. (Recall that a matrix with rank $r$ can be expressed using $2nr$ parameters.) A simple computation shows that the number of remaining singular values is at most the stable rank divided by $t^2$. How do we set $t$? The truncation introduces error in the layer’s computation, which gets propagated through the higher layers and magnified at most by the Lipschitz constant. We want to make this propagated error small, which can be done by making $t$ inversely proportional to the Lipschitz constant. This leads to the above bound on the number of effective parameters.</p>
<p>This proof sketch also clarifies how our work improves upon the older works: they are also (implicitly) compressing the deep net, but their analysis of how much compression is possible is much more pessimistic because they assume the network transmits noise at peak efficiency given by the Lipschitz constant.</p>
<h2 id="extending-the-ideas-to-convolutional-nets">Extending the ideas to convolutional nets</h2>
<p>Convolutional nets could not be dealt with cleanly in the earlier papers. I must admit that handling convolution stumped us as too for a while. A layer in a convolutional net applies the same filter to all patches in that layer. This <em>weight sharing</em> means that the full layer matrix already has a fairly compact representation, and it seems challenging to compress this further. However, in nets like VGG and GoogleNet, the higher layers use rather large filter matrices (i.e., they use a large number of channels), and one could hope to compress these individual filter matrices.</p>
<p>Let’s discuss the two naive ideas. The first is to compress the filter independently in different patches. This unfortunately is not a compression at all, since each copy of the filter then comes with its own parameters. The second idea is to do a single compression of the filter and use the compressed copy in each patch. This messes up the error analysis because the errors introduced due to compression in the different copies are now correlated, whereas the analysis requires them to be more like gaussian.</p>
<p>The idea we end up using is to compress the filters using $k$-wise independence (an idea from <a href="https://en.wikipedia.org/wiki/K-independent_hashing">theory of hashing schemes</a>), where $k$ is roughly logarithmic in the number of training samples.</p>
<h2 id="concluding-thoughts">Concluding thoughts</h2>
<p>While generalization theory can seem merely academic at times —since in practice held-out data establishes generalizaton— I hope you see from the above account that understanding generalization can give some interesting insights into what is going on in deep net training. Insights about noise stability of trained deep nets have obvious interest for study of biological neural nets. (See also the classic <a href="http://fab.cba.mit.edu/classes/862.16/notes/computation/vonNeumann-1956.pdf">von Neumann ideas</a> on noise resilient computation.)</p>
<p>At the same time, I suspect that compressibility is only one part of the generalization mystery, and that we are still missing some big idea. I don’t see how to use the above ideas to demonstrate that the effective number of parameters in VGG19 is as low as $50k$, as seems to be the case. I suspect doing so will force us to understand the structure of the data (in this case, real-life images) which the above analysis mostly ignores. The only property of data used is that the deep net aligns itself better with data than with noise.</p>
Sat, 17 Feb 2018 08:00:00 -0800
http://offconvex.github.io/2018/02/17/generalization2/
http://offconvex.github.io/2018/02/17/generalization2/Generalization Theory and Deep Nets, An introduction<p>Deep learning holds many mysteries for theory, as we have discussed on this blog. Lately many ML theorists have become interested in the generalization mystery: why do trained deep nets perform well on previously unseen data, even though they have way more free parameters than the number of datapoints (the classic “overfitting” regime)? Zhang et al.’s paper <a href="https://arxiv.org/abs/1611.03530">Understanding Deep Learning requires Rethinking Generalization</a> played some role in bringing attention to this challenge. Their main experimental finding is that if you take a classic convnet architecture, say <a href="https://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf">Alexnet</a>, and train it on images with random labels, then you can still achieve very high accuracy on the training data. (Furthermore, usual regularization strategies, which are believed to promote better generalization, do not help much.) Needless to say, the trained net is subsequently unable to predict the (random) labels of still-unseen images, which means it doesn’t generalize. The paper notes that the ability to fit a classifier to data with random labels is also a traditional measure in machine learning called Rademacher complexity (which we will discuss shortly) and thus Rademacher complexity gives no meaningful bounds on sample complexity. I found this paper entertainingly written and recommend reading it, despite having given away the punchline. Congratulations to the authors for winning best paper at ICLR 2017.</p>
<p>But I would be remiss if I didn’t report that at the <a href="https://simons.berkeley.edu/programs/machinelearning2017">Simons Institute Semester on theoretical ML in spring 2017</a> generalization theory experts expressed unhappiness about this paper, and especially its title. They felt that similar issues had been extensively studied in context of simpler models such as kernel SVMs (which, to be fair, is clearly mentioned in the paper). It is trivial to design SVM architectures with high Rademacher complexity which nevertheless train and generalize well on real-life data. Furthermore, theory was developed to explain this generalization behavior (and also for related models like boosting). On a related note, several earlier papers of Behnam Neyshabur and coauthors (see <a href="https://arxiv.org/abs/1605.07154">this paper</a> and for a full account, <a href="https://arxiv.org/abs/1703.11008">Behnam’s thesis</a>)
had made points fairly similar to Zhang et al. pertaining to deep nets.</p>
<p>But regardless of such complaints, we should be happy about the attention brought by Zhang et al.’s paper to a core theory challenge. Indeed, the passionate discussants at the Simons semester themselves banded up in subgroups to address this challenge: these resulted in papers by <a href="https://arxiv.org/abs/1703.11008">Dzigaite and Roy</a>, then <a href="https://arxiv.org/abs/1706.08498">Bartlett, Foster, and Telgarsky</a> and finally <a href="https://arxiv.org/abs/1707.09564">Neyshabur, Bhojapalli, MacAallester, Srebro</a>. (The latter two were presented at NIPS’17 this week.)</p>
<p>Before surveying these results let me start by suggesting that some of the controversy over the title of Zhang et al.’s paper stems from some basic confusion about whether or not current generalization theory is prescriptive or merely descriptive. These confusions arise from the standard treatment of generalization theory in courses and textbooks, as I discovered while teaching the recent developments in <a href="http://www.cs.princeton.edu/courses/archive/fall17/cos597A/">my graduate seminar</a>.</p>
<h3 id="prescriptive-versus-descriptive-theory">Prescriptive versus descriptive theory</h3>
<p>To illustrate the difference, consider a patient who says to his doctor: “Doctor, I wake up often at night and am tired all day.”</p>
<blockquote>
<p>Doctor 1 (without any physical examination): “Oh, you have sleep disorder.”</p>
</blockquote>
<p>I call such a diagnosis <em>descriptive</em>, since it only attaches a label to the patient’s problem, without giving any insight into how to solve the problem. Contrast with:</p>
<blockquote>
<p>Doctor 2 (after careful physical examination): “A growth in your sinus is causing sleep apnea. Removing it will resolve your problems.”</p>
</blockquote>
<p>Such a diagnosis is <em>prescriptive.</em></p>
<h2 id="generalization-theory-descriptive-or-prescriptive">Generalization theory: descriptive or prescriptive?</h2>
<p>Generalization theory notions such as VC dimension, Rademacher complexity, and PAC-Bayes bound, consist of attaching a <em>descriptive label</em> to the basic phenomenon of lack of generalization. They are hard to compute for today’s complicated ML models, let alone to use as a guide in designing learning systems.</p>
<p>Recall what it means for a hypothesis/classifier $h$ to not generalize. Assume the training data consists of a sample $S = {(x_1, y_1), (x_2, y_2),\ldots, (x_m, y_m)}$ of $m$ examples from some distribution ${\mathcal D}$. A <em>loss function</em> $\ell$ describes how well hypothesis $h$ classifies a datapoint: the loss $\ell(h, (x, y))$ is high if the hypothesis didn’t come close to producing the label $y$ on $x$ and low if it came close. (To give an example, the <em>regression</em> loss is $(h(x) -y)^2$.) Now let us denote by $\Delta_S(h)$ the average loss on samplepoints in $S$, and by $\Delta_{\mathcal D}(h)$ the expected loss on samples from distribution ${\mathcal D}$.
Training <em>generalizes</em> if the hypothesis $h$ that minimises $\Delta_S(h)$ for a random sample $S$ also achieves very similarly low loss $\Delta_{\mathcal D}(h)$ on the full distribution. When this fails to happen, we have:</p>
<blockquote>
<p><strong>Lack of generalization:</strong> $\Delta_S(h) \ll \Delta_{\mathcal D}(h) \qquad (1). $</p>
</blockquote>
<p>In practice, lack of generalization is detected by taking a second sample
(“held out set”) $S_2$ of size $m$ from ${\mathcal D}$. By concentration bounds expected loss of $h$ on this second sample closely approximates $\Delta_{\mathcal D}(h)$, allowing us to conclude</p>
<script type="math/tex; mode=display">\Delta_S(h) - \Delta_{S_2}(h) \ll 0 \qquad (2).</script>
<h3 id="generalization-theory-descriptive-parts">Generalization Theory: Descriptive Parts</h3>
<p>Let’s discuss <strong>Rademacher complexity,</strong> which I will simplify a bit for this discussion. (See also <a href="http://www.cs.princeton.edu/courses/archive/fall17/cos597A/lecnotes/generalize.pdf">scribe notes of my lecture</a>.) For convenience assume in this discussion that labels and loss are $0,1$, and
assume that the badly generalizing $h$ predicts perfectly on the training sample $S$ and is completely wrong on the heldout set $S_2$, meaning</p>
<p>$\Delta_S(h) - \Delta_{S_2}(h) \approx - 1 \qquad (3)$</p>
<p>Rademacher complexity concerns the following thought experiment. Take a single sample of size $2m$ from $\mathcal{D}$, split it into two and call the first half $S$ and the second $S_2$. <em>Flip</em> the labels of points in $S_2$. Now try to find a classifier $C$ that best describes this new sample, meaning one that minimizes $\Delta_S(h) + 1- \Delta_{S_2}(h)$. This expression follows since flipping the label of a point turns good classification into bad and vice versa, and thus the loss function for $S_2$ is $1$ minus the old loss. We say the class of classifiers has high Rademacher complexity if with high probability this quantity is small, say close to $0$.</p>
<p>But a glance at (3) shows that it implies high Rademacher complexity: $S, S_2$ were random samples of size $m$ from $\mathcal{D}$, so their combined size is $2m$, and when generalization failed we succeeded in finding a hypothesis $h$ for which $\Delta_S(h) + 1- \Delta_{S_2}(h)$ is very small.</p>
<p>In other words, returning to our medical analogy, the doctor only had to hear “Generalization didn’t happen” to pipe up with: “Rademacher complexity is high.” This is why I call this result descriptive.</p>
<p>The <strong>VC dimension</strong> bound is similarly descriptive. VC dimension is defined to be at least $k +1$ if there exists a set of size $k$ such that the following is true. If we look at all possible classifiers in the class, and the sequence of labels each gives to the $k$ datapoints in the sample, then we can find all possible $2^{k}$ sequences of $0$’s and $1$’s.</p>
<p>If generalization does not happen as in (2) or (3) then this turns out to imply that VC dimension is at least around $\epsilon m$ for some $\epsilon >0$. The reason is that the $2m$ data points were split randomly into $S, S_2$, and there are $2^{2m}$ such splittings. When the generalization error is $\Omega(1)$ this can be shown to imply that we can achieve $2^{\Omega(m)}$ labelings of the $2m$ datapoints using all possible classifiers. Now the classic Sauer’s lemma (see any lecture notes on this topic, such as <a href="https://www.cs.princeton.edu/courses/archive/spring14/cos511/scribe_notes/0220.pdf">Schapire’s</a>) can be used to show that
VC dimension is at least $\epsilon m/\log m$ for some constant $\epsilon>0$.</p>
<p>Thus again, the doctor only has to hear “Generalization didn’t happen with sample size $m$” to pipe up with: “VC dimension is higher than $\Omega(m/log m)$.”</p>
<p>One can similarly show that PAC-Bayes bounds are also descriptive, as you can see in <a href="http://www.cs.princeton.edu/courses/archive/fall17/cos597A/lecnotes/generalize.pdf">scribe notes from my lecture</a>.</p>
<blockquote>
<p>Why do students get confused and think that such tools of generalization theory gives some powerful technique to guide design of machine learning algorithms?</p>
</blockquote>
<p>Answer: Probably because standard presentation in lecture notes and textbooks seems to pretend that we are computationally-omnipotent beings who can <em>compute</em> VC dimension and Rademacher complexity and thus arrive at meaningful bounds on sample sizes needed for training to generalize. While this may have been possible in the old days with simple classifiers, today we have
complicated classifiers with millions of variables, which furthermore are products of nonconvex optimization techniques like backpropagation.
The only way to actually lowerbound Rademacher complexity of such complicated learning architectures is to try training a classifier, and detect lack of generalization via a held-out set. Every practitioner in the world already does this (without realizing it), and kudos to Zhang et al. for highlighting that theory currently offers nothing better.</p>
<h2 id="toward-a-prescriptive-generalization-theory-the-new-papers">Toward a prescriptive generalization theory: the new papers</h2>
<p>In our medical analogy we saw that the doctor needs to at least do a physical examination to have a prescriptive diagnosis. The authors of the new papers intuitively grasp this point, and try to identify properties of real-life deep nets that may lead to better generalization. Such an analysis (related to “margin”) was done for simple 2-layer networks couple decades ago, and the challenge is to find analogs for multilayer networks. Both Bartlett et al. and Neyshabur et al. hone in on <a href="https://nickhar.wordpress.com/2012/02/29/lecture-15-low-rank-approximation-of-matrices/"><em>stable rank</em></a> of the weight matrices of the layers of the deep net. These can be seen as an instance of a “flat minimum” which has been discussed in <a href="http://www.bioinf.jku.at/publications/older/3304.pdf">neural nets literature</a> for many years. I will present my take on these results as well as some improvements in a future post. Note that these methods do not as yet give any nontrivial bounds on the number of datapoints needed for training the nets in question.</p>
<p><a href="https://arxiv.org/abs/1703.11008">Dziugaite and Roy</a> take a slightly different tack. They start with McAllester’s 1999 PAC-Bayes bound, which says that if the algorithm’s prior distribution on the hypotheses is $P$ then for every posterior distributions $Q$ (which could depend on the data) on the hypotheses the generalization error of the average classifier picked according to $Q$ is upper bounded as follows where $D()$ denotes <a href="https://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence">KL divergence</a>:</p>
<div style="text-align:center;">
<img style="width:600px;" src="http://www.cs.princeton.edu/courses/archive/fall17/cos597A/lecnotes/pacbayes.png" />
</div>
<p>This allows upperbounds on generalization error (specifically, upperbounds on number of samples that guarantee such an upperbound) by proceeding as in <a href="http://www.cs.cmu.edu/~jcl/papers/nn_bound/not_bound.pdf">Langford and Caruana’s old paper</a> where $P$ is a uniform gaussian, and $Q$ is a noised version of the trained deep net (whose generalization we are trying to explain). Specifically, if $w_{ij}$ is the weight of edge ${i, j}$ in the trained net, then $Q$ consists of adding a gaussian noise $\eta_{ij}$ to weight $w_{ij}$. Thus a random classifier according to $Q$ is nothing but a noised version of the trained net. Now we arrive at the crucial idea: Use nonconvex optimization to find a choice for the variance of $\eta_{ij}$ that balances two competing criteria: (a) the average classifier drawn from $Q$ has training error not much more than the original trained net (again, this is a quantification of the “flatness” of the minimum found by the optimization) and (b) the right hand side of the above expression is as small as possible. Assuming (a) and (b) can be suitably bounded, it follows that the average classifier from Q works reasonably well on unseen data. (Note that this method only proves generalization of a noised version of the trained classifier.)</p>
<p>Applying this method on simple fully-connected neural nets trained on MNIST dataset, they can prove that the method achieves error $17$ percent error on MNIST (whereas the <em>actual</em> error is much lower at 2-3 percent). Hence the title of their paper, which promises <em>nonvacuous generalization bounds.</em> What I find most interesting about this result is that it uses the power of nonconvex optimization (harnessed above to find a suitable noised distribution $Q$) to cast light on one of the metaquestions about nonconvex optimization, namely, why does deep learning not overfit!</p>
Fri, 08 Dec 2017 10:00:00 -0800
http://offconvex.github.io/2017/12/08/generalization1/
http://offconvex.github.io/2017/12/08/generalization1/