Off the convex path: Algorithms off the convex path.
http://offconvex.github.io/

Does Gradient Flow Over Neural Networks Really Represent Gradient Descent?

<h2 id="tldr">TL;DR</h2>
<p>A lot was said in this blog (cf. <a href="http://www.offconvex.org/2019/06/03/trajectories/">post by Sanjeev</a>) about the importance of studying trajectories of gradient descent (<strong><em>GD</em></strong>) for understanding deep learning.
Researchers often conduct such studies by considering <em>gradient flow</em> (<strong><em>GF</em></strong>), equivalent to GD with infinitesimally small step size.
Much was learned from analyzing GF over neural networks (<strong><em>NNs</em></strong>), but to what extent <strong><em>do results for GF apply to GD with practical step size?</em></strong>
This is an open question in deep learning theory.
My student Omer Elkabetz and I investigated it in a recent NeurIPS 2021 spotlight <a href="https://openreview.net/forum?id=iX0TSH45eOd">paper</a>.
In a nutshell, we found that, although in general an exponentially small step size is required for guaranteeing that GD is well represented by GF, specifically over NNs, a much larger step size can suffice.
This allows <strong><em>immediate translation of analyses for GF over NNs to results for GD</em></strong>.
The translation bears potential to shed light on both optimization and generalization (implicit regularization) in deep learning, and indeed, we exemplify its use for proving what is, to our knowledge, the <strong><em>first guarantee of random near-zero initialization almost surely leading GD over a deep (three or more layer) NN of fixed size to efficiently converge to global minimum</em></strong>.
The remainder of this post provides more details; for the full story see our <a href="https://openreview.net/forum?id=iX0TSH45eOd">paper</a>.</p>
<h2 id="gf-a-continuous-surrogate-for-gd">GF: a continuous surrogate for GD</h2>
<p>Let $f : \mathbb{R}^d \to \mathbb{R}$ be an objective function (e.g. training loss of deep NN) that we would like to minimize via GD with step size $\eta > 0$:
[
\boldsymbol\theta_{k + 1} = \boldsymbol\theta_k - \eta \nabla f ( \boldsymbol\theta_k ) ~ ~ ~ \text{for} ~ k = 0 , 1 , 2 , …
\qquad \color{green}{\text{(GD)}}
]
We may imagine a continuous curve $\boldsymbol\theta : [ 0 , \infty ) \to \mathbb{R}^d$ that passes through the $k$’th GD iterate at time $t = k \eta$.
This would imply $\boldsymbol\theta ( t + \eta ) = \boldsymbol\theta ( t ) - \eta \nabla f ( \boldsymbol\theta ( t ) )$, which we can write as $\frac{1}{\eta} ( \boldsymbol\theta ( t + \eta ) - \boldsymbol\theta ( t ) ) = - \nabla f ( \boldsymbol\theta ( t ) )$.
In the limit of infinitesimally small step size ($\eta \to 0$), we obtain the following characterization for $\boldsymbol\theta ( \cdot )$:
[
\frac{d}{dt} \boldsymbol\theta ( t ) = - \nabla f ( \boldsymbol\theta ( t ) ) ~ ~ ~ \text{for} ~ t \geq 0
.
\qquad \color{blue}{\text{(GF)}}
]
This differential equation is known as GF, and represents a continuous surrogate for GD.</p>
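<p>As a quick numeric sanity check of this correspondence, one can use a very fine-grained Euler discretization as a stand-in for GF and measure how far GD with a practical step size strays from it. The quadratic objective, step sizes, and time horizon below are arbitrary illustrative choices:</p>

```python
import numpy as np

# Convex quadratic f(theta) = 0.5 * theta^T A theta; A, the step sizes,
# and the horizon t_end are arbitrary illustrative choices.
A = np.array([[2.0, 0.5], [0.5, 1.0]])
grad = lambda theta: A @ theta

def gd(theta0, eta, n_steps):
    """Run gradient descent and return the final iterate."""
    theta = np.asarray(theta0, dtype=float)
    for _ in range(n_steps):
        theta = theta - eta * grad(theta)
    return theta

theta0, t_end = np.array([1.0, -1.0]), 2.0

# GF stand-in: Euler integration with a very fine step size.
gf_end = gd(theta0, 1e-4, int(t_end / 1e-4))

# Distance of GD from GF at the same continuous time t = k * eta.
errs = {eta: np.linalg.norm(gd(theta0, eta, int(t_end / eta)) - gf_end)
        for eta in (0.1, 0.01)}
print(errs)  # shrinking the step size drives GD toward GF
```

<p>On this convex objective the gap shrinks roughly linearly with $\eta$, matching the intuition that GD is a first-order discretization of GF.</p>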
<div style="text-align:center;">
<img style="width:540px;" src="http://www.offconvex.org/assets/gf-gd-illustration.png" />
<br />
<i><b>Figure 1:</b>
Illustration of GD and its continuous surrogate GF.
</i>
</div>
<p><br /></p>
<p>GF brings forth the possibility of employing a vast array of continuous mathematical machinery for studying GD.
It is for this reason that GF has become a popular model in deep learning theory (see our <a href="https://openreview.net/forum?id=iX0TSH45eOd">paper</a> for a long list of works analyzing GF over NNs).
There is only one problem: GF assumes infinitesimal step size for GD!
This is of course impractical, leading to the following open question $-$ the topic of our <a href="https://openreview.net/forum?id=iX0TSH45eOd">work</a>:</p>
<blockquote>
<p><strong>Open Question:</strong>
Does GF over NNs represent GD with practical step size?</p>
</blockquote>
<h2 id="gd-as-numerical-integrator-for-gf">GD as numerical integrator for GF</h2>
<p>The question of proximity between GF and GD (with positive step size) is closely related to an area of numerical analysis known as <em>numerical integration</em>.
There, the motivation is the other way around $-$ given a differential equation $\frac{d}{dt} \boldsymbol\theta ( t ) = \boldsymbol g ( \boldsymbol\theta ( t ) )$ induced by some vector field $\boldsymbol g : \mathbb{R}^d \to \mathbb{R}^d$, the interest lies in a continuous solution $\boldsymbol\theta : [ 0 , \infty ) \to \mathbb{R}^d$, and numerical (discrete) integration algorithms are used for obtaining approximations.
A classic numerical integrator, known as <em>Euler’s method</em>, is given by $\boldsymbol\theta_{k + 1} = \boldsymbol\theta_k + \eta \boldsymbol g ( \boldsymbol\theta_k )$ for $k = 0 , 1 , 2 , …$, where $\eta > 0$ is a predetermined step size.
With Euler’s method, the goal is for the $k$’th iterate to approximate the sought-after continuous solution at time $k \eta$, meaning $\boldsymbol\theta_k \approx \boldsymbol\theta ( k \eta )$.
Notice that when the vector field $\boldsymbol g ( \cdot )$ is chosen to be minus the gradient of an objective function $f : \mathbb{R}^d \to \mathbb{R}$, i.e. $\boldsymbol g ( \cdot ) = - \nabla f ( \cdot )$, the given differential equation is none other than GF, and its numerical integration via Euler’s method yields none other than GD!
We can therefore employ known results from numerical integration to bound the distance between GF and GD!</p>
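<p>To make the reduction concrete, here is a minimal NumPy sketch (with an arbitrary illustrative objective) verifying that Euler's method applied to the vector field $\boldsymbol g = - \nabla f$ produces exactly the GD iterates:</p>

```python
import numpy as np

def euler(g, theta0, eta, n_steps):
    """Euler's method for d(theta)/dt = g(theta)."""
    traj = [np.asarray(theta0, dtype=float)]
    for _ in range(n_steps):
        traj.append(traj[-1] + eta * g(traj[-1]))
    return np.array(traj)

# Illustrative objective f(theta) = 0.5 * ||theta||^2, so grad f(theta) = theta.
grad_f = lambda theta: theta

# Euler's method on the vector field g = -grad f ...
euler_traj = euler(lambda th: -grad_f(th), [1.0, 2.0], eta=0.1, n_steps=50)

# ... versus plain GD on f with the same step size.
gd_traj = [np.array([1.0, 2.0])]
for _ in range(50):
    gd_traj.append(gd_traj[-1] - 0.1 * grad_f(gd_traj[-1]))
gd_traj = np.array(gd_traj)

print(np.allclose(euler_traj, gd_traj))  # True: the iterates coincide
```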
<h2 id="gf-matches-gd-if-its-trajectory-is-roughly-convex">GF matches GD if its trajectory is roughly convex</h2>
<p>There exist classic results bounding the approximation error of Euler’s method, but these are too coarse for our purposes.
Instead, we invoke a modern result known as “Fundamental Theorem” (cf. <a href="https://link.springer.com/book/10.1007/978-3-540-78862-1">Hairer et al. 1993</a>), which implies the following:</p>
<blockquote>
<p><strong>Theorem 1:</strong>
The distance between the GF trajectory $\boldsymbol\theta ( \cdot )$ at time $t$, and iterate $k = t / \eta$ of GD, is upper bounded as:</p>
<p>[
|| \boldsymbol\theta ( t ) - \boldsymbol\theta_{k = t / \eta} || \leq \mathcal{O} \Big( e^{- \smallint_0^{~ t} \lambda_- ( t' ) dt'} t \eta \Big)
,
]
where $\lambda_- ( t' ) := \min \left( \lambda_{min} \left( \nabla^2 f ( \boldsymbol\theta ( t' ) ) \right) , 0 \right)$, i.e. $\lambda_- ( t' )$ is defined to be the negative part of the minimal eigenvalue of the Hessian on the GF trajectory at time $t'$.</p>
</blockquote>
<p>As expected, by using a sufficiently small step size $\eta$, we may ensure that GD is arbitrarily close to GF for arbitrarily long.
How small does $\eta$ need to be?
For the theorem to guarantee that GD follows GF up to time $t$, we must have $\eta \in \mathcal{O} \big( e^{\smallint_0^{~ t} \lambda_- ( t' ) dt'} / t \big)$.
In particular, $\eta$ must scale with $e^{\smallint_0^{~ t} \lambda_- ( t' ) dt'}$, i.e. exponentially with the integral of (the negative part of) the minimal Hessian eigenvalue along the GF trajectory (up to time $t$).
If the optimized objective $f ( \cdot )$ is convex then Hessian eigenvalues are everywhere non-negative, and therefore $\lambda_- ( \cdot ) \equiv 0$, which means that a moderately small $\eta$ (namely, $\eta \in \mathcal{O} ( 1 / t )$) suffices.
If on the other hand $f ( \cdot )$ is non-convex then Hessian eigenvalues may be negative, meaning $\lambda_- ( \cdot )$ may be negative, which in turn implies that $\eta$ may have to be exponentially small.
We prove in our <a href="https://openreview.net/forum?id=iX0TSH45eOd">paper</a> that there indeed exist cases where an exponentially small step size is unavoidable:</p>
<blockquote>
<p><strong>Proposition 1:</strong>
For any $m > 0$, there exist (non-convex) objectives on which the GF trajectory at time $t$ is <strong>not</strong> approximated by GD if $\eta \notin \mathcal{O} ( e^{- m t} )$.</p>
</blockquote>
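<p>A one-dimensional toy case (not the construction from the paper) illustrates the exponential gap. For the concave objective $f ( \theta ) = - \frac{m}{2} \theta^2$, GF admits the closed-form solution $\theta ( t ) = \theta ( 0 ) e^{m t}$, while GD compounds a factor $( 1 + \eta m )$ per step; the distance between the two grows exponentially with $t$, so keeping it fixed requires shrinking $\eta$ exponentially. All constants below are arbitrary illustrative choices:</p>

```python
import numpy as np

# Concave objective f(theta) = -(m/2) * theta^2; m, theta0 and eta are
# arbitrary illustrative choices.
m, theta0, eta = 2.0, 1.0, 0.01

def gf_at(t):
    return theta0 * np.exp(m * t)        # exact GF solution

def gd_at(t):
    k = int(round(t / eta))              # k = t / eta GD steps
    return theta0 * (1.0 + eta * m) ** k

errs = {t: abs(gf_at(t) - gd_at(t)) for t in (2.0, 4.0)}
print(errs)  # the GF-GD gap grows exponentially with t
```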
<p>Despite this negative result, not all hope is lost.
It might be that even though a given objective function is non-convex, specifically over GF trajectories it is “roughly convex,” in the sense that along these trajectories the minimal eigenvalue of the Hessian is “almost non-negative.”
This would mean that $\lambda_- ( \cdot )$ is almost non-negative, in which case (by Theorem 1) a moderately small step size suffices in order for GD to track GF!</p>
<h2 id="trajectories-of-gf-over-nns-are-roughly-convex">Trajectories of GF over NNs are roughly convex</h2>
<p>Being interested in the match between GF and GD over NNs, we analyzed the geometry of GF trajectories on training losses of NNs with homogeneous activations (e.g. linear, ReLU, leaky ReLU).
The following theorem (informally stated) was proven:</p>
<blockquote>
<p><strong>Theorem 2:</strong>
For a training loss of a NN with homogeneous activations, the minimal Hessian eigenvalue is arbitrarily negative across space, but along GF trajectories initialized near zero it is almost non-negative.</p>
</blockquote>
<p>Combined with Theorem 1, this theorem suggests that over NNs (with homogeneous activations), in the common regime of near-zero initialization, a moderately small step size for GD suffices in order for it to be well represented by GF!
We verify this prospect empirically, demonstrating that in basic deep learning settings, reducing the step size for GD often leads to only slight changes in its trajectory.</p>
<div style="text-align:center;">
<img style="width:600px;" src="http://www.offconvex.org/assets/gf-gd-experiment.png" />
<br />
<i><b>Figure 2:</b>
Experiment with NN comparing every iteration of GD with step size $\eta_0 := 0.001$, to every $r$'th iteration of GD with step size $\eta_0 / r$, where $r = 2 , 5 , 10 , 20$.
Left plot shows training loss values; right one shows distance (in weight space) of GD with step size $\eta_0$ from initialization, against its distance from runs with smaller step size.
Takeaway: reducing step size barely made a difference, suggesting GD was already close to the continuous (GF) limit.
</i>
</div>
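<p>The protocol behind Figure 2 is easy to mimic on a miniature example. The depth-2 scalar "network" $w_2 w_1$, the target, the initialization, and the step sizes below are arbitrary illustrative choices, not the settings of the actual experiment:</p>

```python
import numpy as np

# Fit y = 1 with the depth-2 scalar "network" w2 * w1, comparing every
# iterate of GD with step eta0 to every r'th iterate of GD with step eta0/r.
y = 1.0

def gd_traj(eta, n_steps, w_init=0.1):
    w1, w2 = w_init, w_init              # near-zero balanced initialization
    traj = [(w1, w2)]
    for _ in range(n_steps):
        resid = w1 * w2 - y              # loss is (w1*w2 - y)^2
        w1, w2 = w1 - eta * 2 * resid * w2, w2 - eta * 2 * resid * w1
        traj.append((w1, w2))
    return np.array(traj)

eta0, n = 0.01, 2000
base = gd_traj(eta0, n)
max_dist = {}
for r in (2, 10):
    fine = gd_traj(eta0 / r, n * r)[::r]  # every r'th iterate
    max_dist[r] = np.abs(fine - base).max()
print(max_dist)  # shrinking the step barely moves the trajectory
```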
<h2 id="translating-analyses-of-gf-to-results-for-gd">Translating analyses of GF to results for GD</h2>
<p>Theorems 1 and 2 together form a tool for automatically translating analyses of GF over NNs to results for GD with practical step size.
This means that a vast array of continuous mathematical machinery available for analyzing GF can now be leveraged for formally studying practical NN training!
Since analyses of GF over NNs often establish convergence to global minimum and/or characterize the solution found (again, see our <a href="https://openreview.net/forum?id=iX0TSH45eOd">paper</a> for long list of examples), the translation we developed bears potential to shed new light on both optimization and generalization (implicit regularization) in deep learning.
To demonstrate this point, we analyze GF over arbitrarily deep linear NNs with scalar output, and prove the following result:</p>
<blockquote>
<p><strong>Proposition 2:</strong>
GF over an arbitrarily deep linear NN with scalar output converges to global minimum almost surely (i.e. with probability one) under a random near-zero initialization.</p>
</blockquote>
<p>Applying our translation yields an analogous result for GD with practical step size:</p>
<blockquote>
<p><strong>Theorem 3:</strong>
GD over an arbitrarily deep linear NN with scalar output efficiently converges to global minimum almost surely under a random near-zero initialization.</p>
</blockquote>
<p>To the best of our knowledge, this is the first guarantee of random near-zero initialization <strong><em>almost surely</em></strong> leading GD over a deep (three or more layer) NN of fixed size to efficiently converge to global minimum!</p>
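<p>As a quick empirical sanity check of the statement (a toy demo, not the paper's construction), one can train a depth-3 linear network with scalar output by GD from small random initialization on a realizable regression problem, and verify that the end-to-end predictor reaches the global minimum. Dimensions, initialization scale, and step size below are arbitrary illustrative choices:</p>

```python
import numpy as np

rng = np.random.default_rng(1)
d, n = 3, 20
X = rng.standard_normal((n, d))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true                       # realizable: global minimum has zero loss

# Depth-3 linear NN with scalar output: x -> w3 . (W2 @ (W1 @ x)).
W1 = 0.1 * rng.standard_normal((d, d))
W2 = 0.1 * rng.standard_normal((d, d))
w3 = 0.1 * rng.standard_normal(d)

lr = 0.01
for _ in range(50_000):
    a = W2.T @ w3                    # backprop helper
    beta = W1.T @ a                  # end-to-end linear predictor
    g = X.T @ (X @ beta - y) / n     # gradient of the loss w.r.t. beta
    W1, W2, w3 = (W1 - lr * np.outer(a, g),          # chain rule per layer
                  W2 - lr * np.outer(w3, W1 @ g),
                  w3 - lr * W2 @ (W1 @ g))

beta = W1.T @ W2.T @ w3
print(np.linalg.norm(beta - w_true))  # end-to-end predictor near global min
```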
<h2 id="what-about-large-step-size-momentum-stochasticity">What about large step size, momentum, stochasticity?</h2>
<p>An emerging belief (see our <a href="https://openreview.net/forum?id=iX0TSH45eOd">paper</a> for several supporting references) is that for GD over NNs, large step size can be beneficial in terms of generalization.
While the large step size regime isn’t necessarily captured by standard GF, recent works (e.g. <a href="https://openreview.net/pdf?id=3q5IqUrkcF">Barrett & Dherin 2021</a>, <a href="https://openreview.net/pdf?id=q8qLAbQBupm">Kunin et al. 2021</a>) argue that it is captured by certain modifications of GF.
Modifications were also proposed for capturing other aspects of NN training, for example momentum (cf. <a href="https://jmlr.org/papers/volume17/15-084/15-084.pdf">Su et al. 2016</a>, <a href="https://arxiv.org/pdf/1603.04245.pdf">Wibisono et al. 2016</a>, <a href="http://proceedings.mlr.press/v80/franca18a/franca18a.pdf">Franca et al. 2018</a>, <a href="https://jmlr.org/papers/volume22/20-195/20-195.pdf">Wilson et al. 2021</a>) and stochasticity (see, e.g., <a href="https://jmlr.org/papers/volume20/17-526/17-526.pdf">Li et al. 2017</a>, <a href="https://openreview.net/pdf?id=rq_Qr0c1Hyo">Smith et al. 2021</a>, <a href="https://openreview.net/pdf?id=goEdyJ_nVQI">Li et al. 2021</a>).
Extending our GF-to-GD translation machinery to account for modifications as above would be very interesting in my opinion.
All in all, I believe that in the years to come, the vast knowledge on continuous dynamical systems, and GF in particular, will unravel many mysteries behind deep learning.</p>
<p><a href="http://www.cohennadav.com/"><strong><em>Nadav Cohen</em></strong></a></p>
<p><br />
<strong><em>Thanks:</em></strong>
<em>I’d like to thank many people with whom I’ve had illuminating discussions on GF vs. GD over NNs.
These include <a href="https://noahgol.github.io/">Noah Golowich</a>, <a href="https://weihu.me/">Wei Hu</a>, <a href="https://www.cs.princeton.edu/~zl4/">Zhiyuan Li</a> and Kaifeng Lyu.
Special thanks to <a href="https://www.dam.brown.edu/people/menon/">Govind Menon</a>, who drew my attention to the connection to numerical integration, and of course to <a href="https://www.cs.princeton.edu/~arora/">Sanjeev</a>, who has been a companion and guide in promoting the “trajectory approach” to deep learning.</em></p>
Thu, 06 Jan 2022 04:00:00 -0800
http://offconvex.github.io/2022/01/06/gf-gd/

Implicit Regularization in Tensor Factorization: Can Tensor Rank Shed Light on Generalization in Deep Learning?

<p>In an effort to understand implicit regularization in deep learning, a lot of theoretical focus is being directed at matrix factorization, which can be viewed as training a linear neural network.
This post is based on our <a href="https://arxiv.org/pdf/2102.09972.pdf">recent paper</a> (to appear at ICML 2021), where we take a step towards practical deep learning, by investigating <em>tensor factorization</em> — a model equivalent to a certain type of non-linear neural networks.
It is well known that <a href="https://arxiv.org/pdf/0911.1393.pdf">most tensor problems are NP-hard</a>, and accordingly, the common sentiment is that working with tensors (in both theory and practice) entails extreme difficulties.
However, by adopting a dynamical systems view, we manage to avoid such difficulties, and establish an implicit regularization towards low <em>tensor rank</em>.
Our results suggest that tensor rank may shed light on generalization in deep learning.</p>
<h2 id="challenge-finding-a-right-measure-of-complexity">Challenge: finding a right measure of complexity</h2>
<p>Overparameterized neural networks are mysteriously able to generalize even when trained without any explicit regularization.
Per conventional wisdom, this generalization stems from an <em>implicit regularization</em> — a tendency of gradient-based optimization to fit training examples with predictors of minimal ‘‘complexity.’’
A major challenge in translating this intuition to provable guarantees is that we lack measures for predictor complexity that are quantitative (admit generalization bounds) and, at the same time, capture the essence of natural data (images, audio, text, etc.), in the sense that such data can be fit with predictors of low complexity.</p>
<div style="text-align:center;">
<img style="width:500px;padding-bottom:10px;padding-top:5px" src="http://www.offconvex.org/assets/reg_tf/imp_reg_tf_data_complexity.png" />
<br />
<i><b>Figure 1:</b>
To explain generalization in deep learning, a complexity <br />
measure must allow the fit of natural data with low complexity. On the <br />
other hand, when fitting data which does not admit generalization, <br />
e.g. random data, the complexity should be high.
</i>
</div>
<p><br /></p>
<h2 id="a-common-testbed-matrix-factorization">A common testbed: matrix factorization</h2>
<p>Without a clear complexity measure for practical neural networks, existing analyses usually focus on simple settings where a notion of complexity is obvious.
A common example of such a setting is <em>matrix factorization</em> — matrix completion via linear neural networks.
This model was discussed pretty extensively in previous posts (see <a href="http://www.offconvex.org/2019/06/03/trajectories/">one</a> by Sanjeev, <a href="http://www.offconvex.org/2019/07/10/trajectories-linear-nets/">one</a> by Nadav and Wei and <a href="https://www.offconvex.org/2020/11/27/reg_dl_not_norm/">another one</a> by Nadav), but for completeness we present it again here.</p>
<p>In <em>matrix completion</em> we’re given a subset of entries from an unknown matrix $W^* \in \mathbb{R}^{d \times d'}$, and our goal is to predict the unobserved entries.
This can be viewed as a supervised learning problem with $2$-dimensional inputs, where the label of the input $( i , j )$ is $( W^* )_{i,j}$.
Under such a viewpoint, the observed entries are the training set, and the average reconstruction error over unobserved entries is the test error, quantifying generalization.
A predictor can then be thought of as a matrix, and a natural notion of complexity is its <em>rank</em>.
Indeed, in many real-world scenarios (a famous example is the <a href="https://en.wikipedia.org/wiki/Netflix_Prize">Netflix Prize</a>) one is interested in <a href="https://arxiv.org/pdf/1601.06422.pdf">recovering a low rank matrix from incomplete observations</a>.</p>
<p>A ‘‘deep learning approach’’ to matrix completion is matrix factorization, where the idea is to use a linear neural network (fully connected neural network with no non-linearity), and fit observations via gradient descent (GD).
This amounts to optimizing the following objective:</p>
<div style="text-align:center;">
\[
\min\nolimits_{W_1 , \ldots , W_L} ~ \sum\nolimits_{(i,j) \in \text{observations}} \big[ ( W_L \cdots W_1 )_{i , j} - (W^*)_{i,j} \big]^2 ~.
\]
</div>
<p>It is obviously possible to constrain the rank of the produced solution by limiting the shared dimensions of the weight matrices $\{ W_j \}_j$.
However, from an implicit regularization standpoint, the most interesting case is where rank is unconstrained and the factorization can express any matrix.
In this case there is no explicit regularization, and the kind of solution we get is determined implicitly by the parameterization and the optimization algorithm.</p>
<p>As it turns out, in practice, matrix factorization with near-zero initialization and small step size tends to accurately recover low rank matrices.
This phenomenon (first identified in <a href="https://papers.nips.cc/paper/2017/file/58191d2a914c6dae66371c9dcdc91b41-Paper.pdf">Gunasekar et al. 2017</a>) manifests some kind of implicit regularization, whose mathematical characterization drew a lot of interest.
It was initially conjectured that matrix factorization implicitly minimizes nuclear norm (<a href="https://papers.nips.cc/paper/2017/file/58191d2a914c6dae66371c9dcdc91b41-Paper.pdf">Gunasekar et al. 2017</a>), but recent evidence points to implicit rank minimization, stemming from incremental learning dynamics (see <a href="https://papers.nips.cc/paper/2019/file/c0c783b5fc0d7d808f1d14a6e9c8280d-Paper.pdf">Arora et al. 2019</a>; <a href="https://papers.nips.cc/paper/2020/file/f21e255f89e0f258accbe4e984eef486-Paper.pdf">Razin & Cohen 2020</a>; <a href="https://openreview.net/pdf/e29b53584bc9017cb15b9394735cd51b56c32446.pdf">Li et al. 2021</a>).
Today, it seems we have a relatively firm understanding of generalization in matrix factorization.
There is a complexity measure for predictors — matrix rank — by which implicit regularization strives to lower complexity, and the data itself is of low complexity (i.e. can be fit with low complexity).
Jointly, these two conditions lead to generalization.</p>
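<p>The phenomenon is easy to reproduce in a few lines of NumPy. The sketch below (dimensions, observation ratio, initialization scale, and step size are all arbitrary illustrative choices) fits the observed entries of a rank-1 matrix with a depth-2 factorization $W = A B$ and checks that the solution found is close to rank 1:</p>

```python
import numpy as np

rng = np.random.default_rng(0)
d = 5

# Rank-1 ground truth, normalized so its top singular value is 1.
W_star = np.outer(rng.standard_normal(d), rng.standard_normal(d))
W_star /= np.linalg.norm(W_star)

mask = rng.random((d, d)) < 0.6          # observed entries
A = 1e-6 * rng.standard_normal((d, d))   # near-zero initialization
B = 1e-6 * rng.standard_normal((d, d))

lr = 0.1
for _ in range(20_000):
    R = (A @ B - W_star) * mask          # residual on observed entries only
    A, B = A - lr * 2 * R @ B.T, B - lr * 2 * A.T @ R   # GD on both factors

W = A @ B
train_loss = np.sum(((W - W_star) * mask) ** 2)
svals = np.linalg.svd(W, compute_uv=False)
print(train_loss, svals[1] / svals[0])   # low train loss, near-rank-1 solution
```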
<h2 id="beyond-matrix-factorization-tensor-factorization">Beyond matrix factorization: tensor factorization</h2>
<p>Matrix factorization is interesting on its own behalf, but as a theoretical surrogate for deep learning it is limited.
First, it corresponds to <em>linear</em> neural networks, and thus misses the crucial aspect of non-linearity.
Second, viewing matrix completion as a prediction problem, it doesn’t capture tasks with more than two input variables.
As we now discuss, both of these limitations can be lifted if instead of matrices one considers tensors.</p>
<p>A tensor can be thought of as a multi-dimensional array.
The number of axes in a tensor is called its <em>order</em>.
In the task of <em>tensor completion</em>, a subset of entries from an unknown tensor $\mathcal{W}^* \in \mathbb{R}^{d_1 \times \cdots \times d_N}$ is given, and the goal is to predict the unobserved entries.
Analogously to how matrix completion can be viewed as a prediction problem over two input variables, order-$N$ tensor completion can be seen as a prediction problem over $N$ input variables (each corresponding to a different axis).
In fact, any multi-dimensional prediction task with discrete inputs and scalar output can be formulated as a tensor completion problem.
Consider for example the <a href="https://en.wikipedia.org/wiki/MNIST_database">MNIST dataset</a>, and for simplicity assume that image pixels hold one of two values, i.e. are either black or white.
The task of predicting labels for the $28$-by-$28$ binary images can be seen as an order-$784$ (one axis for each pixel) tensor completion problem, where all axes are of length $2$ (corresponding to the number of values a pixel can take).
For further details on how general prediction tasks map to tensor completion problems see <a href="https://arxiv.org/pdf/2102.09972.pdf">our paper</a>.</p>
<div style="text-align:center;">
<img style="width:550px;padding-bottom:10px;padding-top:5px" src="http://www.offconvex.org/assets/reg_tf/pred_prob_to_tensor_comp.png" />
<br />
<i><b>Figure 2:</b>
Prediction tasks can be viewed as tensor completion problems. <br />
For example, predicting labels for input images with $3$ pixels, each taking <br />
one of $5$ grayscale values, corresponds to completing a $5 \times 5 \times 5$ tensor.
</i>
</div>
<p><br />
Like matrices, tensors can be factorized.
The most basic scheme for factorizing tensors, named CANDECOMP/PARAFAC (CP), parameterizes a tensor as a sum of outer products (for information on this scheme, as well as others, see the <a href="http://www.kolda.net/publication/TensorReview.pdf">excellent survey</a> of Kolda and Bader).
In <a href="https://arxiv.org/pdf/2102.09972.pdf">our paper</a> and this post, we use the term <em>tensor factorization</em> to refer to solving tensor completion by fitting observations via GD over CP parameterization, i.e. over the following objective ($\otimes$ here stands for outer product):</p>
<div style="text-align:center;">
\[
\min\nolimits_{ \{ \mathbf{w}_r^n \}_{r , n} } \sum\nolimits_{ (i_1 , ... , i_N) \in \text{observations} } \big[ \big( {\textstyle \sum}_{r = 1}^R \mathbf{w}_r^1 \otimes \cdots \otimes \mathbf{w}_r^N \big)_{i_1 , \ldots , i_N} - (\mathcal{W}^*)_{i_1 , \ldots , i_N} \big]^2 ~.
\]
</div>
<p>The concept of rank naturally extends from matrices to tensors.
The <em>tensor rank</em> of a given tensor $\mathcal{W}$ is defined to be the minimal number of components (i.e. of outer product summands) $R$ required for CP parameterization to express it.
Note that for order-$2$ tensors, i.e. for matrices, this exactly coincides with matrix rank.
We can explicitly constrain the tensor rank of solutions found by tensor factorization via limiting the number of components $R$.
However, since our interest lies in implicit regularization, we consider the case where $R$ is large enough for any tensor to be expressed.</p>
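<p>A small NumPy sketch makes the CP parameterization and the rank correspondence tangible. The component vectors below are arbitrary illustrative choices:</p>

```python
import numpy as np

def cp_tensor(components):
    """Sum of outer products; components is a list of (w^1, ..., w^N) tuples."""
    total = 0
    for vecs in components:
        term = vecs[0]
        for v in vecs[1:]:
            term = np.tensordot(term, v, axes=0)   # outer product
        total = total + term
    return total

v1, v2, v3 = np.array([1.0, 2.0]), np.array([3.0, -1.0]), np.array([0.5, 4.0])
u1, u2, u3 = np.array([-2.0, 1.0]), np.array([1.0, 1.0]), np.array([2.0, 0.0])

# Order-3 tensor of tensor rank at most 2: a sum of R = 2 outer products.
T = cp_tensor([(v1, v2, v3), (u1, u2, u3)])

# For order-2 tensors the definition coincides with matrix rank.
M = cp_tensor([(v1, v2), (u1, u2)])
print(T.shape, np.linalg.matrix_rank(M))   # (2, 2, 2) 2
```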
<p>By now you might be wondering what tensor factorization has to do with deep learning.
Apparently, as Nadav mentioned in an <a href="http://www.offconvex.org/2020/11/27/reg_dl_not_norm/">earlier post</a>, analogously to how matrix factorization is equivalent to matrix completion (two-dimensional prediction) via linear neural networks, tensor factorization is equivalent to tensor completion (multi-dimensional prediction) with a certain type of <em>non-linear</em> neural networks (for the exact details behind the latter equivalence see <a href="https://arxiv.org/pdf/2102.09972.pdf">our paper</a>).
It therefore represents a setting one step closer to practical neural networks.</p>
<div style="text-align:center;">
<img style="width:900px;padding-bottom:10px;padding-top:5px" src="http://www.offconvex.org/assets/reg_tf/mf_lnn_tf_nonlinear.png" />
<br />
<i><b>Figure 3:</b>
While matrix factorization corresponds to a linear neural network, <br />
tensor factorization corresponds to a certain non-linear neural network.
</i>
</div>
<p><br />
As a final piece of the analogy between matrix and tensor factorizations, in a <a href="https://arxiv.org/pdf/2005.06398.pdf">previous paper</a> (described in an <a href="https://www.offconvex.org/2020/11/27/reg_dl_not_norm/">earlier post</a>) Noam and Nadav demonstrated empirically that (similarly to the phenomenon discussed above for matrices) tensor factorization with near-zero initialization and small step size tends to accurately recover low rank tensors.
Our goal in the <a href="https://arxiv.org/pdf/2102.09972.pdf">current paper</a> was to mathematically explain this finding.
To avoid the <a href="https://arxiv.org/pdf/0911.1393.pdf">notorious difficulty of tensor problems</a>, we chose to adopt a dynamical systems view, and analyze directly the trajectories induced by GD.</p>
<h2 id="dynamical-analysis-implicit-tensor-rank-minimization">Dynamical analysis: implicit tensor rank minimization</h2>
<p>So what can we say about the implicit regularization in tensor factorization?
At the core of our analysis is the following dynamical characterization of component norms:</p>
<blockquote>
<p><strong>Theorem:</strong>
Running gradient flow (GD with infinitesimal step size) over a tensor factorization with near-zero initialization leads component norms to evolve by:
[ \frac{d}{dt} || \mathbf{w}_r^1 (t) \otimes \cdots \otimes \mathbf{w}_r^N (t) || \propto \color{brown}{|| \mathbf{w}_r^1 (t) \otimes \cdots \otimes \mathbf{w}_r^N (t) ||^{2 - 2/N}} ~,
]
where $\mathbf{w}_r^1 (t), \ldots, \mathbf{w}_r^N (t)$ denote the weight vectors at time $t \geq 0$.</p>
</blockquote>
<p>According to the theorem above, component norms evolve at a rate proportional to their size exponentiated by $\color{brown}{2 - 2 / N}$ (recall that $N$ is the order of the tensor to complete).
Consequently, they are subject to a momentum-like effect, by which they move slower when small and faster when large.
This suggests that when initialized near zero, components tend to remain close to the origin, and then, after passing a critical threshold, quickly grow until convergence.
Intuitively, these dynamics induce an incremental process where components are learned one after the other, leading to solutions with a few large components and many small ones, i.e. to (approximately) low tensor rank solutions!</p>
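<p>The lag-then-takeoff behavior can be seen by Euler-integrating the norm dynamics directly. The constant of proportionality, tensor order, initialization scales, and threshold below are arbitrary illustrative choices:</p>

```python
# Euler-integrate d/dt x = c * x^(2 - 2/N), the component-norm dynamics,
# and record how long a component takes to become macroscopic.
def time_to_reach(x0, threshold=1.0, c=1.0, N=4, dt=1e-3):
    x, t = x0, 0.0
    while x < threshold:
        x += dt * c * x ** (2.0 - 2.0 / N)
        t += dt
    return t

times = {x0: time_to_reach(x0) for x0 in (1e-2, 1e-4)}
print(times)  # smaller initialization => component stays near zero far longer
```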
<p>We empirically verified the incremental learning of components in many settings.
Here is a representative example from one of our experiments (see <a href="https://arxiv.org/pdf/2102.09972.pdf">the paper</a> for more):</p>
<div style="text-align:center;">
<img style="width:800px;padding-bottom:15px;padding-top:10px;" src="http://www.offconvex.org/assets/reg_tf/tf_dyn_exps.png" />
<br />
<i><b>Figure 4:</b>
Dynamics of component norms during GD over tensor factorization. <br />
An incremental learning effect is enhanced as initialization scale decreases, <br />
leading to accurate completion of a low rank tensor.
</i>
</div>
<p><br />
Using our dynamical characterization of component norms, we were able to prove that with sufficiently small initialization, tensor factorization (approximately) follows a trajectory of rank one tensors for an arbitrary amount of time.
This leads to:</p>
<blockquote>
<p><strong>Theorem:</strong>
If tensor completion has a rank one solution, then under certain technical conditions, tensor factorization will reach it.</p>
</blockquote>
<p>It’s worth mentioning that, in a way, our results extend to tensor factorization the incremental rank learning dynamics known for matrix factorization (cf. <a href="https://papers.nips.cc/paper/2019/file/c0c783b5fc0d7d808f1d14a6e9c8280d-Paper.pdf">Arora et al. 2019</a> and <a href="https://arxiv.org/pdf/2012.09839v1.pdf">Li et al. 2021</a>).
As is typical when transitioning from matrices to tensors, this extension entailed various challenges that necessitated the use of different techniques.</p>
<h2 id="tensor-rank-as-measure-of-complexity">Tensor rank as measure of complexity</h2>
<p>Going back to the beginning of the post, recall that a major challenge towards understanding implicit regularization in deep learning is that we lack measures for predictor complexity that capture natural data.
Now, let us recap what we have seen thus far:
$(1)$ tensor completion is equivalent to multi-dimensional prediction;
$(2)$ tensor factorization corresponds to solving the prediction task with certain non-linear neural networks;
and
$(3)$ the implicit regularization of these non-linear networks, i.e. of tensor factorization, minimizes tensor rank.
Motivated by these findings, we ask the following:</p>
<blockquote>
<p><strong>Question:</strong>
Can tensor rank serve as a measure of predictor complexity?</p>
</blockquote>
<p>We empirically explored this prospect by evaluating the extent to which tensor rank captures natural data, i.e. to which natural data can be fit with predictors of low tensor rank.
As testbeds we used <a href="https://en.wikipedia.org/wiki/MNIST_database">MNIST</a> and <a href="https://github.com/zalandoresearch/fashion-mnist">Fashion-MNIST</a> datasets, comparing the resulting errors against those obtained when fitting two randomized variants: one generated via shuffling labels (‘‘rand label’’), and the other by replacing inputs with noise (‘‘rand image’’).</p>
<p>The following plot, displaying results for Fashion-MNIST (those for MNIST are similar), shows that with predictors of low tensor rank the original data is fit way more accurately than the randomized datasets.
Specifically, even with tensor rank as low as one the original data is fit relatively well, while the error in fitting random data is close to trivial (variance of the label).
This suggests that tensor rank as a measure of predictor complexity has potential to capture aspects of natural data!
Note also that an accurate fit with low tensor rank coincides with low test error, which is not surprising given that low tensor rank predictors can be described with a small number of parameters.</p>
<div style="text-align:center;">
<img style="width:600px;padding-bottom:15px;padding-top:10px;" src="http://www.offconvex.org/assets/reg_tf/exp_complexity_fmnist.png" />
<br />
<i><b>Figure 5:</b>
Evaluation of tensor rank as a measure of complexity — standard datasets <br />
can be fit accurately with predictors of low tensor rank (far beneath what is required by <br />
random datasets), suggesting it may capture aspects of natural data. Plot shows mean <br />
error of predictors with low tensor rank over Fashion-MNIST. Markers correspond <br />
to separate runs differing in the explicit constraint on the tensor rank.
</i>
</div>
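To make the notion concrete, here is a toy sketch (not the paper's experimental setup) of fitting a small label tensor with a predictor of tensor rank one, via alternating least squares on a CP factorization; the planted rank-1 target and the noise level are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8

# Toy "dataset": a label tensor that is approximately tensor rank 1,
# standing in for data that a low tensor rank predictor can fit well.
a, b, c = rng.normal(size=d), rng.normal(size=d), rng.normal(size=d)
Y = np.einsum("i,j,k->ijk", a, b, c) + 0.01 * rng.normal(size=(d, d, d))

# Fit a rank-1 CP predictor  W[i,j,k] = u[i] v[j] w[k]  by alternating
# least squares: each factor update is a closed-form least-squares solve.
u, v, w = (rng.normal(size=d) for _ in range(3))
for _ in range(50):
    u = np.einsum("ijk,j,k->i", Y, v, w) / ((v @ v) * (w @ w))
    v = np.einsum("ijk,i,k->j", Y, u, w) / ((u @ u) * (w @ w))
    w = np.einsum("ijk,i,j->k", Y, u, v) / ((u @ u) * (v @ v))

W = np.einsum("i,j,k->ijk", u, v, w)
err = np.mean((W - Y) ** 2) / np.mean(Y ** 2)
print(f"relative error of the rank-1 fit: {err:.2e}")
```

With a rank-1 target the fit error drops to the noise floor; for labels without low-tensor-rank structure (e.g. shuffled labels), no rank-1 predictor comes close, mirroring the gap in the plot above.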
<h2 id="concluding-thoughts">Concluding thoughts</h2>
<p>Overall, <a href="https://arxiv.org/pdf/2102.09972.pdf">our paper</a> shows that tensor rank captures both the implicit regularization of a certain type of non-linear neural networks, and aspects of natural data.
In light of this, we believe tensor rank (or more advanced notions such as hierarchical tensor rank) might pave the way to explaining both implicit regularization in more practical neural networks, and the properties of real-world data translating this implicit regularization to generalization.</p>
<p><a href="https://noamrazin.github.io/">Noam Razin</a>, <a href="https://asafmaman101.github.io/">Asaf Maman</a>, <a href="http://www.cohennadav.com/">Nadav Cohen</a></p>
Thu, 08 Jul 2021 02:00:00 -0700
http://offconvex.github.io/2021/07/08/imp-reg-tf/

Rip van Winkle's Razor, a Simple New Estimate for Adaptive Data Analysis<p><em>Can you trust a model whose designer had access to the test/holdout set?</em> This implicit question
in <a href="https://science.sciencemag.org/content/349/6248/636.full">Dwork et al 2015</a> launched a new field, <em>adaptive data analysis</em>.
The question referred to the fact that in many scientific settings as well as modern machine learning (with its standardized datasets like CIFAR,
ImageNet etc.) the model designer has full access to the holdout set and is free to ignore the</p>
<blockquote>
<p>(Basic Dictum of Data Science) “Thou shalt not train on the test/holdout set.”</p>
</blockquote>
<p>Furthermore, even researchers who scrupulously follow the Basic Dictum may be unknowingly violating it when they take inspiration (and design choices)
from published works by others who presumably published <em>only the best of the many models they evaluated on the test set.</em></p>
<p>Dwork et al. showed that if the test set has size $N$, and the designer is allowed to see the error of the first $i-1$ models on the test set before designing the $i$’th model, then a clever designer can use so-called <a href="https://arxiv.org/pdf/1502.04585.pdf"><em>wacky boosting</em></a> (see this <a href="http://blog.mrtz.org/2015/03/09/competition.html">blog post</a>) to ensure that the accuracy of the $t$’th model on the test set is as high as $\Omega(\sqrt{t/N})$. In other words, the test set could become essentially useless once $t \gg N$, a condition that holds in ML, since in popular datasets (CIFAR10, CIFAR100, ImageNet etc.) $N$ is no more than $100,000$ while the total number of models being trained world-wide is well in the millions, if not higher (once you include hyperparameter searches).</p>
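A toy simulation in the spirit of wacky boosting (a hedged sketch, not the exact construction from the paper): keep only the random classifiers that happen to beat chance on the test set, and aggregate them by majority vote. The aggregate "overfits" labels that are pure noise:

```python
import numpy as np

rng = np.random.default_rng(0)
N, t = 1000, 200                  # test-set size, number of adaptive "models"

y = rng.choice([-1, 1], size=N)   # hidden test labels (pure noise)
kept = []
for _ in range(t):
    f = rng.choice([-1, 1], size=N)   # a random classifier's predictions
    # Adaptive step: query the test set, keep the model only if it
    # happens to beat chance on this particular test set.
    if np.mean(f == y) > 0.5:
        kept.append(f)

# Aggregate the lucky models by majority vote.
g = np.sign(np.sum(kept, axis=0))
g[g == 0] = 1
acc = np.mean(g == y)
print(f"test accuracy of aggregated model: {acc:.3f}  (labels are pure noise!)")
```

Each kept classifier beats chance by only $O(1/\sqrt{N})$, but the majority vote of roughly $t/2$ of them compounds these lucky fluctuations, so the test accuracy climbs well above $50\%$ even though the true accuracy on fresh data is exactly $50\%$.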
<blockquote>
<p><strong>Meta-overfitting Error (MOE)</strong> of a model is the difference between its average error on the test data and its expected error on the full distribution.
(It is closely related to <a href="https://en.wikipedia.org/wiki/False_discovery_rate"><em>false discovery rate</em></a> in statistics.)</p>
</blockquote>
<p>This blog post concerns <a href="https://arxiv.org/pdf/2102.13189.pdf">our new paper</a>, which gives meaningful upper bounds on this sort of trouble for popular
deep net architectures, whereas prior ideas from adaptive data analysis gave no nontrivial estimates. We call our estimate <em>Rip van Winkle’s Razor</em>
which combines references to <a href="https://en.wikipedia.org/wiki/Occam%27s_razor">Occam’s Razor</a> and the
<a href="https://en.wikipedia.org/wiki/Rip_Van_Winkle">mythical person who fell asleep for 20 years</a>.</p>
<figure align="center">
<img src="http://www.offconvex.org/assets/ripvanwinkle.jpg" alt="drawing" width="50%" />
<figcaption> Rip Van Winkle wakes up from 20 years of sleep, clearly needing a Razor </figcaption>
</figure>
<h2 id="adaptive-data-analysis-brief-tour">Adaptive Data Analysis: Brief tour</h2>
<p>It is well-known that for a model trained <strong>without</strong> ever querying the test set, MOE scales (with high probability over choice of the test set) as $1/\sqrt{N}$ where $N$
is the size of the test set. Furthermore standard concentration bounds imply that even if we train $t$ models without ever referring to the test set (in other words,
using proper data hygiene) then the maximum meta-overfitting error among the $t$ models scales whp as $O(\sqrt{\log(t)/ N})$. The trouble pinpointed by Dwork et al.
can happen only if models are designed adaptively, with test error of the previous models shaping the design of the next model.</p>
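The benign $O(\sqrt{\log(t)/N})$ scaling for non-adaptively designed models is easy to see numerically; in this sketch each "model" is assumed to have true accuracy $0.8$, with its test accuracy an independent binomial mean:

```python
import numpy as np

rng = np.random.default_rng(1)
N = 10_000                        # test-set size
results = {}
for t in [1, 10, 100, 1000]:      # number of independently designed models
    emp = rng.binomial(N, 0.8, size=t) / N   # each model's test accuracy
    results[t] = np.max(np.abs(emp - 0.8))   # worst meta-overfitting error
    print(f"t={t:5d}  max MOE = {results[t]:.4f}   "
          f"sqrt(log(t+1)/N) = {np.sqrt(np.log(t + 1) / N):.4f}")
```

Even for a thousand models, the worst empirical MOE stays a small multiple of $1/\sqrt{N}$, growing only logarithmically in $t$; the trouble described above requires adaptivity.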
<p>Adaptive Data Analysis has come up with many good practices for honest researchers to mitigate such issues. For instance, Dwork et al. showed that using
Differential Privacy on labels while evaluating models can lower MOE. Or the <a href="https://arxiv.org/pdf/1502.04585.pdf">Ladder mechanism</a> helps in Kaggle-like
settings where the test dataset resides on a server that can choose to answer only a selected subset of queries, which essentially takes away the MOE issue.</p>
<p>For several of these good practices, matching lower bounds exist, showing how to construct cheating models whose MOE matches the upper bound.</p>
<p>However, such recommended best practices do not help with understanding the MOE in the performance numbers of a new model, since there is no guarantee that the
inventors never tuned models using the test set, or didn’t get inspiration from existing models that may have been designed that way. Thus, statistically
speaking, the above results still give no reason to believe that a modern deep net such as ResNet152 has low MOE.</p>
<p><a href="http://proceedings.mlr.press/v97/recht19a/recht19a.pdf">Recht et al. 2019</a> summed up the MOE issue in a catchy title: <em>Do ImageNet Classifiers Generalize to ImageNet?</em> They tried to answer their question experimentally by creating new test sets from scratch; we discuss their results later.</p>
<h2 id="moe-bounds-and-description-length">MOE bounds and description length</h2>
<p>The starting point of our work is the following classical concentration bounds:</p>
<blockquote>
<p><strong>Folklore Theorem</strong> With high probability over the choice of a test set of size $N$, the MOE of <em>all</em> models with description length at most $k$ bits is $O(\sqrt{k/N})$.</p>
</blockquote>
<p>At first sight this doesn’t seem to help us because one cannot imagine modern deep nets having a short description. The most obvious description involves reporting
values of the net parameters, which requires millions or even hundreds of millions of bits, resulting in a vacuous upper bound on MOE.</p>
<p>Another obvious description would be the computer program used to produce the model using the (publicly available) training and validation sets. However, these
programs usually rely on imported libraries through layers of encapsulation and so the effective program size is pretty large as well.</p>
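To see the numbers involved, here is a sketch of the folklore bound via Hoeffding's inequality plus a union bound over the at most $2^k$ describable models; the constants here are my own and may differ from those used in the paper:

```python
import numpy as np

def moe_bound(k_bits, N, delta=0.01):
    """High-probability MOE bound for all models describable in k bits.

    Hoeffding + a union bound over the at most 2**k descriptions gives
    |MOE| <= sqrt((k*ln(2) + ln(2/delta)) / (2*N)) simultaneously for
    all of them, with probability >= 1 - delta over the test set.
    (A sketch of the folklore argument, not the paper's exact constants.)
    """
    return np.sqrt((k_bits * np.log(2) + np.log(2 / delta)) / (2 * N))

# A ~1000-bit description vs. reporting millions of raw parameter bits:
print(f"k = 1032 bits, N = 100k: MOE <= {moe_bound(1032, 100_000):.3f}")
print(f"k = 1e8  bits, N = 100k: MOE <= {moe_bound(1e8, 100_000):.1f}  (vacuous)")
```

A description of around a thousand bits gives a bound of a few percent, while describing a net by its raw parameters (hundreds of millions of bits) gives a bound far above $1$, i.e. no information at all.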
<h2 id="rip-van-winkles-razor">Rip van Winkle’s Razor</h2>
<p>Our new upper bound involves a more careful definition of <em>Description Length</em>: it is the smallest description that allows a referee to reproduce a model of
similar performance using the (universally available) training and validation datasets.</p>
<p>While this phrasing may appear reminiscent of the review process for conferences and journals, there is a subtle difference with respect to what the referee
can or cannot be assumed to know. (Clearly, assumptions about the referee can greatly affect description length; e.g., a referee ignorant of even basic
calculus might need a very long explanation!)</p>
<blockquote>
<p><strong>Informed Referee:</strong> “Knows everything that was known to humanity (e.g., about deep learning, mathematics, optimization, statistics etc.) right up to the
moment of creation of the Test set.”</p>
</blockquote>
<blockquote>
<p><strong>Unbiased Referee:</strong> Knows nothing discovered since the Test set was created.</p>
</blockquote>
<p>Thus <em>Description Length</em> of a model is the number of bits in the shortest description that allows an informed but unbiased referee to reproduce the claimed result.</p>
<p>Note that informed referees let descriptions get shorter, while unbiased referees require longer descriptions that rule out any statistical “contamination” due to any interaction whatsoever with the test set. For example, momentum techniques in optimization were
well-studied before the creation of ImageNet test set, so informed referees can be expected to understand a line like “SGD with momentum 0.9.” But a
line like “Use Batch Normalization” cannot be understood by unbiased referees since conceivably this technique (invented after 2012) might have
become popular precisely because it leads to better performance on the test set of ImageNet.</p>
<p>By now it should be clear why the estimate is named after <a href="https://en.wikipedia.org/wiki/Rip_Van_Winkle">“Rip van Winkle”</a>: the referee can be thought
of as an infinitely well-informed researcher who went into deep sleep at the moment of creation of the test set, and has just been woken up years later
to start refereeing the latest papers. Real-life journal referees who luckily did not suffer this way should try to simulate the idealized Rip van Winkle
in their heads while perusing the description submitted by the researcher.</p>
<p>To allow as short a description as possible the researcher is allowed to compress the description of their new deep net non-destructively using any compression that would make sense to Rip van Winkle (e.g., <a href="https://en.wikipedia.org/wiki/Huffman_coding">Huffman Coding</a>). The description of the compression method itself
is not counted towards the description length – provided the same method is used for all papers submitted to Rip van Winkle. To give an example, a
technique appearing in a text known to Rip van Winkle could be succinctly referred to using the book’s ISBN and page number.</p>
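As a toy illustration of such compression (with a made-up vocabulary, not the paper's actual encoding), a Huffman code assigns shorter codewords to tokens the description uses more often:

```python
import heapq
from collections import Counter

def huffman_lengths(freqs):
    """Codeword length in bits per token for a Huffman code.

    Assumes at least two distinct tokens. Heap entries carry a unique id
    so that ties never fall through to comparing the dicts.
    """
    heap = [(f, i, {tok: 0}) for i, (tok, f) in enumerate(freqs.items())]
    heapq.heapify(heap)
    next_id = len(heap)
    while len(heap) > 1:
        f1, _, l1 = heapq.heappop(heap)
        f2, _, l2 = heapq.heappop(heap)
        # Merging two subtrees pushes every token in them one level deeper.
        merged = {t: depth + 1 for t, depth in {**l1, **l2}.items()}
        heapq.heappush(heap, (f1 + f2, next_id, merged))
        next_id += 1
    return heap[0][2]

# Hypothetical description tokens with their usage counts.
desc = "CONV ReLU CONV ReLU MaxPool CONV ReLU CONV ReLU SGD".split()
lengths = huffman_lengths(Counter(desc))
total_bits = sum(lengths[tok] for tok in desc)
print(lengths)
print(f"total description length: {total_bits} bits "
      f"(vs {len(desc) * 2} bits for a fixed 2-bit code)")
```

Frequent tokens like <em>CONV</em> and <em>ReLU</em> get 1-2 bit codewords while rare ones get longer codes, so the total length beats a fixed-width code; the codebook itself is not counted, matching the convention above that the (shared) compression method is free.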
<h2 id="estimating-moe-of-resnet-152">Estimating MOE of ResNet-152</h2>
<p>As an illustration, here we provide a suitable description allowing Rip van Winkle to reproduce a mainstream ImageNet model, ResNet-152, which achieves $4.49\%$ top-5
test error.</p>
<p>The description consists of three types of expressions: English phrases, Math equations, and directed graphs. In the paper, we describe in detail how to encode
each of them into binary strings and count their lengths. The allowed vocabulary includes primitive concepts that were known before 2012, such
as <em>CONV, MaxPool, ReLU, SGD</em> etc., as well as a graph-theoretic notation/shorthand for describing net architecture. The newly introduced concepts
including <em>Batch-Norm</em>, <em>Layer, Block</em> are defined precisely using Math, English, and other primitive concepts.</p>
<figure align="center">
<img src="http://www.offconvex.org/assets/resnet_description.png" alt="drawing" width="80%" />
<figcaption><b>Description for reproducing ResNet-152</b></figcaption>
</figure>
<p>According to our estimate, the length of the above description is $1032$ bits, which translates into an upper bound on meta-overfitting error of merely $5\%$!
This suggests the real top-5 error of the model on full distribution is at most $9.49\%$. In the paper we also provide a $980$-bit long description for
reproducing DenseNet-264, which leads to $5.06\%$ upper bound on its meta-overfitting error.</p>
<p>Note that the number $5.06$ suggests higher precision than actually given by the method, since it is possible to quibble about the coding assumptions
that led to it. Perhaps others might use a more classical coding mechanism and obtain an estimate of $6\%$ or $7\%$.</p>
<p>But the important point is that unlike existing bounds in Adaptive Data Analysis, there is <strong>no</strong> dependence on $t$, the number of models that have been tested before, and the bound is non-vacuous.</p>
<h2 id="empirical-evidence-about-lack-of-meta-overfitting">Empirical evidence about lack of meta-overfitting</h2>
<p>Our estimates indicate that the issue of meta-overfitting on ImageNet for these mainstream models is mild. The reason is that despite the vast number
of parameters and hyper-parameters in today’s deep nets, the <em>information content</em> of these models is not high given knowledge circa 2012.</p>
<p>Recently Recht et al. <a href="https://arxiv.org/abs/1902.10811">tried to reach an empirical upper bound on MOE</a> for
ImageNet and <a href="https://arxiv.org/abs/1806.00451">CIFAR-10</a>. They created new test sets by carefully replicating the methodology used for constructing the original ones. They found that the error of famous published models of the past seven years is as much as 10-15% higher on the new test set as compared to the original. On the face of it, this seemed to confirm a case of bad meta-overfitting. But they also presented evidence that the swing in test error was due to systemic effects during test set creation. For instance, a comparable swing happens also for models that predated the creation of ImageNet (and thus were not overfitted to the ImageNet test set).
<a href="https://proceedings.neurips.cc/paper/2019/hash/ee39e503b6bedf0c98c388b7e8589aca-Abstract.html">A followup study</a> of a hundred Kaggle competitions used fresh,
identically distributed test sets that were available from the official competition organizers. The authors concluded that MOE does not appear to be significant in modern ML.</p>
<h2 id="conclusions">Conclusions</h2>
<p>To us the disquieting takeaway from Recht et al.’s results was that estimating MOE by creating a new test set is rife with systematic bias at best, and perhaps impossible, especially in datasets concerning rare or one-time phenomena (e.g., stock prices). Thus their work still left a pressing need for effective upper bounds on meta-overfitting error. Our Rip van Winkle’s Razor is elementary, and easily deployable by the average researcher. We hope it becomes part of the standard toolbox in Adaptive Data Analysis.</p>
Wed, 07 Apr 2021 14:00:00 -0700
http://offconvex.github.io/2021/04/07/ripvanwinkle/

When are Neural Networks more powerful than Neural Tangent Kernels?<p>The empirical success of deep learning has posed significant challenges to machine learning theory: Why can we efficiently train neural networks with gradient descent despite its highly non-convex optimization landscape? Why do over-parametrized networks generalize well? The recently proposed Neural Tangent Kernel (NTK) theory offers a powerful framework for understanding these questions, but still comes with limitations.</p>
<p>In this blog post, we explore how to analyze wide neural networks beyond the NTK theory, based on our recent <a href="https://arxiv.org/abs/1910.01619">Beyond Linearization paper</a> and follow-up <a href="https://arxiv.org/abs/2006.13436">paper on understanding hierarchical learning</a>. (This blog post is also cross-posted at the <a href="https://blog.einstein.ai/beyond-ntk/">Salesforce Research blog</a>.)</p>
<h3 id="neural-tangent-kernels">Neural Tangent Kernels</h3>
<p>The Neural Tangent Kernel (NTK) is a recently proposed theoretical framework for establishing provable convergence and generalization guarantees for wide (over-parametrized) neural networks <a href="https://arxiv.org/abs/1806.07572">(Jacot et al. 2018)</a>. Roughly speaking, the NTK theory shows that</p>
<ul>
<li>A sufficiently wide neural network trains like a linearized model governed by the derivative of the network with respect to its parameters.</li>
<li>At the infinite-width limit, this linearized model becomes a kernel predictor with the Neural Tangent Kernel (the NTK).</li>
</ul>
<p>Consequently, a wide neural network trained with small learning rate converges to 0 training loss and generalizes as well as the infinite-width kernel predictor. For a detailed introduction to the NTK, please refer to the earlier <a href="http://www.offconvex.org/2019/10/03/NTK/">blog post</a> by Wei and Simon.</p>
<h3 id="does-ntk-fully-explain-the-success-of-neural-networks">Does NTK fully explain the success of neural networks?</h3>
<p>Although the NTK yields powerful theoretical results, it turns out that real-world deep learning <em>does not operate in the NTK regime</em>:</p>
<ul>
<li>Empirically, infinite-width NTK kernel predictors perform slightly worse than (though competitive with) fully trained neural networks on benchmark tasks such as CIFAR-10 <a href="https://arxiv.org/abs/1904.11955">(Arora et al. 2019b)</a>. For finite width networks in practice, this gap is even more profound, as we see in Figure 1: The linearized network is a rather poor approximation of the fully trained network at practical optimization setups such as large initial learning rate <a href="https://arxiv.org/abs/2002.04010">(Bai et al. 2020)</a>.</li>
<li>Theoretically, the NTK has poor <em>sample complexity for learning certain simple functions</em>. Though the NTK is a universal kernel that can interpolate any finite, non-degenerate training dataset <a href="https://arxiv.org/abs/1810.02054">(Du et al. 2018</a><a href="https://arxiv.org/abs/1811.03804">, 2019)</a>, the test error of this kernel predictor scales with the RKHS norm of the ground truth function. For certain non-smooth but simple functions such as a single ReLU, this norm can be exponentially large in the feature dimension <a href="https://arxiv.org/abs/1904.00687">(Yehudai & Shamir 2019)</a>. Consequently, NTK analyses yield poor sample complexity upper bounds for learning such functions, whereas empirically neural nets only require a mild sample size <a href="https://arxiv.org/abs/1410.1141">(Livni et al. 2014)</a>.</li>
</ul>
<div style="text-align:center;">
<img style="width:700px;" src="http://www.offconvex.org/assets/taylor-plot.png" />
<br />
<i><b>Figure 1.</b>
Linearized model does not closely approximate the training trajectory of neural networks with practical optimization setups, whereas higher order Taylor models offer a substantially better approximation.
</i>
<br />
<br />
</div>
<p>These gaps urge us to ask the following</p>
<blockquote>
<p><strong>Question</strong>: How can we theoretically study neural networks beyond the NTK regime? Can we prove that neural networks outperform the NTK on certain learning tasks?</p>
</blockquote>
<p>The key technical question here is to mathematically understand neural networks operating <em>outside of the NTK regime</em>.</p>
<h2 id="higher-order-taylor-expansion">Higher-order Taylor expansion</h2>
<p>Our main tool for going beyond the NTK is the <em>Taylor expansion</em>. Consider a two-layer neural network with $m$ neurons, where we only train the “bottom” nonlinear layer $W$:</p>
\[f_{W_0 + W}(x) = \frac{1}{\sqrt{m}} \sum_{r=1}^m a_r \sigma( (w_{0,r} + w_r)^\top x).\]
<p>(Here, $W_0+W$ is an $m\times d$ weight matrix, where $W_0$ denotes the random initialization and $W$ denotes the trainable “movement” matrix initialized at zero). For small enough $W$, we can perform a Taylor expansion of the network around $W_0$ and get</p>
\[f_{W_0+W}(x) = \frac{1}{\sqrt{m}} \sum_{r=1}^m a_r \sigma(w_{0,r}^\top x) + \sum_{k=1}^\infty \frac{1}{\sqrt{m}} \sum_{r=1}^m a_r \frac{\sigma^{(k)} (w_{0,r}^\top x)}{k!} (w_r^\top x)^k\]
<p>Let us denote the $k$-th order term as $ f^{(k)}_{W_0, W}$, and rewrite this as</p>
\[f_{W_0+W}(x) = f^{(0)}_{W_0}(x) + \sum_{k=1}^\infty f^{(k)}_{W_0, W}(x).\]
<p>Above, the term $f^{(k)}$ is a $k$-th order polynomial of the trainable parameter $W$. For the moment, assume that $f^{(0)}(x)=0$ (this can be achieved via techniques such as symmetric initialization).</p>
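The quality of these truncated expansions is easy to check numerically. Below is a sketch with $\sigma = \tanh$ (so that $\sigma'$ and $\sigma''$ have closed forms); the width, dimension, and movement scale are arbitrary choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
m, d = 512, 16
W0 = rng.normal(size=(m, d))                  # random initialization
a = rng.choice([-1.0, 1.0], size=m)           # fixed top-layer weights
x = rng.normal(size=d) / np.sqrt(d)           # an input with ||x|| ~ 1

def f(W):  # two-layer network with sigma = tanh, trainable bottom layer
    return a @ np.tanh((W0 + W) @ x) / np.sqrt(m)

# Taylor terms around W0:  sigma' = 1 - tanh^2,  sigma'' = -2 tanh * sigma'
z = W0 @ x
s1 = 1 - np.tanh(z) ** 2
s2 = -2 * np.tanh(z) * s1
W = 0.1 * rng.normal(size=(m, d))             # a small movement matrix W
u = W @ x                                     # per-neuron w_r^T x
f0 = a @ np.tanh(z) / np.sqrt(m)
f1 = a @ (s1 * u) / np.sqrt(m)                # linear (NTK) term
f2 = a @ (s2 * u ** 2 / 2) / np.sqrt(m)       # quadratic term

r0 = abs(f(W) - f0)              # error of the zeroth-order approximation
r1 = abs(f(W) - (f0 + f1))       # ... after adding the linear term
r2 = abs(f(W) - (f0 + f1 + f2))  # ... after adding the quadratic term
print(f"residuals: {r0:.2e} -> {r1:.2e} -> {r2:.2e}")
```

Each added Taylor term shrinks the residual by roughly a factor of $\|w_r^\top x\|$, matching the series above.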
<p>The key insight of the NTK theory can be described as the following <strong>linearized approximation</strong> property</p>
<blockquote>
<p>For small enough $W$, the neural network $f_{W_0+W}$ is closely approximated by the linear model $f^{(1)}$.</p>
</blockquote>
<p>Towards moving beyond the linearized approximation, in our <a href="https://arxiv.org/abs/1910.01619">Beyond Linearization paper</a>, we start by asking</p>
<blockquote>
<p>Why just $f^{(1)}$? Can we also utilize higher-order terms in the Taylor series, such as $f^{(2)}$?</p>
</blockquote>
<p>At first sight, this seems rather unlikely, as in Taylor expansions we always expect the linear term $f^{(1)}$ to dominate the whole expansion and have a larger magnitude than $f^{(2)}$ (and subsequent terms).</p>
<h3 id="killing-the-ntk-term-by-randomized-coupling">“Killing” the NTK term by randomized coupling</h3>
<p>We bring forward the idea of <em>randomization</em>, which helps us escape the “domination” of $f^{(1)}$ and couple neural networks with their quadratic Taylor expansion term $f^{(2)}$. This idea appeared first in <a href="https://arxiv.org/abs/1811.04918">Allen-Zhu et al. (2018)</a> for analyzing three-layer networks, and as we will show also applies to two-layer networks in a perhaps more intuitive fashion.</p>
<p>Let us now assign each weight movement $w_r$ with a <em>random sign</em> $s_r\in\{\pm 1\}$, and consider the randomized weights $\{s_rw_r\}$. The random signs satisfy the following basic properties:</p>
\[E[s_r]=0 \quad {\rm and} \quad s_r^2 \equiv 1.\]
<p>Therefore, letting $SW\in\mathbb{R}^{m\times d}$ denote the randomized weight matrix, we can compare the first and second order terms in the Taylor expansion at $SW$:</p>
\[E_{S} \left[f^{(1)}_{W_0, SW}(x)\right] = E_{S} \left[ \frac{1}{\sqrt{m}}\sum_{r\le m} a_r \sigma'(w_{0,r}^\top x) (s_rw_r^\top x) \right] = 0,\]
<p>whereas</p>
\[f^{(2)}_{W_0, SW}(x) = \frac{1}{\sqrt{m}}\sum_{r\le m} a_r \frac{\sigma^{(2)}(w_{0,r}^\top x)}{2} (s_rw_r^\top x)^2 = \frac{1}{\sqrt{m}}\sum_{r\le m} a_r \frac{\sigma^{(2)}(w_{0,r}^\top x)}{2} (w_r^\top x)^2 = f^{(2)}_{W_0, W}(x).\]
<p>Observe that the sign randomization keeps the quadratic term $f^{(2)}$ unchanged, but “kills” the linear term $f^{(1)}$ in expectation!</p>
<p>If we train such a randomized network with freshly sampled signs $S$ at each iteration, the linear term $f^{(1)}$ will keep oscillating around zero and does not have any power in fitting the data, whereas the quadratic term is not affected at all and thus becomes the leading force for fitting the data. (The keen reader may notice that this randomization is similar to Dropout, with the key difference being that we randomize the weight <em>movement</em> matrix, whereas vanilla Dropout randomizes the weight matrix itself.)</p>
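A quick numerical check of this "killing" effect (a sketch with $\sigma = \tanh$ and arbitrary sizes): averaging $f^{(1)}$ over fresh random signs drives it to zero, while $f^{(2)}$ is pointwise invariant to the signs:

```python
import numpy as np

rng = np.random.default_rng(0)
m, d = 256, 8
W0 = rng.normal(size=(m, d))
a = rng.choice([-1.0, 1.0], size=m)
W = 0.1 * rng.normal(size=(m, d))   # weight movement matrix
x = rng.normal(size=d) / np.sqrt(d)

z, u = W0 @ x, W @ x
s1 = 1 - np.tanh(z) ** 2            # sigma'(w_{0,r}^T x)
s2 = -2 * np.tanh(z) * s1           # sigma''(w_{0,r}^T x)

def f1(s):  # linear Taylor term evaluated at the randomized movement S W
    return a @ (s1 * s * u) / np.sqrt(m)

def f2(s):  # quadratic Taylor term evaluated at S W
    return a @ (s2 * (s * u) ** 2 / 2) / np.sqrt(m)

signs = rng.choice([-1.0, 1.0], size=(10_000, m))   # fresh signs per step
mean_f1 = np.mean([f1(s) for s in signs])
invariant = np.allclose(f2(signs[0]), f2(np.ones(m)))
print(f"E_S[f1] ~ {mean_f1:+.1e};  f2 unchanged by the signs: {invariant}")
```

The quadratic term is literally unchanged because $(s_r w_r^\top x)^2 = (w_r^\top x)^2$, while the empirical average of the linear term shrinks toward zero as more fresh sign draws are averaged.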
<div style="text-align:center;">
<img style="width:700px;" src="http://www.offconvex.org/assets/beyond-ntk.png" />
<br />
<i><b>Figure 2.</b>
The NTK regime operates in the "NTK ball" where the network is approximately equal to the linear term. The quadratic regime operates in a larger ball where the network is approximately equal to the sum of first two terms, but the linear term dominates and can blow up at large width. Our randomized coupling technique resolves this by introducing the random sign matrix that in expectation "kills" the linear term but always preserves the quadratic term.
</i>
<br />
<br />
</div>
<p>Our first result shows that networks with sign randomization can still be efficiently optimized, despite its now non-convex optimization landscape:</p>
<blockquote>
<p><strong>Theorem</strong>: Any escaping-saddle algorithm (e.g. noisy SGD) on the regularized loss function $E_S[L(W_0+SW)]+R(W)$, with freshly sampled sign $S=S_t$ per iteration, can find the global minimum in polynomial time.</p>
</blockquote>
<p>The proof builds on the quadratic approximation $E_S[f]\approx f^{(2)}$ and recent understanding of neural networks with quadratic activation, e.g. <a href="https://arxiv.org/abs/1707.04926">Soltanolkotabi et al. (2017)</a> & <a href="https://arxiv.org/abs/1803.01206">Du and Lee (2018)</a>.</p>
<h3 id="generalization-and-sample-complexity-case-study-on-learning-low-rank-polynomials">Generalization and sample complexity: Case study on learning low-rank polynomials</h3>
<p>We next study the generalization of these networks in the context of learning <em>low-rank degree-$p$ polynomials</em>:</p>
\[f_\star(x) = \sum_{s=1}^{r_\star} \alpha_s (\beta_s^\top x)^{p_s}, \quad |\alpha_s|\le 1,\|(\beta_s^\top x)^{p_s}\|_{L_2} \le 1, p_s\le p \quad \textrm{for all } s.\]
<p>We are specifically interested in the case where $r_\star$ is small (e.g. $O(1)$), so that $y$ only depends on the projection of $x$ onto a few directions. This captures, for example, teacher networks with polynomial activations of bounded degree, teacher networks with analytic activations (approximately), as well as constant-depth teacher networks with polynomial activations.</p>
<p>For the NTK, the sample complexity of learning polynomials has been studied extensively in <a href="https://arxiv.org/abs/1901.08584">(Arora et al. 2019a)</a>, <a href="https://arxiv.org/abs/1904.12191">(Ghorbani et al. 2019)</a>, and many concurrent works. Combined, they showed that the sample complexity for learning degree-$p$ polynomials is $\Theta(d^p)$, with matching lower and upper bounds:</p>
<blockquote>
<p><strong>Theorem (NTK)</strong> : Suppose $x$ is uniformly distributed on the sphere, then the NTK requires $O(d^p)$ samples in order to achieve a small test error for learning any degree-$p$ polynomial, and there is a matching lower bound of $\Omega(d^p)$ for any inner-product kernel method.</p>
</blockquote>
<p>In our <a href="https://arxiv.org/abs/1910.01619">Beyond Linearization paper</a>, we show that the quadratic Taylor model achieves an improved sample complexity of $\tilde{O}(d^{p-1})$ with isotropic inputs:</p>
<blockquote>
<p><strong>Theorem (Quadratic Model)</strong>: For mildly isotropic input distributions, the two-layer quadratic Taylor model (or two-layer NN with sign randomization) only requires $\tilde{O}({\rm poly}(r_\star, p)d^{p-1})$ samples in order to achieve a small test error for learning a low-rank degree-$p$ polynomial.</p>
</blockquote>
<p>In our <a href="https://arxiv.org/abs/2006.13436">follow-up paper on understanding hierarchical learning</a>, we further design a “hierarchical learner” using a specific three-layer network, and show the following</p>
<blockquote>
<p><strong>Theorem (Three-layer hierarchical model)</strong>: Under mild input distribution assumptions, a three-layer network with a fixed representation layer of width $D=d^{p/2}$ and a trainable quadratic Taylor layer can achieve a small test error using only $\tilde{O}({\rm poly}(r_\star, p)d^{p/2})$ samples.</p>
</blockquote>
<p>When $r_\star,p=O(1)$, the quadratic Taylor model can improve over the NTK by a multiplicative factor of $d$, and we can further get a substantially larger improvement of $d^{p/2}$ by using the three-layer hierarchical learner. Here we briefly discuss the proof intuitions, and refer the reader to our papers for more details.</p>
<ul>
<li>
<p><strong>Generalization bounds</strong>: We show that, while the NTK and the quadratic Taylor model express functions using similar random feature constructions, their generalization depends differently on the norm of the input. In the NTK, generalization depends on the L2 norm of the features (as well as the weights), whereas generalization of the quadratic Taylor model depends on the operator norm of the input matrix features $\frac{1}{n}\sum x_ix_i^\top$ times the nuclear norm of $\sum w_rw_r^\top$. It turns out that this decomposition can match the one given by the NTK (it is never worse), and in addition be better by a factor of $O(\sqrt{d})$ if the input distribution is mildly isotropic so that $\|\frac{1}{n}\sum x_ix_i^\top\|_{\rm op} \le 1/\sqrt{d} \cdot \max \|x_i\|_2^2$, leading to the $O(d)$ improvement in the sample complexity.</p>
</li>
<li>
<p><strong>Hierarchical learning</strong>: The key intuition behind the hierarchical learner is that we can utilize the $O(d)$ sample complexity gain to its fullest by applying the quadratic Taylor model not to the input $x$, but to a feature representation $h(x)\in \mathbb{R}^D$ where $D\gg d$. This yields a gain as long as $h$ is rich enough to express $f_\star$ and also isotropic enough to let the operator norm $\|\frac{1}{n}\sum h(x_i)h(x_i)^\top\|_{\rm op}$ be nice. In particular, for learning degree-$p$ polynomials, the best we can do is to choose $D=d^{p/2}$, leading to a sample complexity saving of $\tilde{O}(D)=\tilde{O}(d^{p/2})$.</p>
</li>
</ul>
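The operator-norm condition in the first bullet is easy to check numerically; in this sketch, standard Gaussian inputs stand in for a "mildly isotropic" distribution:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 5000, 100
X = rng.normal(size=(n, d))            # mildly isotropic inputs

cov = X.T @ X / n                      # (1/n) * sum_i x_i x_i^T
op_norm = np.linalg.norm(cov, ord=2)   # operator (spectral) norm
max_sq = np.max(np.linalg.norm(X, axis=1) ** 2)

print(f"||(1/n) sum x x^T||_op  = {op_norm:.2f}")
print(f"max ||x_i||^2 / sqrt(d) = {max_sq / np.sqrt(d):.2f}")
```

The operator norm concentrates near $1$ while $\max_i \|x_i\|_2^2 / \sqrt{d}$ is an order of magnitude larger, which is exactly the slack the quadratic Taylor model's bound exploits over the NTK's L2-norm-based bound.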
<h3 id="concluding-thoughts">Concluding thoughts</h3>
<p>In this post, we explored higher-order Taylor expansions (in particular the quadratic expansion) as an approach to deep learning theory beyond the NTK regime. The Taylorization approach has several advantages:</p>
<ul>
<li>Non-convex but benign optimization landscape;</li>
<li>Provable generalization benefits over NTKs;</li>
<li>Ability of modeling hierarchical learning;</li>
<li>Convenient API for experimentation (cf. the <a href="https://github.com/google/neural-tangents">Neural Tangents</a> package and the <a href="https://arxiv.org/abs/2002.04010">Taylorized training</a> paper).</li>
</ul>
<p>We believe these advantages make the Taylor expansion a powerful tool for deep learning theory, and our results are just a beginning. We also remark that there are other theoretical frameworks such as the <a href="https://arxiv.org/abs/1909.08156">Neural Tangent Hierarchy</a> or the <a href="https://arxiv.org/abs/1804.06561">Mean-Field Theory</a> that go beyond the NTK with their own advantages in various angles, but without computational efficiency guarantees. See the <a href="https://jasondlee88.github.io/slides/beyond_ntk.pdf">slides</a> for more on going beyond NTK. Making progress on any of these directions (or coming up with new ones) would be an exciting direction for future work.</p>
Thu, 25 Mar 2021 07:00:00 -0700
http://offconvex.github.io/2021/03/25/beyondNTK/

Beyond log-concave sampling (Part 3)<p>In the <a href="http://www.offconvex.org/2020/09/19/beyondlogconvavesampling">first post</a> of this series, we introduced the challenges of sampling distributions beyond log-concavity. In <a href="http://www.offconvex.org/2021/03/01/beyondlogconcave2/">Part 2</a> we tackled sampling from <em>multimodal</em> distributions: a typical obstacle occurring in problems involving statistical inference and posterior sampling in generative models. In this (final) post of the series, we consider sampling in the presence of <em>manifold structure in the level sets of the distribution</em>, which also frequently manifests in the same settings. It will cover the paper <a href="https://arxiv.org/abs/2002.05576">Fast convergence for Langevin diffusion with matrix manifold structure</a> by Ankur Moitra and Andrej Risteski.</p>
<h1 id="sampling-with-matrix-manifold-structure">Sampling with matrix manifold structure</h1>
<p>The structure on the distribution we consider in this post is <em>manifolds</em> of equiprobable points: this is natural, for instance, in the presence of invariances in data (e.g. rotations of images). It can also appear in neural-network based probabilistic models due to natural invariances they encode (e.g., scaling invariances in ReLU-based networks).</p>
<center>
<img src="http://www.andrew.cmu.edu/user/aristesk/manifold.jpg" width="400" />
</center>
<p>At the level of techniques, the starting point for our results is a close connection between the geometry, more precisely <em>Ricci curvature</em> of a manifold, and the mixing time of Brownian motion on a manifold. The following theorem holds:</p>
<blockquote>
<p><strong>Theorem (Bakry and Émery ‘85, informal)</strong>: If the manifold $M$ has positive Ricci curvature, Brownian motion on the manifold mixes rapidly in $\chi^2$ divergence.</p>
</blockquote>
<p>We will explain the notions from differential geometry shortly, but first we sketch our results, and how they use this machinery. We present two results: the first is a “meta”-theorem that provides a generic decomposition framework, and the second is an instantiation of this framework for a natural family of problems that exhibit manifold structure: posteriors for matrix factorization, sensing, and completion.
</p>
<h2 id="a-general-manifold-decomposition-framework">A general manifold decomposition framework</h2>
<p>Our first result is a general decomposition framework for analyzing mixing time of Langevin in the presence of manifolds of equiprobable points.</p>
<p>To motivate the result, note that if we consider the distribution $p_{\beta}(x) \propto e^{-\beta f(x)}$ for large (but finite) $\beta$, the Langevin chain corresponding to that distribution, started close to a manifold of local minima, will tend to stay close to (but not on!) it for a long time. See the figure below for an illustration. Thus, we will state a “robust” version of the above manifold result, for a chain that’s allowed to go off the manifold.</p>
<center>
<img src="http://www.andrew.cmu.edu/user/aristesk/single_manifold.gif" width="300" />
</center>
<p>We show the following statement. (Recall that a bounded Poincaré constant corresponds to rapid mixing for Langevin. See the <a href="http://www.offconvex.org/2020/09/19/beyondlogconvavesampling">first post</a> for a refresher.)</p>
<blockquote>
<p><strong>Theorem 1 (Moitra and Risteski ‘20, informal)</strong>:
Suppose the Langevin chain corresponding to $p(x) \propto e^{-f(x)}$ is initialized close to a manifold $M$ satisfying the following two properties:
<br /><br />
(1) It stays in some neighborhood $D$ of the manifold $M$ with large probability for a long time.
<br /><br />
(2) $D$ can be partitioned into manifolds $M^{\Delta}$ satisfying:
<br /><br />
(2.1) The conditional distribution of $p$ restricted to $M^{\Delta}$ has an upper bounded Poincaré constant.
<br /><br />
(2.2) The marginal distribution over $\Delta$ has an upper bounded Poincaré constant.
<br /><br />
(2.3) The conditional probability distribution over $M^{\Delta}$ does not “change too quickly” as $\Delta$ changes.
<br /><br />
Then Langevin mixes quickly to a distribution close to the conditional distribution of $p$ restricted to $D$.</p>
</blockquote>
<center>
<img src="http://www.andrew.cmu.edu/user/aristesk/partition_illustration.gif" />
</center>
<p>While the above theorem is a bit of a mouthful (even very informally stated) and requires a choice of partitioning of $D$ to be “instantiated”, it’s quite natural to think of it as an analogue of local convergence results for gradient descent in optimization. Namely, it gives geometric conditions under which Langevin started near a manifold mixes to the “local” stationary distribution (i.e. the conditional distribution $p$ restricted to $D$).</p>
<p>The proof of the theorem uses decomposition ideas similar to those in the result on sampling multimodal distributions from the <a href="http://www.offconvex.org/2021/03/01/beyondlogconcave2/">previous post</a>, though it is complicated by measure-theoretic arguments. Namely, the manifolds $M^{\Delta}$ technically have zero measure under the distribution $p$, so care must be taken with how the “projected” and “restricted” chains are defined - the key tool for this is the so-called <a href="https://en.wikipedia.org/wiki/Smooth_coarea_formula">co-area formula</a>.</p>
<p>The challenge in using the above framework is instantiating the decomposition: namely, the choice of the partition of $D$ into manifolds $M^{\Delta}$. In the next section, we show how this can be done for posteriors in problems like matrix factorization/sensing/completion.</p>
<h2 id="matrix-factorization-and-relatives">Matrix factorization (and relatives)</h2>
<p>To instantiate the above framework in a natural setting, we consider distributions exhibiting invariance under orthogonal transformations. Namely, we consider distributions of the type</p>
\[p: \mathbb{R}^{d \times k} \to \mathbb{R}, \hspace{0.5cm} p(X) \propto e^{-\beta \| \mathcal{A}(XX^T) - b \|^2_2}\]
<p>where $b \in \mathbb{R}^{m}$ is a fixed vector and $\mathcal{A}$ is an operator that returns an $m$-dimensional vector given a $d \times d$ matrix. For this distribution, we have $p(X) = p(XO)$ for any orthogonal matrix $O$, since $XX^T = XO(XO)^T$. Depending on the choice of $\mathcal{A}$, we can easily recover some familiar functions inside the exponential: e.g. the $l_2$ losses for (low-rank) matrix factorization, matrix sensing and matrix completion. These losses have received a lot of attention as simple examples of objectives that are non-convex but can still be optimized using gradient descent. (See e.g. <a href="https://arxiv.org/abs/1704.00708">Ge et al. ‘17</a>.)</p>
<p>These distributions also have a very natural statistical motivation. Namely, consider the distribution over $m$-dimensional vectors, such that</p>
\[b = \mathcal{A}(XX^T) + n, \hspace{0.5cm} n \sim N\left(0,\frac{1}{\sqrt{\beta}}I\right).\]
<p>Then, the distribution $p(X) \propto e^{-\beta \| \mathcal{A}(XX^T) - b \|^2_2}$ can be viewed as the posterior distribution over $X$ with a uniform prior. Thus, sampling from these distributions can be seen as the distributional analogue of problems like matrix factorization/sensing/completion, the difference being that we are not merely trying to find the <em>most likely</em> matrix $X$, but also trying to sample from the posterior.</p>
<p>We will consider the case when $\beta$ is sufficiently large (in particular, $\beta = \Omega(\mbox{poly}(d))$): in this case, the distribution $p$ will concentrate over two (separated) manifolds: $E_1 = \{X_0 R: R \mbox{ is orthogonal with det } 1\}$ and $E_2 = \{X_0 R: R \mbox{ is orthogonal with det } -1\}$, where $X_0$ is any fixed minimizer of $\| \mathcal{A}(XX^T) - b \|^2_2$. Hence, when started near one of these manifolds, we expect Langevin to stay close to it for a long time (see figure below).</p>
<center>
<img src="http://www.andrew.cmu.edu/user/aristesk/langevin_matrix.gif" width="500" />
</center>
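To get a feel for this setup, the sketch below (my own, not from the paper) runs discretized Langevin dynamics on $p(X) \propto e^{-\beta \|XX^T - M\|_F^2}$, i.e. taking $\mathcal{A}$ to be the identity operator (plain matrix factorization); the dimensions, step size, and $\beta$ are arbitrary illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)

d, k, beta = 6, 2, 50.0
X_true = rng.standard_normal((d, k))
M = X_true @ X_true.T  # "ground truth"; any X_true R with R orthogonal gives the same M

def grad(X):
    # gradient of g(X) = ||X X^T - M||_F^2 is 4 (X X^T - M) X (using that M is symmetric)
    return 4.0 * (X @ X.T - M) @ X

eta = 1e-4
X = X_true + 0.01 * rng.standard_normal((d, k))  # initialize near the manifold {X_true R}
for _ in range(20_000):
    X = X - eta * beta * grad(X) + np.sqrt(2 * eta) * rng.standard_normal((d, k))

# The iterate wanders along the manifold (X itself keeps moving through
# orthogonal rotations), but X X^T stays close to M, as the theory predicts.
print(np.linalg.norm(X @ X.T - M) / np.linalg.norm(M))
```

The printed relative error stays small even after many steps, illustrating that for large $\beta$ the chain remains in a neighborhood of the manifold it was initialized near.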
<p>We show:</p>
<blockquote>
<p><strong>Theorem 2 (Moitra and Risteski ‘20, informal)</strong>: Let $\mathcal{A}$ correspond to matrix factorization, sensing or completion under standard parameter assumptions for these problems. Let $\beta = \Omega(\mbox{poly}(d))$.
If initialized close to one of $E_i, i \in \{1, 2\}$, after a polynomial number of steps the discretized Langevin dynamics will converge to a distribution that is close in total variation distance
to $p(X)$ restricted to a neighborhood of $E_i$.</p>
</blockquote>
<p>We remark that the required closeness at initialization is easy to ensure using existing results on gradient-descent-based optimization for these objectives. It’s also easy to use the above result to sample approximately from the distribution $p$ itself, rather than only from its restrictions to neighborhoods of $E_1$ and $E_2$ - this is because $p$ looks like the “disjoint union” of these two restricted distributions.</p>
<p>Before we describe the main elements of the proof, we review some concepts from differential geometry.</p>
<h2 id="extremely-brief-intro-to-differential-geometry">(Extremely) brief intro to differential geometry</h2>
<p>We won’t do a full primer on differential geometry in this blog post, but we will briefly informally describe some of the relevant concepts. See Section 5 of <a href="https://arxiv.org/abs/2002.05576">our paper</a> for an intro to differential geometry (written with a computer science reader in mind, so more easy-going than a differential geometry textbook).</p>
<p>Recall that the <em>tangent</em> space at $x$, denoted $T_x M$, is the set of all derivatives $v$ of curves passing through $x$. The <em>Ricci curvature</em> at a point $x$, in direction $v \in T_x M$, denoted $\mbox{Ric}_x(v)$, captures the second-order term in the rate of change of volumes of sets in a small neighborhood around $x$, as the points in the set are moved along the geodesic (i.e. shortest-path curve) in direction $v$ (more precisely, each point $y$ in the set is moved along the geodesic in the direction of the parallel transport of $v$ at $y$; see the right part of the figure below from <a href="https://projecteuclid.org/euclid.aspm/1543086328">(Ollivier 2010)</a>). A Ricci curvature of $0$ preserves volumes (think: a plane), a Ricci curvature $>0$ shrinks volumes (think: a sphere), and a Ricci curvature $<0$ expands volumes (think: hyperbolic space).</p>
<center>
<img src="http://www.andrew.cmu.edu/user/aristesk/tangent.jpg" width="800" />
</center>
<p>The connection between curvature and the mixing time of diffusions is rather deep, and we won’t attempt to convey it fully in a blog post - the definitive reference is <a href="https://link.springer.com/book/10.1007/978-3-319-00227-9">Analysis and Geometry of Markov Diffusion Operators</a> by Bakry, Gentil and Ledoux. The main idea is that mixing time can be bounded by how long it takes for random walks starting at different locations to “join together,” and positive curvature brings them together faster.</p>
<p>To make this formal, we define a <em>coupling</em> of two random variables $X, Y$ to be any random variable $W = (X', Y')$ such that the marginal distributions of the coordinates $X'$ and $Y'$ are the same as the distributions of $X$ and $Y$. It’s well known that the convergence time of a random walk in total variation distance can be upper bounded by the expected time until two coupled copies of the walk join. On the plane, a canonical coupling (the <em>reflection coupling</em>) between two Brownian motions can be constructed by reflecting the move of the second process through the perpendicular bisector between the locations of the two processes (see figure below). On a positively curved manifold (like a sphere), an analogous reflection can be defined, and the curvature only brings the two processes closer faster.</p>
<center>
<img src="http://www.andrew.cmu.edu/user/aristesk/reflection.jpg" width="500" />
</center>
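Here is a toy discretization (my own, not from the paper) of the reflection coupling for two planar Brownian motions. The step size, starting points, and meeting threshold are arbitrary choices; the copies are declared merged once they come within one step’s reach of each other:

```python
import numpy as np

rng = np.random.default_rng(0)

def reflection_coupling(x0, y0, dt=0.01, max_steps=1_000_000):
    """Run two coupled planar Brownian motions: the second copy takes the first
    copy's increment reflected through the hyperplane perpendicular to x - y.
    Returns the (continuous) time at which the copies meet, or inf."""
    x, y = np.array(x0, dtype=float), np.array(y0, dtype=float)
    for step in range(max_steps):
        diff = x - y
        dist = np.linalg.norm(diff)
        if dist < np.sqrt(dt):  # within one step's reach: declare the copies merged
            return step * dt
        e = diff / dist
        xi = np.sqrt(dt) * rng.standard_normal(2)
        x = x + xi
        y = y + xi - 2 * np.dot(xi, e) * e  # the reflected increment
        # note: the distance ||x - y|| then evolves as a one-dimensional Brownian
        # motion (with doubled variance), which hits zero in finite time almost surely
    return np.inf

t = reflection_coupling([0.0, 0.0], [0.5, 0.0])
print(t)
```

The increments of the two copies agree in the directions perpendicular to $x - y$ and are negated along it, so the distance between the copies performs a one-dimensional random walk that eventually hits zero; this is exactly the mechanism that curvature accelerates on a positively curved manifold.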
<p>As a final tool, our proof uses a very important theorem due to <a href="https://www.sciencedirect.com/science/article/pii/S0001870876800023">Milnor</a> about manifolds with algebraic structure:</p>
<blockquote>
<p><strong>Theorem (Milnor ‘76, informal)</strong>: The Ricci curvature of a Lie group equipped with a left-invariant metric is non-negative.</p>
</blockquote>
<p>In a nutshell, a Lie group is a group that is also a smooth manifold, such that the group operations are smooth transformations on the manifold - so the “geometry” and “algebra” fit together. A metric is left-invariant for the group if acting on the left by any group element leaves the metric “unchanged”.</p>
<h2 id="implementing-the-decomposition-framework">Implementing the decomposition framework</h2>
<p>To apply the framework we sketched out as part of Theorem 1, we need to verify the conditions of the Theorem.</p>
<p>To prove <strong>Condition 1</strong>, we need to show that for large $\beta$, the random walk stays near the manifold it’s been initialized close to. The main tools for this are <a href="https://en.wikipedia.org/wiki/It%C3%B4%27s_lemma">Ito’s lemma</a>, local convexity of the function $\| \mathcal{A}(XX^T) - b \|_2^2$, and basic results in the theory of <a href="https://en.wikipedia.org/wiki/Cox%E2%80%93Ingersoll%E2%80%93Ross_model">Cox-Ingersoll-Ross</a> processes. Namely, Ito’s lemma (which can be viewed as a “change-of-variables” formula for random variables) allows us to write down a stochastic differential equation for the evolution of the distance of $X$ from the manifold, which turns out to have a “bias” towards small values, due to the local convexity of $\| \mathcal{A}(XX^T) - b \|_2^2$. This can in turn be analyzed approximately as a Cox-Ingersoll-Ross process - a well-studied type of non-negative stochastic process.</p>
<p>To prove <strong>Condition 2</strong>, we need to specify the partition of the space around the manifolds $E_i$. Describing the full partition is somewhat technical, but importantly, the manifolds $M^{\Delta}$ have the form $M^{\Delta} = \{\Delta U: U \mbox{ is an orthogonal matrix with det 1}\}$ for some matrix $\Delta \in \mathbb{R}^{d \times k}$.</p>
<p>The proof that $M^{\Delta}$ has a good Poincaré constant (i.e. Condition 2.1) relies on two ideas: first, $M^{\Delta}$ is a Lie group with group operation $\circ$ defined such that $(\Delta U) \circ (\Delta V) := \Delta (UV)$, along with a corresponding left-invariant metric - thus, by Milnor’s theorem, it has non-negative Ricci curvature; second, we can relate the Ricci curvature under the Euclidean metric to that under the left-invariant metric. The proof that the marginal distribution over $\Delta$ has a good Poincaré constant involves showing that this distribution is approximately log-concave. Finally, the “change-of-conditional-probability” condition (Condition 2.3) can be proved by explicit calculation.</p>
<h1 id="closing-remarks">Closing remarks</h1>
<p>In this series of posts, we surveyed two recent approaches to analyzing Langevin-like sampling algorithms <em>beyond log-concavity</em> - the most natural analogue to non-convexity in the world of sampling/inference. The structures we considered, <em>multi-modality</em> and <em>invariant manifolds</em>, are common in practice in modern machine learning.</p>
<p>Unlike non-convex optimization, provable guarantees for sampling beyond log-concavity is still under-studied and we hope our work will inspire and excite further efforts. For instance, how do we handle modes of different “shape”? Can we handle an exponential number of modes, if they have further structure (e.g., posteriors in concrete latent-variable models like Bayesian networks)? Can we handle more complex manifold structure (e.g. the matrix distributions we considered for <em>any</em> $\beta$)?</p>
Fri, 12 Mar 2021 06:00:00 -0800
http://offconvex.github.io/2021/03/12/beyondlogconcave3/
http://offconvex.github.io/2021/03/12/beyondlogconcave3/Beyond log-concave sampling (Part 2)<p>In our previous <a href="http://www.offconvex.org/2020/09/19/beyondlogconvavesampling">blog post</a>, we introduced the challenges of sampling distributions beyond log-concavity.
We first introduced the problem of sampling from a distribution $p(x) \propto e^{-f(x)}$ given value or gradient oracle access to $f$, as an analogous problem to black-box optimization with oracle access. We then described the natural algorithm for sampling in this setup: Langevin Monte Carlo, a Markov chain reminiscent of noisy gradient descent,</p>
\[x_{t+\eta} = x_t - \eta \nabla f(x_t) + \sqrt{2\eta}\xi_t,\quad \xi_t\sim N(0,I).\]
<p>Finally, we laid out the challenges when $f$ is not convex; in particular, LMC can suffer from slow mixing.</p>
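As a refresher, the LMC recursion above takes only a few lines to implement. A minimal sketch (my own, with arbitrary parameters) sampling from a standard Gaussian, i.e. $f(x) = x^2/2$, a log-concave case where LMC is known to mix rapidly:

```python
import numpy as np

rng = np.random.default_rng(0)

def grad_f(x):
    return x  # f(x) = x^2 / 2, so p(x) ∝ e^{-f(x)} is the standard Gaussian

eta, n_steps = 1e-2, 50_000
x = 3.0  # deliberately start far from the bulk of the distribution
samples = np.empty(n_steps)
for t in range(n_steps):
    # x_{t+eta} = x_t - eta * grad f(x_t) + sqrt(2 * eta) * xi_t
    x = x - eta * grad_f(x) + np.sqrt(2 * eta) * rng.standard_normal()
    samples[t] = x

tail = samples[10_000:]  # discard burn-in
# the empirical moments should approach the Gaussian's mean 0 and variance 1
print(tail.mean(), tail.var())
```

Note the samples are correlated across steps (the chain moves by small increments), so matching the target's moments requires many iterations relative to the step size.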
<p>In this and the coming post, we describe two of our recent works tackling this problem. We identify two kinds of structure beyond log-concavity under which we can design provably efficient algorithms: <em>multi-modality</em> and <em>manifold structure in the level sets</em>. These structures commonly occur in practice, especially in problems involving statistical inference and posterior sampling in generative models.</p>
<p>In this post, we will focus on multimodality, covered by the paper <a href="https://arxiv.org/abs/1812.00793">Simulated tempering Langevin Monte Carlo</a> by Rong Ge, Holden Lee, and Andrej Risteski.</p>
<h1 id="sampling-multimodal-distributions-with-simulated-tempering">Sampling multimodal distributions with simulated tempering</h1>
<p>The classical scenario in which Langevin takes exponentially long to mix is when $p$ is a mixture of two well-separated Gaussians. In broadest generality, this was considered by <a href="http://www.ems-ph.org/journals/show_abstract.php?issn=1435-9855%20&vol=6&iss=4&rank=1">Bovier et al. 2004</a>, who used tools from metastable processes to show that transitioning from one peak to another can take exponential time. Roughly speaking, they show the transition time is proportional to the “energy barrier” a particle has to cross. If the Gaussians have unit variance and means at distance $2r$, then the probability density at a point midway in between is $\propto e^{-r^2/2}$, and this energy barrier is $\propto e^{r^2/2}$. Thus, the mixing time is exponential in $r^2$. Qualitatively, the intuition for this phenomenon is simple to describe: if started at point A, the drift (i.e. gradient) term will push the walk towards A, so long as it’s close to the basin around A; hence, to transition from A to B (through C), the Gaussian noise must persistently counteract the gradient term.</p>
<center>
<img src="http://www.andrew.cmu.edu/user/aristesk/animation_bovier.gif" width="500" />
</center>
<p>Hence Langevin on its own will not work even in very simple multimodal settings.</p>
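This failure is easy to reproduce numerically. A sketch (my own; parameters arbitrary) running LMC on an equal mixture of unit-variance Gaussians centered at $\pm r$, started in the left mode:

```python
import numpy as np

rng = np.random.default_rng(0)

r = 5.0  # modes at -r and +r; the energy barrier between them is ~ r^2 / 2

def grad_f(x):
    # f = -log(0.5 N(-r,1) + 0.5 N(r,1)); w is the posterior weight of the +r mode
    w = 1.0 / (1.0 + np.exp(-2.0 * r * x))
    return (1 - w) * (x + r) + w * (x - r)

eta, n_steps = 1e-2, 200_000
x, in_right_mode = -r, 0
for _ in range(n_steps):
    x = x - eta * grad_f(x) + np.sqrt(2 * eta) * rng.standard_normal()
    in_right_mode += x > 0

# Although the two modes carry equal probability, within this time budget the
# chain essentially never leaves the basin it started in.
print(in_right_mode / n_steps)
```

A correct sampler should spend about half its time at $x > 0$; here the fraction stays near zero, matching the exponential transition-time bound above.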
<p>In <a href="https://arxiv.org/abs/1812.00793">our paper</a>, we show that combining Langevin Monte Carlo with a temperature-based heuristic called <em>simulated tempering</em> can significantly speed up mixing for multimodal distributions, where the number of modes is not too large, and the modes “look similar.”</p>
<p>More precisely, we show:</p>
<blockquote>
<p><strong>Theorem (Ge, Lee, Risteski ‘18, informal)</strong>: If $p(x)$ is a mixture of $k$ shifts of a strongly log-concave distribution in $d$ dimensions (e.g. Gaussian), an algorithm based on simulated tempering and Langevin Monte Carlo that runs in time poly($d,k, 1/\varepsilon$) produces samples from a distribution $\varepsilon$-close to $p$ in total variation distance.</p>
</blockquote>
<p>The main idea is to create a meta-Markov chain (the simulated tempering chain) which has two types of moves: change the current “temperature” of the sample, or move “within” a temperature. The main intuition behind this is that at higher temperatures, the distribution is flatter, so the chain explores the landscape faster (see the figure below).</p>
<center>
<img src="http://www.andrew.cmu.edu/user/aristesk/animation_tempering.gif" />
</center>
<p>More formally, the distribution at inverse temperature $\beta$ is given by $p_\beta(x) \propto e^{-\beta f(x)}$. The Langevin chain which corresponds to $\beta$ is given by</p>
\[x_{t+\eta} = x_t - \eta \beta \nabla f(x_t) + \sqrt{2\eta}\xi_t,\quad \xi_t\sim N(0,I).\]
<p>As in the figure above, a high temperature (low $\beta<1$) flattens out the distribution and causes the chain to mix faster (top distribution in figure). However, we can’t merely run Langevin at a higher temperature, because the stationary distribution of the high-temperature chain is wrong: it’s $p_\beta(x)$. The idea behind simulated tempering is to run Langevin chains at different temperatures, sometimes swapping to another temperature to help lower-temperature chains explore. To maintain the right stationary distributions at each temperature, we use a Metropolis-Hastings filtering step.</p>
<p>More formally, choosing a suitable sequence $0< \beta_1< \cdots <\beta_L=1$, we define the simulated tempering chain as follows.</p>
<p><img style="float: right;" src="http://holdenlee.github.io/pics/stl.png" width="300" /></p>
<ul>
<li>The <em>state space</em> is the set of pairs of a temperature and a location in space: $(i, x)$, $i \in [L]$, $x \in \mathbb{R}^d$.</li>
<li>The <em>transitions</em> are defined as follows.
<ul>
<li>If the current point is $(i,x)$, then <em>evolve</em> $x$ according to Langevin diffusion with inverse temperature $\beta_i$.</li>
<li>Propose swaps with some rate $\lambda > 0$. Proposing a swap means attempting to move to a neighboring chain, i.e. change $i$ to $i' = i \pm 1$. With probability $\min\{p_{i'}(x)/p_i(x), 1\}$, the transition is accepted. Otherwise, stay at the same point. This is a <em>Metropolis-Hastings step</em>; its purpose is to preserve the stationary distribution.</li>
</ul>
</li>
</ul>
<p>Finally, it’s not too hard to see that at the stationary distribution, the samples at the $L$th level ($\beta_L=1$) are the desired samples.</p>
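Putting the pieces together, here is a toy sketch of the tempering meta-chain (my own; parameters arbitrary). One simplification to flag: the acceptance rule below uses the unnormalized densities and ignores the relative partition functions of the temperature levels, which the full algorithm must estimate; for this symmetric toy target the omission does not affect the qualitative behavior:

```python
import numpy as np

rng = np.random.default_rng(0)

r = 3.0  # target: equal mixture of N(-r, 1) and N(r, 1)

def f(x):
    # negative log-density of the mixture, up to an additive constant
    return -np.logaddexp(-0.5 * (x + r) ** 2, -0.5 * (x - r) ** 2)

def grad_f(x):
    w = 1.0 / (1.0 + np.exp(-2.0 * r * x))
    return (1 - w) * (x + r) + w * (x - r)

betas = [0.1, 0.3, 1.0]       # beta_1 < beta_2 < beta_L = 1
eta, swap_rate = 2e-2, 0.05
i, x = 0, -r                  # start hot, in the left mode
level_L = []
for _ in range(200_000):
    # Langevin step at the current inverse temperature beta_i
    x = x - eta * betas[i] * grad_f(x) + np.sqrt(2 * eta) * rng.standard_normal()
    # occasionally propose moving to a neighboring temperature (Metropolis-Hastings)
    if rng.random() < swap_rate:
        j = i + (1 if rng.random() < 0.5 else -1)
        if 0 <= j < len(betas) and np.log(rng.random()) < -(betas[j] - betas[i]) * f(x):
            i = j
    if i == len(betas) - 1:   # record samples only at the target temperature
        level_L.append(x)

level_L = np.array(level_L)
# unlike plain Langevin, the tempering chain visits both modes at beta = 1
print((level_L > 0).mean(), len(level_L))
```

Compare with the single-temperature experiment above: the hot levels let the particle drift across the barrier, and swaps carry that progress down to $\beta_L = 1$.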
<h2 id="proof-idea-decomposition-theorem">Proof idea: decomposition theorem</h2>
<p>The main strategy is inspired by Madras and Randall’s <a href="https://www.jstor.org/stable/2699896">Markov chain decomposition theorem</a>, which gives a criterion for a Markov chain to mix rapidly: partition the state space into sets, and show that</p>
<ol>
<li>The Markov chain mixes rapidly when restricted to each set of the partition.</li>
<li>The <em>projected</em> Markov chain, which we define momentarily, mixes rapidly. If there are $m$ sets, the projected chain $\overline M$ is defined on the state space $\{1,\ldots, m\}$, and its transition probabilities are given by the average probability flows between the corresponding sets.</li>
</ol>
<p>To implement this strategy, we first have to specify the partition. In fact, we roughly show that there is a partition of $[L] \times \mathbb{R}^d$ in which:</p>
<ol>
<li>The simulated tempering Langevin chain mixes fast within each of the sets.</li>
<li>The “volume” of the sets (under the stationary distribution of the tempering chain) is not too small.</li>
</ol>
<p>In applying the Madras-Randall framework with this partition, it’s clear that point (1) above satisfies requirement (1) for the framework; point (2) ensures that the projected Markov chain has no “bottlenecks” and hence that it mixes rapidly (requirement (2)). More precisely, we can show rapid mixing either through the method of canonical paths or Cheeger’s inequality. To do this, we exhibit a “good-probability” path between any two sets in the partition, going through the highest temperature.</p>
<p>The intuition for why this path works is illustrated in the figure below: when transitioning from the set corresponding to the left mode at level $L$ to the right mode at level $L$, each of the steps up/down the temperatures are accepted with good probability if the neighboring temperatures are not too different; at the highest temperature, the chain mixes fast by point (1), and since each of the sets are not too small by point (2), there is a reasonable probability to end at the right mode at the highest temperature.</p>
<center>
<img src="http://www.andrew.cmu.edu/user/aristesk/animation_conductance.gif" />
</center>
<p>Intuitively, the partition should track the “modes” of the distribution, but a technical hurdle in implementing this plan is defining the partition when the modes overlap. One can either do this spectrally (i.e. show that the Langevin chain has a spectral gap, and use theorems about <a href="https://arxiv.org/abs/1309.3223">spectral graph partitioning</a>, as we did in the <a href="https://arxiv.org/abs/1710.02736">first version</a> of the paper), or use a functional “soft decomposition theorem”, a more flexible version of the classical decomposition theorem, as we do in a <a href="https://arxiv.org/abs/1812.00793">later version</a> of the paper.</p>
Mon, 01 Mar 2021 06:00:00 -0800
http://offconvex.github.io/2021/03/01/beyondlogconcave2/
http://offconvex.github.io/2021/03/01/beyondlogconcave2/Can implicit regularization in deep learning be explained by norms?<p>This post is based on my <a href="https://arxiv.org/pdf/2005.06398.pdf">recent paper</a> with <a href="https://noamrazin.github.io/">Noam Razin</a> (to appear at NeurIPS 2020), studying the question of whether norms can explain implicit regularization in deep learning.
TL;DR: we argue they cannot.</p>
<h2 id="implicit-regularization--norm-minimization">Implicit regularization = norm minimization?</h2>
<p>Understanding the implicit regularization induced by gradient-based optimization is possibly the biggest challenge facing theoretical deep learning these days.
In classical machine learning we typically regularize via norms, so it seems only natural to hope that in deep learning something similar is happening under the hood, i.e. the implicit regularization strives to find minimal norm solutions.
This is actually the case in the simple setting of overparameterized linear regression $-$ there, by a folklore analysis (cf. <a href="https://openreview.net/pdf?id=Sy8gdB9xx">Zhang et al. 2017</a>), gradient descent (and any other reasonable gradient-based optimizer) initialized at zero is known to converge to the minimal Euclidean norm solution.
A spate of recent works (see <a href="https://arxiv.org/pdf/2005.06398.pdf">our paper</a> for a thorough review) has shown that for various other models an analogous result holds, i.e. gradient descent (when initialized appropriately) converges to solutions that minimize a certain (model-dependent) norm.
On the other hand, as discussed last year in posts by <a href="http://www.offconvex.org/2019/06/03/trajectories/">Sanjeev</a> as well as <a href="http://www.offconvex.org/2019/07/10/trajectories-linear-nets/">Wei and myself</a>, mounting theoretical and empirical evidence suggest that it may not be possible to generally describe implicit regularization in deep learning as minimization of norms.
Which is it then?</p>
<h2 id="a-standard-test-bed-matrix-factorization">A standard test-bed: matrix factorization</h2>
<p>A standard test-bed for theoretically studying implicit regularization in deep learning is <em>matrix factorization</em> $-$ matrix completion via linear neural networks.
Wei and I already presented this model in our <a href="http://www.offconvex.org/2019/07/10/trajectories-linear-nets/">previous post</a>, but for self-containedness I will do so again here.</p>
<p>In <em>matrix completion</em>, we are given entries $\{ M_{i, j} : (i, j) \in \Omega \}$ of an unknown matrix $M$, and our job is to recover the remaining entries.
This can be seen as a supervised learning (regression) problem, where the training examples are the observed entries of $M$, the model is a matrix $W$ trained with the loss:
\[
\qquad \ell(W) = \sum\nolimits_{(i, j) \in \Omega} (W_{i, j} - M_{i, j})^2 ~, \qquad\qquad \color{purple}{\text{(1)}}
\]
and generalization corresponds to how similar $W$ is to $M$ in the unobserved locations.
In order for the problem to be well-posed, we have to assume something about $M$ (otherwise the unobserved locations can hold any values, and guaranteeing generalization is impossible).
The standard assumption (which has many <a href="https://en.wikipedia.org/wiki/Matrix_completion#Applications">practical applications</a>) is that $M$ has low rank, meaning the goal is to find, among all global minima of the loss $\ell(W)$, one with minimal rank.
The classic algorithm for achieving this is <a href="https://en.wikipedia.org/wiki/Matrix_norm#Schatten_norms"><em>nuclear norm</em></a> minimization $-$ a convex program which, given enough observed entries and under certain technical assumptions (“incoherence”), recovers $M$ exactly (cf. <a href="https://statweb.stanford.edu/~candes/papers/MatrixCompletion.pdf">Candes and Recht</a>).</p>
<p>Matrix factorization represents an alternative, deep learning approach to matrix completion.
The idea is to use a <em>linear neural network</em> (fully-connected neural network with linear activation), and optimize the resulting objective via gradient descent (GD).
More specifically, rather than working with the loss $\ell(W)$ directly, we choose a depth $L \in \mathbb{N}$, and run GD on the <em>overparameterized objective</em>:
\[
\phi ( W_1 , W_2 , \ldots , W_L ) := \ell ( W_L W_{L - 1} \cdots W_1) ~. ~~\qquad~ \color{purple}{\text{(2)}}
\]
Our solution to the matrix completion problem is then:
\[
\qquad\qquad W_{L : 1} := W_L W_{L - 1} \cdots W_1 ~, \qquad\qquad\qquad \color{purple}{\text{(3)}}
\]
which we refer to as the <em>product matrix</em>.
While (for $L \geq 2$) it is possible to constrain the rank of $W_{L : 1}$ by limiting dimensions of the parameter matrices $\{ W_j \}_j$, from an implicit regularization standpoint, the case of interest is where rank is unconstrained (i.e. dimensions of $\{ W_j \}_j$ are large enough for $W_{L : 1}$ to take on any value).
In this case there is <em>no explicit regularization</em>, and the kind of solution GD will converge to is determined implicitly by the parameterization.
The degenerate case $L = 1$ is obviously uninteresting (nothing is learned in the unobserved locations), but what happens when depth is added ($L \geq 2$)?</p>
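To see the setup concretely, here is a hedged sketch (my own, not from any of the papers discussed) of GD on the overparameterized objective with $L = 2$, unconstrained $d \times d$ factors, near-zero initialization, and a rank-$1$ ground truth; all parameters are arbitrary illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)

d, L = 5, 2
u = rng.standard_normal((d, 1))
M = u @ u.T                        # rank-1 ground truth
mask = rng.random((d, d)) < 0.6    # Omega: which entries are observed

Ws = [0.01 * rng.standard_normal((d, d)) for _ in range(L)]  # near-zero init

def product(Ws):
    P = Ws[0]
    for W in Ws[1:]:
        P = W @ P                  # the product matrix W_L ... W_1
    return P

def grads(Ws):
    R = 2 * mask * (product(Ws) - M)   # derivative of the completion loss wrt the product
    gs = []
    for j in range(L):
        left = np.eye(d)
        for W in Ws[j + 1:]:
            left = W @ left        # product of the factors above j
        right = np.eye(d)
        for W in Ws[:j]:
            right = W @ right      # product of the factors below j
        gs.append(left.T @ R @ right.T)
    return gs

eta = 0.02
for _ in range(30_000):
    Ws = [W - eta * g for W, g in zip(Ws, grads(Ws))]

P = product(Ws)
train_err = np.linalg.norm(mask * (P - M))
svals = np.linalg.svd(P, compute_uv=False)
# GD fits the observed entries, and empirically the product matrix's spectrum
# tends to be dominated by its top singular value, i.e. the solution is near low rank
print(train_err, svals)
```

Nothing here explicitly penalizes rank or norm; whatever bias shows up in the spectrum of the product matrix comes purely from the parameterization and the near-zero starting point, which is exactly the phenomenon the conjectures below try to formalize.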
<p>In their <a href="https://papers.nips.cc/paper/2017/file/58191d2a914c6dae66371c9dcdc91b41-Paper.pdf">NeurIPS 2017 paper</a>, Gunasekar et al. showed empirically that with depth $L = 2$, if GD is run with small learning rate starting from near-zero initialization, then the implicit regularization in matrix factorization tends to produce low-rank solutions (yielding good generalization under the standard assumption of $M$ having low rank).
They conjectured that behind the scenes, what takes place is the classic nuclear norm minimization algorithm:</p>
<blockquote>
<p><strong>Conjecture 1 (<a href="https://papers.nips.cc/paper/7195-implicit-regularization-in-matrix-factorization.pdf">Gunasekar et al. 2017</a>; informally stated):</strong>
GD (with small learning rate and near-zero initialization) over a depth $L = 2$ matrix factorization finds solution with minimum nuclear norm.</p>
</blockquote>
<p>Moreover, they were able to prove the conjecture in a certain restricted setting, and others (e.g. <a href="http://proceedings.mlr.press/v75/li18a/li18a.pdf">Li et al. 2018</a>) later derived proofs for additional specific cases.</p>
<p>Two years after Conjecture 1 was made, in a <a href="https://papers.nips.cc/paper/2019/file/c0c783b5fc0d7d808f1d14a6e9c8280d-Paper.pdf">NeurIPS 2019 paper</a> with Sanjeev, Wei and Yuping Luo, we presented empirical and theoretical evidence (see <a href="http://www.offconvex.org/2019/07/10/trajectories-linear-nets/">previous blog post</a> for details) which led us to hypothesize the opposite, namely, that for any depth $L \geq 2$, the implicit regularization in matrix factorization can <em>not</em> be described as minimization of a norm:</p>
<blockquote>
<p><strong>Conjecture 2 (<a href="https://papers.nips.cc/paper/2019/file/c0c783b5fc0d7d808f1d14a6e9c8280d-Paper.pdf">Arora et al. 2019</a>; informally stated):</strong>
Given a depth $L \geq 2$ matrix factorization, for any norm $\|{\cdot}\|$, there exist matrix completion tasks on which GD (with small learning rate and near-zero initialization) finds solution that does not minimize $\|{\cdot}\|$.</p>
</blockquote>
<p>Due to technical subtleties in their formal statements, Conjectures 1 and 2 do not necessarily contradict each other.
However, they represent opposite views on the question of whether or not norms can explain implicit regularization in matrix factorization.
The goal of my recent work with <a href="https://noamrazin.github.io/">Noam</a> was to resolve this open question.</p>
<h2 id="implicit-regularization-can-drive-all-norms-to-infinity">Implicit regularization can drive all norms to infinity</h2>
<p>The main result in our <a href="https://arxiv.org/pdf/2005.06398.pdf">paper</a> is a proof that there exist simple matrix completion settings where the implicit regularization in matrix factorization drives <strong><em>all norms towards infinity</em></strong>.
By this we affirm Conjecture 2, and in fact go beyond it in the following sense:
<em>(i)</em> not only is each norm disqualified by some setting, but there are actually settings that jointly disqualify all norms;
and
<em>(ii)</em> not only are norms not necessarily minimized, but they can grow towards infinity.</p>
<p>The idea behind our analysis is remarkably simple.
We prove the following:</p>
<blockquote>
<p><strong>Theorem (informally stated):</strong>
During GD over matrix factorization (i.e. over $\phi ( W_1 , W_2 , \ldots , W_L)$ defined by Equations $\color{purple}{\text(1)}$ and $\color{purple}{\text(2)}$), if the learning rate is sufficiently small and the initialization sufficiently close to the origin, then the determinant of the product matrix $W_{L : 1}$ (Equation $\color{purple}{\text(3)}$) doesn’t change sign.</p>
</blockquote>
<p>A corollary is that if $\det ( W_{L : 1} )$ is positive at initialization (an event whose probability is $0.5$ under any reasonable initialization scheme), then it stays that way throughout.
This seemingly benign observation has far-reaching implications.
As a simple example, consider the following matrix completion problem ($*$ here stands for unobserved entry):
[
\qquad\qquad
\begin{pmatrix}
* & 1 \newline
1 & 0
\end{pmatrix}
~. \qquad\qquad \color{purple}{\text{(4)}}
]
Every solution to this problem, i.e. every matrix that agrees with its observations, must have determinant $-1$.
It is therefore only logical to expect that when solving the problem using matrix factorization, the determinant of the product matrix $W_{L : 1}$ will converge to $-1$.
On the other hand, we know that (with probability $0.5$ over initialization) $\det ( W_{L : 1} )$ is always positive, so what is going on?
This conundrum can only mean one thing $-$ as $W_{L : 1}$ fits the observations, its value in the unobserved location (i.e. $(W_{L : 1})_{11}$) diverges to infinity, which implies that <em>all norms grow to infinity!</em></p>
<p>The above idea goes way beyond the simple example given in Equation $\color{purple}{\text(4)}$.
We use it to prove that in a wide array of matrix completion settings, the implicit regularization in matrix factorization leads norms to <em>increase</em>.
We also demonstrate it empirically, showing that in such settings unobserved entries grow during optimization.
Here’s the result of an experiment with the setting of Equation $\color{purple}{\text(4)}$:</p>
<div style="text-align:center;">
<img style="width:300px;" src="http://www.offconvex.org/assets/reg_dl_not_norm_mf_exp.png" />
<br />
<i><b>Figure 1:</b>
Solving matrix completion problem defined by Equation $\color{purple}{\text(4)}$ using matrix factorization leads absolute value of unobserved entry to increase (which in turn means norms increase) as loss decreases.
</i>
</div>
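The Figure 1 experiment can be sketched in a few lines (hypothetical learning rate and initialization; GD on a depth-2 factorization of the problem in Equation $\color{purple}{\text(4)}$):

```python
import numpy as np

# Sketch of the Figure 1 experiment (hypothetical hyperparameters): GD on a
# depth-2 factorization of Equation (4). Observed entries: (0,1)=1, (1,0)=1,
# (1,1)=0; entry (0,0) is unobserved.
obs = {(0, 1): 1.0, (1, 0): 1.0, (1, 1): 0.0}

def grad_loss(P):                       # gradient of the squared loss w.r.t. P
    G = np.zeros((2, 2))
    for (i, j), v in obs.items():
        G[i, j] = P[i, j] - v
    return G

W1 = W2 = 0.1 * np.eye(2)               # near-zero init with det(W2 @ W1) > 0
lr, dets, unobs = 0.005, [], []
for _ in range(10000):
    P = W2 @ W1
    dets.append(np.linalg.det(P))       # track sign of the determinant
    unobs.append(abs(P[0, 0]))          # track the unobserved entry
    G = grad_loss(P)
    W1, W2 = W1 - lr * (W2.T @ G), W2 - lr * (G @ W1.T)

print(min(dets) > 0, unobs[-1] > unobs[0])  # det keeps its sign; entry grows
```

As the loss decreases, the determinant stays positive while the unobserved entry (and hence every norm) keeps growing.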
<h2 id="what-is-happening-then">What is happening then?</h2>
<p>If the implicit regularization in matrix factorization is not minimizing a norm, what is it doing?
While a complete theoretical characterization is still lacking, there are signs that a potentially useful interpretation is <strong><em>minimization of rank</em></strong>.
In our aforementioned <a href="https://papers.nips.cc/paper/2019/file/c0c783b5fc0d7d808f1d14a6e9c8280d-Paper.pdf">NeurIPS 2019 paper</a>, we derived a dynamical characterization (and showed supporting experiments) suggesting that matrix factorization is implicitly conducting some kind of greedy low-rank search (see <a href="http://www.offconvex.org/2019/07/10/trajectories-linear-nets/">previous blog post</a> for details).
This phenomenon actually facilitated a new autoencoding architecture suggested in a recent <a href="https://arxiv.org/pdf/2010.00679.pdf">empirical paper</a> (to appear at NeurIPS 2020) by Yann LeCun and his team at Facebook AI.
Going back to the example in Equation $\color{purple}{\text(4)}$, notice that in this matrix completion problem all solutions have rank $2$, but it is possible to essentially minimize rank to $1$ by taking (absolute value of) unobserved entry to infinity.
As we’ve seen, this is exactly what the implicit regularization in matrix factorization does!</p>
<p>Intrigued by the rank minimization viewpoint, <a href="https://noamrazin.github.io/">Noam</a> and I empirically explored an extension of matrix factorization to <em>tensor factorization</em>.
Tensors can be thought of as high dimensional arrays, and they admit natural factorizations similarly to matrices (two dimensional arrays).
We found that on the task of <em>tensor completion</em> (defined analogously to matrix completion $-$ see Equation $\color{purple}{\text(1)}$ and surrounding text), GD on a tensor factorization tends to produce solutions with low rank, where rank is defined in the context of tensors (for a formal definition, and a general intro to tensors and their factorizations, see this <a href="http://www.kolda.net/publication/TensorReview.pdf">excellent survey</a> by Kolda and Bader).
That is, just like in matrix factorization, the implicit regularization in tensor factorization also strives to minimize rank!
Here’s a representative result from one of our experiments:</p>
<div style="text-align:center;">
<img style="width:700px;" src="http://www.offconvex.org/assets/reg_dl_not_norm_tf_exp.png" />
<br />
<i><b>Figure 2:</b>
In analogy with matrix factorization, the implicit regularization of tensor factorization (high dimensional extension) strives to find a low (tensor) rank solution.
Plots show reconstruction error and (tensor) rank of final solution on multiple tensor completion problems differing in the number of observations.
GD over tensor factorization is compared against "linear" method $-$ GD over direct parameterization of tensor initialized at zero (this is equivalent to fitting observations while placing zeros in unobserved locations).
</i>
<br />
<br />
</div>
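The tensor completion setup can be sketched as follows (a hypothetical toy instance, not the paper's exact experiments): GD on a CP factorization of an order-3 tensor, with the residual computed on observed entries only.

```python
import numpy as np

# Sketch (hypothetical toy setup): GD on a CP (CANDECOMP/PARAFAC) tensor
# factorization for completing a 4x4x4 tensor of tensor rank 1 from roughly
# half of its entries.
rng = np.random.default_rng(0)
a = np.array([1.0, 2.0, 1.0, 0.5])
T = np.einsum('i,j,k->ijk', a, a, a)           # ground truth, tensor rank 1
mask = rng.random((4, 4, 4)) < 0.5             # observed locations

R = 3                                          # 3 CP components (overparameterized)
A, B, C = (0.2 * rng.standard_normal((4, R)) for _ in range(3))
lr = 0.01
for _ in range(30000):
    That = np.einsum('ir,jr,kr->ijk', A, B, C)
    G = np.where(mask, That - T, 0.0)          # residual on observed entries only
    gA = np.einsum('ijk,jr,kr->ir', G, B, C)
    gB = np.einsum('ijk,ir,kr->jr', G, A, C)
    gC = np.einsum('ijk,ir,jr->kr', G, A, B)
    A, B, C = A - lr * gA, B - lr * gB, C - lr * gC

# "Effective" CP rank: how many components carry non-negligible weight
w = np.linalg.norm(A, axis=0) * np.linalg.norm(B, axis=0) * np.linalg.norm(C, axis=0)
print(np.sum(w > 1e-2 * w.max()))              # tends to be small (low tensor rank)
```

From small initialization, GD fits the observations while typically leaving most CP components with negligible weight, in line with the implicit rank minimization described above.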
<p>So what can tensor factorizations tell us about deep learning?
It turns out that, similarly to how matrix factorizations correspond to prediction of matrix entries via linear neural networks, tensor factorizations can be seen as prediction of tensor entries with a certain type of <em>non-linear</em> neural networks, named <em>convolutional arithmetic circuits</em> (in my PhD I worked a lot on analyzing the expressive power of these models, as well as showing that they work well in practice $-$ see this <a href="https://arxiv.org/pdf/1705.02302.pdf">survey</a> for a soft overview).</p>
<div style="text-align:center;">
<img style="width:900px;" src="http://www.offconvex.org/assets/reg_dl_not_norm_mf_lnn_tf_cac.png" />
<br />
<i><b>Figure 3:</b>
The equivalence between matrix factorizations and linear neural networks extends to an equivalence between tensor factorizations and a certain type of non-linear neural networks named convolutional arithmetic circuits.
</i>
<br />
<br />
</div>
<p>Analogously to how the input-output mapping of a linear neural network can be thought of as a matrix, that of a convolutional
arithmetic circuit is naturally represented by a tensor.
The experiment reported in Figure 2 (and similar ones presented in <a href="https://arxiv.org/pdf/2005.06398.pdf">our paper</a>) thus provides a second example of a neural network architecture whose implicit regularization strives to lower a notion of rank for its input-output mapping.
This leads us to believe that implicit rank minimization may be a general phenomenon, and developing notions of rank for input-output mappings of contemporary models may be key to explaining generalization in deep learning.</p>
<p><a href="http://www.cohennadav.com/">Nadav Cohen</a></p>
Fri, 27 Nov 2020 01:00:00 -0800
http://offconvex.github.io/2020/11/27/reg_dl_not_norm/
How to allow deep learning on your data without revealing the data<p>Today’s online world and the emerging internet of things are built around a Faustian bargain: consumers (and their internet of things) hand over their data, and in return get customization of the world to their needs. Is this exchange of privacy for convenience inherent? At first sight one sees no way around it because, of course, to allow machine learning on our data we have to hand our data over to the training algorithm.</p>
<p>Similar issues arise in settings other than consumer devices. For instance, hospitals may wish to pool together their patient data to train a large deep model. But privacy laws such as HIPAA forbid them from sharing the data itself, so somehow they have to train a deep net on their data without revealing their data. Frameworks such as Federated Learning (<a href="https://arxiv.org/abs/1610.05492">Konečný et al., 2016</a>) have been proposed for this but it is known that sharing gradients in that environment leaks a lot of information about the data (<a href="https://arxiv.org/abs/1906.08935">Zhu et al., 2019</a>).</p>
<p>Methods to achieve this could completely change the privacy/utility tradeoffs implicit in today’s organization of the online world.</p>
<p>This blog post discusses the current set of solutions, how they don’t quite suffice for above questions, and the story of a new solution, <a href="http://arxiv.org/abs/2010.02772">InstaHide</a>, that we proposed, and takeaways from a recent attack on it by Carlini et al.</p>
<h2 id="existing-solutions-in-cryptography">Existing solutions in Cryptography</h2>
<p>Classic solutions in cryptography do allow you to in principle outsource any computation to the cloud without revealing your data. (A modern method is Fully Homomorphic Encryption.) Adapting these ideas to machine learning presents two major obstacles: (a) (serious issue) huge computational overhead, which essentially rules it out for today’s large scale deep models (b) (less serious issue) need for special setups —e.g., requiring every user to sign up for public-key encryption.</p>
<p>Significant research efforts are being made to try to overcome these obstacles and we won’t survey them here.</p>
<h2 id="differential-privacy-dp">Differential Privacy (DP)</h2>
<p>Differential privacy (<a href="https://www.iacr.org/archive/eurocrypt2006/40040493/40040493.pdf">Dwork et al., 2006</a>, <a href="https://www.cis.upenn.edu/~aaroth/Papers/privacybook.pdf">Dwork&Roth, 2014</a>) involves adding carefully calculated amounts of noise during training. This is a modern and rigorous version of classic <em>data anonymization</em> techniques whose canonical application is release of noised census data to protect privacy of individuals.</p>
<p>This notion was adapted to machine learning by positing that “privacy” in machine learning refers to trained classifiers not depending on the data of individuals. In other words, if the classifier is trained on data from N individuals, its behavior should be essentially unchanged (statistically speaking) if we omit data from any one individual. Note that this is a weak notion of privacy: it does not in any way hide the data from the company.</p>
<p>Many tech companies have adopted differential privacy in deployed systems but the following two caveats are important.</p>
<blockquote>
<p>(Caveat 1): In deep learning applications, DP’s provable guarantees are very weak.</p>
</blockquote>
<p>Applying DP to deep learning involves noticing that the gradient computation amounts to adding gradients of the loss corresponding to individual data points, and that adding noise to those individual gradients in calculated doses can help make the overall classifier limit its dependence on the individual’s datapoint.</p>
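The clip-then-noise step just described can be sketched as follows (a simplified sketch in the spirit of DP-SGD; the clipping norm <code>C</code>, noise multiplier <code>sigma</code>, and helper name are hypothetical, and real implementations also do privacy accounting):

```python
import numpy as np

# Simplified sketch of a DP-style noisy gradient step: clip each per-example
# gradient to norm C, average, add Gaussian noise. (Hypothetical constants;
# a real DP-SGD implementation also tracks a privacy budget.)
def dp_sgd_step(w, per_example_grads, lr=0.1, C=1.0, sigma=1.0, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    clipped = [g / max(1.0, np.linalg.norm(g) / C) for g in per_example_grads]
    avg = np.mean(clipped, axis=0)
    noise = (sigma * C / len(clipped)) * rng.standard_normal(w.shape)
    return w - lr * (avg + noise)      # descend on the privatized gradient

w = np.zeros(3)
grads = [np.array([3.0, 0.0, 0.0]), np.array([0.0, 0.5, 0.0])]
w = dp_sgd_step(w, grads, rng=np.random.default_rng(0))
print(w)  # a noisy update whose dependence on each example is bounded
```

Clipping bounds any single example's influence on the update, and the added noise masks what remains.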
<p>In practice provable bounds require adding so much gradient noise that accuracy of the trained classifier plummets. We do not know of any successful training that achieved accuracy > 75 percent on CIFAR10 (or any that achieved accuracy even 10 percent on ImageNet). Furthermore, achieving this level of accuracy involves <strong>pretraining</strong> the classifier model on a large set of <strong>public</strong> images and then using the private/protected images only to fine-tune the parameters.</p>
<p>Thus it is no surprise that firms today usually apply DP with a very low noise level, which gives essentially no guarantees. Which brings us to:</p>
<blockquote>
<p>(Caveat 2): DP’s guarantees (and even weaker guarantees applying to deployment scenarios) possibly act as a fig leaf that allows firms to not address the kinds of privacy violations that the person on the street actually worries about.</p>
</blockquote>
<p>DP’s provable guarantee (which as noted, does not hold in deployed systems due to the low noise level used) would only ensure that a deployed ML software that was trained with data from tens of millions of users will not change its behavior depending upon private information of any single user.</p>
<p>But that threat model would seem remote to the person on the street. The privacy issue they worry about more is that copious amounts of our data are continuously collected/stored/mined/sold, often by entities we do not even know about. While lax regulation is primarily to blame, there is also the technical hurdle that there is no <strong>practical way</strong> for consumers to hide their data while at the same time benefiting from customized ML solutions that improve their lives.</p>
<p>Which brings us to the question we started with: <em>Could consumers allow machine learning to be done on their data without revealing their data?</em></p>
<h2 id="a-proposed-solution-instahide">A proposed solution: InstaHide</h2>
<p>InstaHide is a new concept: it hides or “encrypts” images to protect them somewhat, while still allowing standard deep learning pipelines to be applied on them. The deep model is trained entirely on encrypted images.</p>
<ul>
<li>
<p>The training speed and accuracy is only slightly worse than vanilla training: one can achieve a test accuracy of ~ 90 percent on CIFAR10 using encrypted images with a computation overhead $< 5$ percent.</p>
</li>
<li>
<p>When it comes to privacy, like every other form of cryptography, its security is based upon conjectured difficulty of the underlying computational problem.
(But we don’t expect breaking it to be as difficult as say breaking RSA.)</p>
</li>
</ul>
<h3 id="how-instahide-encryption-works">How InstaHide encryption works</h3>
<p>Here are some details. InstaHide belongs to the class of subset-sum type encryptions (<a href="https://www.cs.cmu.edu/afs/cs/user/dwoodruf/www/biwx.pdf">Bhattacharyya et al., 2011</a>), and was inspired by a data augmentation technique called Mixup (<a href="https://arxiv.org/abs/1710.09412">Zhang et al., 2018</a>). It views images as vectors of pixel values. With vectors you can take linear combinations. The figure below shows the result of a typical Mixup: adding 0.6 times the bird image to 0.4 times the airplane image. The image labels can also be treated as one-hot vectors, and they are mixed using the same coefficients in front of the image samples.</p>
<p style="text-align:center;">
<img src="/assets/mixup.png" width="60%" />
</p>
<p>To encrypt the bird image, InstaHide does mixup (i.e., combination with nonnegative coefficients) with one other randomly chosen training image, and with two other images chosen randomly from a large public dataset like ImageNet. The coefficients 0.6, 0.4, etc. in the figure are also chosen at random. Then it takes this composite image and, for every pixel value, randomly flips the sign. With that, we get the encrypted images and labels. All random choices made in this encryption act as a one-time key that is never re-used to encrypt other images.</p>
<p>InstaHide has a parameter $k$ denoting how many images are mixed; in the picture, we have $k=4$. The figure below shows this encryption mechanism.</p>
<p style="text-align:center;">
<img src="/assets/instahide.png" width="80%" />
</p>
<p>When plugged into a standard deep learning pipeline with a private dataset of $n$ images, InstaHide re-encrypts each image with a fresh random one-time key in every training epoch (say $T$ epochs in total). This gives $n\times T$ encrypted images overall.</p>
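A minimal sketch of the encryption step described above (a hypothetical helper, not the released InstaHide code; drawing the mixing coefficients from a Dirichlet distribution is our assumption, the scheme only requires random nonnegative coefficients):

```python
import numpy as np

# Sketch of the encryption described above (hypothetical helper). Mix k
# images: the private target, one other private image, and k-2 public ones,
# with random nonnegative coefficients (a Dirichlet draw is our assumption),
# then flip each pixel's sign at random. The coefficients and the sign mask
# together form a one-time key.
def instahide_encrypt(x_private, other_private, publics, rng):
    imgs = [x_private, other_private] + list(publics)     # k images in total
    lam = rng.dirichlet(np.ones(len(imgs)))               # random mixing weights
    mixed = sum(l * img for l, img in zip(lam, imgs))
    signs = rng.choice([-1.0, 1.0], size=mixed.shape)     # pixel-wise sign flips
    return signs * mixed, lam                             # lam also mixes labels

rng = np.random.default_rng(0)
x, x2 = rng.random((8, 8)), rng.random((8, 8))            # "private" images
pubs = [rng.random((8, 8)) for _ in range(2)]             # public images; k = 4
enc, lam = instahide_encrypt(x, x2, pubs, rng)
print(enc.shape, lam.round(2))                            # one encrypted image
```

A fresh call with fresh randomness produces each epoch's re-encryption, so no key is ever reused.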
<h3 id="the-security-argument">The security argument</h3>
<p>We conjectured, based upon intuitions from computational complexity of the k-vector-subset-sum problem (citations), that extracting information about the images could take time $N^{k-2}$. Here $N$, the size of the public dataset, can be tens or hundreds of millions, so such an attack might be infeasible for real-life attackers.</p>
<p>We also released a <a href="https://github.com/Hazelsuko07/InstaHide_Challenge">challenge dataset</a> with $k=6, n=100, T=50$ to enable further investigation of InstaHide’s security.</p>
<h2 id="carlini-et-als-recent-attack-on-instahide">Carlini et al.’s recent attack on InstaHide</h2>
<p>Recently, Carlini et al. have shared with us a manuscript with a two-step reconstruction attack (<a href="https://arxiv.org/pdf/2011.05315.pdf">Carlini et al., 2020</a>) against InstaHide.</p>
<p><strong><em>TL;DR: They used 11 hours on Google’s best GPUs to get partial recovery of our 100 challenge encryptions and 120 CPU hours to break the encryption completely. Furthermore, the latter was possible entirely because we used an insecure random number generator, and they used exhaustive search over random seeds.</em></strong></p>
<p>Now the details.</p>
<p>The attack takes $n\times T$ InstaHide-encrypted images as input ($n$ is the size of the private dataset, $T$ is the number of training epochs) and returns a reconstruction of the private dataset. It goes as follows.</p>
<ul>
<li>
<p>Map the $n \times T$ encryptions to the $n$ private images by clustering encryptions of the same private image into a group. This is achieved by first building a graph representing pairwise similarity between encrypted images, and then assigning each encryption to a private image. In their implementation, they train a neural network to annotate pairwise similarity between encryptions.</p>
</li>
<li>
<p>Then, given the encrypted images and the mapping, they solve a nonlinear optimization problem via gradient descent to recover an approximation of the original private dataset.</p>
</li>
</ul>
<p>Using Google’s powerful GPUs, it took them 10 hours to train the neural network for similarity annotation, and about another hour to compute an approximation of our challenge set of $100$ images with $k=6, n=100, T=50$. This gave them vaguely correct images, with large unclear regions and color shifts.</p>
<p>They also proposed a different strategy that exploits a vulnerability in NumPy’s and PyTorch’s random number generators (<em>Aargh; we didn’t use a secure random number generator.</em>) They did a brute-force search over the $2^{32}$ possible initial random seeds, which allowed them to reproduce the randomness used during encryption and thus perform a pixel-perfect reconstruction. As they reported, this attack takes 120 CPU hours (parallelized across 100 cores, it obtains the solution in a little over an hour). We will have this implementation flaw fixed in an updated version.</p>
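To see why a small seed space of an insecure PRNG is fatal, here is a toy version of the seed-search idea (the real attack searched $2^{32}$ seeds; we use a tiny range for speed):

```python
import numpy as np

# Toy illustration of the seed-search attack: if the one-time "key" comes
# from a PRNG with a small seed space, exhaustive search recovers the seed
# (and hence all the encryption randomness) exactly.
secret_seed = 1234
key = np.random.default_rng(secret_seed).random(16)   # randomness used as key

recovered = None
for seed in range(5000):                              # brute-force seed search
    if np.allclose(np.random.default_rng(seed).random(16), key):
        recovered = seed
        break
print(recovered)  # → 1234
```

Once the seed is found, every coefficient and sign flip can be replayed, which is exactly what enables pixel-perfect reconstruction.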
<h3 id="thoughts-on-this-attack">Thoughts on this attack</h3>
<p>Though the attack is clever and impressive, we feel that the long-term take-away is still unclear for several reasons.</p>
<blockquote>
<p>Variants of InstaHide seem to evade the attack.</p>
</blockquote>
<p>The challenge set contained 50 encryptions each of 100 images, which corresponds to using encrypted images for 50 epochs. But as is done in existing settings that use DP, one can pretrain the deep model on non-private images and then fine-tune it with fewer epochs on the private images. Using a pipeline similar to DPSGD (<a href="https://arxiv.org/abs/1607.00133">Abadi et al., 2016</a>), pretraining a ResNet-18 on CIFAR100 (the public dataset) and fine-tuning for $10$ epochs on CIFAR10 (the private dataset) gives an accuracy of 83 percent, still far better than any provable guarantee using DP on this dataset. The Carlini et al. team conceded that their attack probably would not work in this setting.</p>
<p>Similarly, using InstaHide purely at inference time (i.e., using ML instead of training ML) should still be completely secure, since only one encryption of each image is released. The new attack can’t work here at all.</p>
<blockquote>
<p>InstaHide was never intended to be a mission-critical encryption like RSA (which by the way also has no provable guarantees).</p>
</blockquote>
<p>InstaHide is designed to give users and the internet of things a <em>light-weight</em> encryption method that allows them to use machine learning without giving eavesdroppers or servers access to their raw data. There is no other cost-effective alternative to InstaHide for this application. If it takes Google’s powerful computers a few hours to break our challenge set of 100 images, this is not yet a cost-effective attack in the intended settings.</p>
<p>More important, the challenge dataset corresponded to an ambitious form of security, where the encrypted images themselves are released to the world. The more typical application is a Federated Learning (<a href="https://arxiv.org/abs/1610.05492">Konečný et al., 2016</a>) scenario: the adversary observes shared gradients that are computed using encrypted images (he also has access to the trained model). The attacks in this paper do not currently apply to that scenario. This is also the idea in <a href="https://arxiv.org/abs/2010.06053"><strong>TextHide</strong></a>, an adaptation of InstaHide to text data.</p>
<h2 id="takeways">Takeaways</h2>
<p>Users need lightweight encryptions that can be applied in real time to large amounts of data, and yet allow them to benefit from machine learning on the cloud. Methods to do so could completely change the privacy/utility tradeoffs implicitly assumed in today’s tech world.</p>
<p>InstaHide is the only such tool right now, and we now know that it provides moderate security that may be enough for many applications.</p>
<!--
### References
[1] [**InstaHide: Instance-hiding Schemes for Private Distributed Learning**](http://arxiv.org/abs/2010.02772), *Yangsibo Huang, Zhao Song, Kai Li, Sanjeev Arora*, ICML 2020
[2] [**mixup: Beyond Empirical Risk Minimization**](https://arxiv.org/abs/1710.09412), *Hongyi Zhang, Moustapha Cisse, Yann N. Dauphin, David Lopez-Paz*, ICLR 2018
[3] [**An Attack on InstaHide: Is Private Learning Possible with Instance Encoding?**](https://arxiv.org/pdf/2011.05315.pdf) *Nicholas Carlini, Samuel Deng, Sanjam Garg, Somesh Jha, Saeed Mahloujifar, Mohammad Mahmoody, Shuang Song, Abhradeep Thakurta, Florian Tramèr*, arxiv preprint
[4] [**Deep Learning with Differential Privacy**](https://arxiv.org/abs/1607.00133), *Martín Abadi, Andy Chu, Ian Goodfellow, H. Brendan McMahan, Ilya Mironov, Kunal Talwar, Li Zhang*, ACM CCS 2016
[5] [**Federated learning: Strategies for improving communication efficiency**](https://arxiv.org/abs/1610.05492), *Jakub Konečný, H. Brendan McMahan, Felix X. Yu, Peter Richtárik, Ananda Theertha Suresh, Dave Bacon*, NeurIPS Workshop 2016
[6] [**A method for obtaining digital signatures and public-key cryptosystems**](https://people.csail.mit.edu/rivest/Rsapaper.pdf), *R.L. Rivest, A. Shamir, and L. Adleman*, Communications of the ACM 1978
[7] [**Deep leakage from gradients**](https://arxiv.org/abs/1906.08935), *Ligeng Zhu, Zhijian Liu, and Song Han.* Neurips19. -->
Wed, 11 Nov 2020 02:00:00 -0800
http://offconvex.github.io/2020/11/11/instahide/
Mismatches between Traditional Optimization Analyses and Modern Deep Learning<p>You may remember our <a href="http://www.offconvex.org/2020/04/24/ExpLR1/">previous blog post</a> showing that it is possible to do state-of-the-art deep learning with a learning rate that increases exponentially during training. It was meant to be a dramatic illustration that what we learned in optimization classes and books isn’t always a good fit for modern deep learning, specifically, <em>normalized nets</em>, which is our term for nets that use any one of the popular normalization schemes, e.g. <a href="https://arxiv.org/abs/1502.03167">BatchNorm (BN)</a>, <a href="https://arxiv.org/abs/1803.08494">GroupNorm (GN)</a>, <a href="https://arxiv.org/abs/1602.07868">WeightNorm (WN)</a>. Today’s post (based upon <a href="https://arxiv.org/abs/2010.02916">our paper</a> with Kaifeng Lyu at NeurIPS 2020) identifies other surprising incompatibilities between normalized nets and traditional analyses. We hope this will change the way you teach and think about deep learning!</p>
<p>Before diving into the results, we recall that normalized nets are typically trained with weight decay (aka $\ell_2$ regularization). Thus the $t$th iteration of Stochastic Gradient Descent (SGD) is:</p>
\[w_{t+1} \gets (1-\eta_t\lambda)w_t - \eta_t \nabla \mathcal{L}(w_t; \mathcal{B}_t),\]
<p>where $\lambda$ is the weight decay (WD) factor (or $\ell_2$-regularization coefficient), $\eta_t$ the learning rate, $\mathcal{B}_t$ the batch, and $\nabla \mathcal{L}(w_t; \mathcal{B}_t)$ the batch gradient.</p>
<p>As sketched in our previous blog post, under fairly mild assumptions (namely, fixing the top layer during random initialization —which empirically does not hurt final accuracy) the loss function for training such normalized nets is <em>scale invariant</em>, which means $\mathcal{L}(w _ t; \mathcal{B}_ t)=\mathcal{L}(cw _ t; \mathcal{B} _ t)$, $\forall c>0$.</p>
<p>A consequence of scale invariance is that $\nabla_w \mathcal{L} \vert_{w = w_0} = c \nabla_w \mathcal{L} \vert_{w = c w_0}$ and $\nabla^2_w \mathcal{L} \vert_{w = w_0} = c^2 \nabla^2_w \mathcal{L} \vert_{w = c w_0}$, for any $c > 0$.</p>
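These identities are easy to verify numerically on any scale-invariant toy loss (the particular function of the direction below is a hypothetical choice):

```python
import numpy as np

# Numerical check of the gradient identity above for a scale-invariant loss
# L(w) = g(w / ||w||); the particular g here is an arbitrary (hypothetical)
# smooth function of the direction of w.
def L(w):
    u = w / np.linalg.norm(w)
    return np.sin(u[0]) + u[1] ** 2

def num_grad(f, w, eps=1e-6):          # central finite differences
    g = np.zeros_like(w)
    for i in range(len(w)):
        e = np.zeros_like(w)
        e[i] = eps
        g[i] = (f(w + e) - f(w - e)) / (2 * eps)
    return g

w0, c = np.array([0.7, -1.3, 0.4]), 3.0
print(np.allclose(num_grad(L, w0), c * num_grad(L, c * w0), atol=1e-5))  # → True
```

In particular, the gradient scales like $1/\|w\|$: shrinking the norm makes the gradient (and the curvature, which scales like $1/\|w\|^2$) blow up.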
<h2 id="some-conventional-wisdoms-cws">Some Conventional Wisdoms (CWs)</h2>
<p>Now we briefly describe some conventional wisdoms. Needless to say, by the end of this post these will turn out to be very very suspect! Possibly they were OK in earlier days of deep learning, and with shallower nets.</p>
<blockquote>
<p>CW 1) As we reduce LR to zero, optimization dynamic converges to a deterministic path (Gradient Flow) along which training loss strictly decreases.</p>
</blockquote>
<p>Recall that in the traditional analysis of (deterministic) gradient descent, if the LR is smaller than roughly the inverse of the smoothness of the loss function, then each step reduces the loss. SGD, being stochastic, has a distribution over possible paths. But SGD with a very tiny LR can be thought of as full-batch Gradient Descent (GD), which in the limit of infinitesimal step size approaches Gradient Flow (GF).</p>
<p>The above reasoning shows that a very small LR is guaranteed to decrease the loss at least as well as any higher LR can. Of course, in deep learning we care not only about optimization but also about generalization, and here a small LR is believed to hurt.</p>
<blockquote>
<p>CW 2) To achieve the best generalization the LR must be large initially for quite a few epochs.</p>
</blockquote>
<p>This is primarily an empirical finding: using too-small learning rates or too-large batch sizes from the start (all other hyper-parameters being fixed) is known to lead to worse generalization (<a href="https://arxiv.org/pdf/1206.5533.pdf">Bengio, 2012</a>; <a href="https://arxiv.org/abs/1609.04836">Keskar et al., 2017</a>).</p>
<p>A popular explanation for this phenomenon is that the noise in gradient estimation during SGD is beneficial for generalization. (As noted, this noise tends to average out when LR is very small.) Many authors have suggested that the noise helps because it keeps the trajectory away from sharp minima, which are believed to generalize worse, although there is some difference of opinion here (<a href="http://www.bioinf.jku.at/publications/older/3304.pdf">Hochreiter&Schmidhuber, 1997</a>; <a href="https://arxiv.org/abs/1609.04836">Keskar et al., 2017</a>; <a href="https://arxiv.org/abs/1712.09913">Li et al., 2018</a>; <a href="https://arxiv.org/abs/1803.05407">Izmailov et al., 2018</a>; <a href="https://arxiv.org/pdf/1902.00744.pdf">He et al., 2019</a>). <a href="https://arxiv.org/abs/1907.04595">Li et al., 2019</a> also gave an example (a simple two-layer net) where this observation of worse generalization due to small LR is mathematically proved and also experimentally verified.</p>
<blockquote>
<p>CW 3) SGD can be modeled via a Stochastic Differential Equation (SDE) in the continuous-time limit, with a fixed Gaussian noise. Namely, think of SGD as a diffusion process that <strong>mixes</strong> to some Gibbs-like distribution on trained nets.</p>
</blockquote>
<p>This is the usual approach to formal understanding of CW 2 (<a href="https://arxiv.org/abs/1710.06451">Smith&Le, 2018</a>; <a href="https://arxiv.org/abs/1710.11029">Chaudhari&Soatto, 2018</a>; <a href="https://arxiv.org/abs/2004.06977">Shi et al., 2020</a>). The idea is that SGD is gradient descent with a noise term, which has a continuous-time approximation as a diffusion process described as</p>
\[dW_t = - \eta_t \lambda W_t dt - \eta_t \nabla \mathcal{L}(W_t) dt + \eta_t \Sigma_{W_t}^{1/2} dB_t,\]
<p>where $\Sigma_{W_t}$ is the covariance of the stochastic gradient $\nabla \mathcal{L}(w_t; \mathcal{B}_t)$, and $B_t$ denotes Brownian motion of the appropriate dimension. Several works have adopted this SDE view and given some rigorous analysis of the effect of noise.</p>
<p>In this story, SGD turns into a geometric random walk in the landscape, which can in principle explore the landscape more thoroughly, for instance by occasionally making loss-increasing steps. While an appealing view, rigorous analysis is difficult because we lack a mathematical description of the loss landscape. Various papers assume the noise in the SDE is isotropic Gaussian, and then derive an expression for the stationary distribution of the random walk in terms of the familiar Gibbs distribution. This view gives an intuitively appealing explanation of some deep learning phenomena, since the magnitude of noise (which is related to LR and batch size) controls the convergence speed and other properties. For instance, this SDE approximation implies the well-known <em>linear scaling rule</em> (<a href="https://arxiv.org/pdf/1706.02677.pdf">Goyal et al., 2017</a>).</p>
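As a toy illustration of this view, one can discretize such an SDE for a one-dimensional quadratic loss (all constants below are hypothetical, and we treat each SGD iteration as one unit of SDE time):

```python
import numpy as np

# Sketch discretizing the SDE above for a toy quadratic loss L(w) = w^2 / 2,
# with a fixed (hypothetical) isotropic noise scale sigma standing in for
# Sigma_{W_t}^{1/2}.
rng = np.random.default_rng(0)
eta, lam, sigma, T = 0.1, 1e-2, 0.5, 10000
w, samples = 1.0, []
for _ in range(T):
    grad = w                                  # gradient of L(w) = w^2 / 2
    noise = sigma * rng.standard_normal()     # Gaussian gradient noise
    w = w - eta * lam * w - eta * grad + eta * noise
    samples.append(w)
print(np.mean(samples[T // 2:]))  # the iterate mixes around the minimum at 0
```

The iterate does not converge to a point; it hovers in a stationary distribution around the minimum, which is exactly the "mixing" picture the conventional wisdom has in mind.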
<p>Which raises the question: <em>does SGD really behave like a diffusion process that mixes in the loss landscape?</em></p>
<h2 id="conventional-wisdom-challenged">Conventional Wisdom challenged</h2>
<p>We now describe the actual discoveries for normalized nets, which show that the above CWs are quite off.</p>
<blockquote>
<p>(Against CW 1): Full-batch gradient descent $\neq$ gradient flow.</p>
</blockquote>
<p>It’s well known that if the LR is smaller than the inverse of the smoothness, then the trajectory of gradient descent stays close to that of gradient flow. But for normalized networks, the loss function is scale-invariant and thus provably non-smooth (i.e., the smoothness becomes unbounded) around the origin (<a href="https://arxiv.org/abs/1910.07454">Li&Arora, 2019</a>). We show that this non-smoothness issue is very real: it makes training with full-batch GD unstable, and even chaotic, for any nonzero learning rate. This happens empirically, and provably so for some toy losses.</p>
<div style="text-align:center;">
<img style="width:60%;" src="https://www.cs.princeton.edu/~zl4/small_lr_blog_images/additional_blog_image/gd_not_gf.png" />
</div>
<p><strong>Figure 1.</strong> WD makes GD on scale-invariant loss unstable and chaotic.
(a) Toy model with scale-invariant loss $L(x,y) = \frac{x^2}{x^2+y^2}$ (b)(c) Convergence never truly happens for ResNet trained on sub-sampled
CIFAR10 containing 1000 images with full-batch GD (without momentum). ResNet
can easily get to 100% training accuracy but then veers off. When WD is turned off at epoch 30000 it converges.</p>
<p>Note that WD plays a crucial role in this effect: without WD, the parameter norm increases monotonically
(<a href="https://arxiv.org/abs/1812.03981">Arora et al., 2018</a>), which implies SGD moves away from the origin at all times.</p>
<p>Savvy readers might wonder whether using a smaller LR could fix this issue. Unfortunately, getting close to the origin is unavoidable because once the gradient gets small, WD will dominate the dynamics and decrease the norm at a geometric rate, causing the gradient to rise again due to the scale invariance! (This happens so long as the gradient gets arbitrarily small, but not actually zero, as is the case in practice.)</p>
<p>In fact, this is an excellent (and rare) place where early stopping is necessary even for correct optimization of the loss.</p>
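The instability is easy to reproduce in a few lines. Below is a sketch (our own toy illustration, with hypothetical LR and WD values) of GD + WD on the scale-invariant loss $L(x,y)=\frac{x^2}{x^2+y^2}$ from Figure 1(a): WD shrinks the norm geometrically, scale invariance makes the gradient grow like $1/\Vert w\Vert$, and the iterates eventually go haywire.

```python
import numpy as np

def loss(w):
    x, y = w
    return x**2 / (x**2 + y**2)

def grad(w):
    x, y = w
    r2 = x**2 + y**2
    return np.array([2 * x * y**2, -2 * x**2 * y]) / r2**2

w = np.array([1.0, 2.0])

# scale invariance: L(cw) = L(w), hence grad(cw) = grad(w) / c
c = 3.0
assert np.isclose(loss(c * w), loss(w))
assert np.allclose(grad(c * w), grad(w) / c)

# GD + WD: WD shrinks the norm geometrically while the loss gradient
# blows up like 1/||w||, so the iterates eventually become unstable
eta, lam = 0.1, 0.1  # hypothetical values, chosen to show the effect quickly
norms, gnorms = [], []
for _ in range(2000):
    norms.append(np.linalg.norm(w))
    gnorms.append(np.linalg.norm(grad(w)))
    w = w - eta * (grad(w) + lam * w)

print(min(norms), max(gnorms))  # norm collapses, gradient norms spike
```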
<blockquote>
<p>(Against CW 2) Small LR can generalize equally well as large LR.</p>
</blockquote>
<p>This actually was a prediction of the new theoretical analysis we came up with. We ran extensive experiments to test this prediction and found that initial large LR is <strong>not necessary</strong> to match the best performance, even when <em>all the other hyperparameters are fixed</em>. See Figure 2.</p>
<div style="text-align:center;">
<img style="width:300px;" src="https://www.cs.princeton.edu/~zl4/small_lr_blog_images/additional_blog_image/blog_sgd_8000_test_acc.png" />
<img style="width:300px;" src="https://www.cs.princeton.edu/~zl4/small_lr_blog_images/additional_blog_image/blog_sgd_8000_train_acc.png" />
</div>
<p><strong>Figure 2</strong>. ResNet trained on CIFAR10 with SGD with the normal LR schedule (baseline) as well as a schedule with a 100 times smaller initial LR. The latter matches the performance of the baseline after one more LR decay! Note that it needs 5000 epochs, about 10x more than the baseline! See our paper for details. (Batch size is 128, WD is 0.0005, and LR is divided by 10 at each decay.)</p>
<p>Note the surprise here is that generalization was not hurt from drastically smaller LR even <em>when no other hyperparameter changes</em>. It is known empirically as well as rigorously (Lemma 2.4 in <a href="https://arxiv.org/abs/1910.07454">Li&Arora, 2019</a>) that it is possible to compensate for small LR by other hyperparameter changes.</p>
<blockquote>
<p>(Against CW 3) The random walk/SDE view of SGD is way off. There is no evidence of mixing as traditionally understood, at least within normal training times.</p>
</blockquote>
<p>Actually, evidence against global mixing already exists via the phenomenon of Stochastic Weight Averaging (SWA) (<a href="https://arxiv.org/abs/1803.05407">Izmailov et al., 2018</a>). Along the trajectory of SGD, if the network parameters from two different epochs are averaged, then the average has lower test loss than either. Improvement via averaging continues to work for run times 10x longer than usual, as shown in Figure 3. However, the accuracy improvement does not happen for SWA between two solutions obtained from different initializations. Thus checking whether SWA helps distinguishes pairs of solutions drawn from the same trajectory from pairs drawn from different trajectories, which shows the diffusion process hasn’t mixed to its stationary distribution within normal training times. (This is not surprising, since the theoretical analysis of mixing does not suggest it happens rapidly at all.)</p>
<div style="text-align:center;">
<img style="width:300px;" src="https://www.cs.princeton.edu/~zl4/small_lr_blog_images/additional_blog_image/swa_sgd_test_acc.png" />
<img style="width:300px;" src="https://www.cs.princeton.edu/~zl4/small_lr_blog_images/additional_blog_image/swa_sgd_dist.png" />
</div>
<p><strong>Figure 3</strong>. Stochastic Weight Averaging improves the test accuracy of ResNet trained with
SGD on CIFAR10. <strong>Left:</strong> Test accuracy. <strong>Right:</strong> Pairwise distance between parameters from different epochs.</p>
<p>Actually <a href="https://arxiv.org/abs/1803.05407">Izmailov et al., 2018</a> already noticed that SWA rules out the view of SGD as a diffusion process that mixes to a unique global equilibrium. They suggested instead that perhaps the trajectory of SGD is well-approximated by a multivariate Ornstein-Uhlenbeck (OU) process around the <em>local minimizer</em> $W^*$, assuming the loss surface is locally strongly convex. As the corresponding stationary distribution is a multivariate Gaussian $N(W^*, \Sigma)$ centered at the local minimizer $W^*$, this explains why SWA helps reduce the loss.</p>
<p>However, we note that <a href="https://arxiv.org/abs/1803.05407">Izmailov et al., 2018</a>’s suggestion is also refuted by the fact that the $\ell_2$ distance between the weights from epochs $T$ and $T+\Delta$ monotonically increases with $\Delta$ for every $T$ (see Figure 3), while $\mathbf{E}[\Vert W_T - W_{T+\Delta}\Vert^2]$ should converge to the constant $2\,\mathrm{Tr}[\Sigma]$ as $T, \Delta \to +\infty$ in the OU process. This suggests that all these weights remain correlated, unlike in the hypothesized OU process.</p>
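To see what the OU prediction says, here is a toy 1D simulation (with made-up parameters of our own): at stationarity, $\mathbf{E}[(W_T - W_{T+\Delta})^2] = 2\sigma^2(1 - e^{-\theta\Delta})$ saturates as $\Delta$ grows, unlike the monotone growth observed in Figure 3.

```python
import numpy as np

rng = np.random.default_rng(0)
theta, s, dt = 1.0, np.sqrt(2.0), 0.01   # OU: dW = -theta * W dt + s dB
n_trials, burn, horizon = 2000, 1000, 1000

W = np.zeros(n_trials)
traj = []
for t in range(burn + horizon):
    W = W - theta * W * dt + s * np.sqrt(dt) * rng.standard_normal(n_trials)
    if t >= burn:
        traj.append(W.copy())
traj = np.array(traj)  # shape (horizon, n_trials); chain is now stationary

# E[(W_T - W_{T+Delta})^2] = 2*sigma^2*(1 - exp(-theta*Delta)), sigma^2 = s^2/(2*theta) = 1
W_T = traj[0]
d2 = {d: np.mean((W_T - traj[d]) ** 2) for d in (50, 500, 999)}
print(d2)  # saturates near 2*sigma^2 = 2 as Delta grows, instead of growing monotonically
```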
<h2 id="so-whats-really-going-on">So what’s really going on?</h2>
<p>We develop a new theory (some parts rigorously proved and others supported by experiments) suggesting that <strong>LR doesn’t play the role assumed in most discussions.</strong></p>
<p>It’s widely believed that LR $\eta$ controls the convergence rate of SGD, and affects generalization by setting the magnitude of the noise, since $\eta$ scales the gradient update at each step.
However, for normalized networks trained with SGD + WD, the effect of LR is more subtle, since it now plays two roles: (1) the multiplier on the gradient of the loss, and (2) the multiplier on WD. Intuitively, one imagines the WD part is useless since the loss function is scale-invariant, so the first role must be the important one. Surprisingly, this intuition is completely wrong: it turns out the second role matters far more than the first.
Further analysis shows that a better measure of the speed of learning is the product $\eta \lambda$, which we call the <em>intrinsic learning rate</em> or <em>intrinsic LR</em>, denoted $\lambda_e$.</p>
<p>While previous papers have noticed qualitatively that LR and WD have a close interaction, our ExpLR paper (<a href="https://arxiv.org/abs/1910.07454">Li&Arora, 2019</a>) gave a mathematical proof that <em>if the product WD $\times$ LR, i.e., $\lambda\eta$, is held fixed, then the effect of changing LR on the dynamics is equivalent to rescaling the initial parameters</em>. As far as we can tell, the performance of SGD on modern architectures is quite robust to (indeed usually independent of) the scale of the initialization, so the effect of changing the initial LR while keeping the intrinsic LR fixed is also negligible.</p>
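This equivalence can be sanity-checked numerically on the toy scale-invariant loss from Figure 1 (a sketch of our own; see Lemma 2.4 of the ExpLR paper for the precise statement). With $\eta'\lambda' = \eta\lambda$ held fixed, GD + WD run at LR $\eta'$ from the initialization rescaled by $c = \sqrt{\eta'/\eta}$ traces exactly $c$ times the original trajectory:

```python
import numpy as np

def grad(w):  # gradient of the scale-invariant loss L(x, y) = x^2 / (x^2 + y^2)
    x, y = w
    r2 = x**2 + y**2
    return np.array([2 * x * y**2, -2 * x**2 * y]) / r2**2

def run_gd_wd(w0, eta, lam, steps):
    w = np.array(w0, dtype=float)
    traj = [w.copy()]
    for _ in range(steps):
        w = w - eta * (grad(w) + lam * w)
        traj.append(w.copy())
    return np.array(traj)

eta1, lam1 = 0.01, 0.04          # hypothetical baseline LR and WD
eta2 = 0.02                      # a different LR...
lam2 = eta1 * lam1 / eta2        # ...with the same intrinsic LR lambda * eta
c = np.sqrt(eta2 / eta1)         # rescale the initialization accordingly

w0 = np.array([1.0, 2.0])
t1 = run_gd_wd(w0, eta1, lam1, 500)
t2 = run_gd_wd(c * w0, eta2, lam2, 500)

# the two trajectories coincide up to the scaling factor c
assert np.allclose(t2, c * t1, rtol=1e-6)
```

The one-line reason: since $\nabla L(cw) = \nabla L(w)/c$ for a scale-invariant loss, the rescaled iterate $w' = cw$ satisfies the same update rule with $\eta' = c^2\eta$ and $\lambda' = \lambda\eta/\eta'$.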
<p>Our paper gives insight into the role of intrinsic LR $\lambda_e$ by giving a new SDE-style analysis of SGD for normalized nets, leading to the following conclusion (which rests in part on experiments):</p>
<blockquote>
<p>In normalized nets SGD does indeed lead to rapid mixing, but in <strong>function space</strong> (i.e., input-output behavior of the net). Mixing happens after $O(1/\lambda_e)$ iterations, in contrast to the exponentially slow mixing guaranteed in the parameter space by traditional analysis of diffusion walks.</p>
</blockquote>
<p>To explain the meaning of mixing in function space, let’s view SGD (carried out for a fixed number of steps) as a way to sample a trained net from a distribution over trained nets. Thus the end result of SGD from a fixed initialization can be viewed as a probabilistic classifier whose output on any datapoint is the $K$-dimensional vector whose $i$th coordinate is the probability of outputting label $i$. (Here $K$ is the total number of labels.) Now if two different initializations both cause SGD to produce classifiers with error $5$ percent on held-out datapoints, then <em>a priori</em> one would imagine that on a given held-out datapoint the classifier from the first distribution <strong>disagrees</strong> with the classifier from the second distribution with roughly $2 \times 5 = 10$ percent probability. (More precisely, $2 \times 5 \times (1-0.05) = 9.5$ percent.) However, convergence to an equilibrium distribution in function space means that the probability of disagreement is almost $0$, i.e., the distribution is almost the same regardless of the initialization! This is indeed what we experimentally find, much to our surprise. Our theory is built around this new phenomenon.</p>
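The back-of-the-envelope disagreement estimate is easy to check by simulation. The sketch below assumes binary labels and independent errors, the simplest setting in which the $2 \times 5 \times (1-0.05) = 9.5$ percent figure is exact:

```python
import numpy as np

rng = np.random.default_rng(0)
n, err = 1_000_000, 0.05

# two classifiers whose 5%-error events are independent of each other
wrong1 = rng.random(n) < err
wrong2 = rng.random(n) < err

# with binary labels, the classifiers disagree iff exactly one of them is wrong
disagree = (wrong1 ^ wrong2).mean()
print(disagree)  # ≈ 2 * 0.05 * (1 - 0.05) = 0.095
```

With $K > 2$ labels two wrong classifiers may also disagree with each other, so 9.5 percent is a slight underestimate there; the point of the experiment in the post is that the observed disagreement is near zero, far below either figure.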
<div style="text-align:center;">
<img style="width:500px;" src="https://www.cs.princeton.edu/~zl4/small_lr_blog_images/additional_blog_image/conjecture.png" />
</div>
<p><strong>Figure 4</strong>: A simple 4-layer normalized CNN trained on MNIST with three different schedules converges to the same equilibrium after the intrinsic LRs become equal at epoch 81. We use Monte Carlo estimation ($500$ trials) to estimate $\ell_1$ distances between distributions.</p>
<p>In the next post, we will explain our new theory and the partial new analysis of SDEs arising from SGD in normalized nets.</p>
Wed, 21 Oct 2020 15:00:00 -0700
http://offconvex.github.io/2020/10/21/intrinsicLR/
http://offconvex.github.io/2020/10/21/intrinsicLR/
Beyond log-concave sampling
<p>As the growing number of posts on this blog would suggest, recent years have seen a lot of progress in understanding optimization beyond convexity. However, optimization is only one of the basic algorithmic primitives in machine learning — it’s used by most forms of risk minimization and model fitting. Another important primitive is sampling, which is used by most forms of inference (i.e. answering probabilistic queries of a learned model).</p>
<p>It turns out that there is a natural analogue of convexity for sampling — <em>log-concavity</em>. Paralleling the state of affairs in optimization, we have a variety of (provably efficient) algorithms for sampling from log-concave distributions, under a variety of access models to the distribution. Log-concavity, however, is very restrictive and cannot model common properties of distributions we frequently wish to sample from in machine learning applications, for example multi-modality and manifold structure in the level sets, which is what we’ll focus on in this and the upcoming post.</p>
<p>Unlike non-convex optimization, the field of sampling beyond log-concavity is very nascent. In this post, we will survey the basic tools and difficulties for sampling beyond log-concavity. In the next post, we will survey recent progress in this direction, in particular with respect to handling multi-modality and manifold structure in the level sets, covering the papers <a href="https://arxiv.org/abs/1812.00793">Simulated tempering Langevin Monte Carlo</a> by Rong Ge, Holden Lee, and Andrej Risteski and <a href="https://arxiv.org/abs/2002.05576">Fast convergence for Langevin diffusion with matrix manifold structure</a> by Ankur Moitra and Andrej Risteski.</p>
<h1 id="formalizing-the-sampling-problem">Formalizing the sampling problem</h1>
<p>The formulation of the sampling problem we will consider is as follows:</p>
<blockquote>
<p><strong>Problem</strong>: Sample from a distribution $p(x) \propto e^{-f(x)}$ given black-box access to $f$ and $\nabla f$.</p>
</blockquote>
<p>This formalization subsumes a lot of inference tasks involving different kinds of probabilistic models. We give several common examples:</p>
<p><em>1. Posterior inference</em>: Suppose our data is generated from a model with <em>unknown</em> parameters $\theta$, such that the data-generation process is given by $p(x \mid \theta)$ and we have a prior $p(\theta)$ over the model parameters. Then the <em>posterior distribution</em> $p(\theta \mid x)$, by Bayes’s Rule, is given by</p>
\[p(\theta \mid x) = \frac{p(x \mid \theta)p(\theta)}{p(x)}\propto p(x \mid \theta)p(\theta).\]
<p>A canonical example of this is a <em>noisy inference task</em> where a signal (parametrized by $\theta$ ) is perturbed by noise (as specified by $p(x \mid \theta)$ ).</p>
<p><em>2. Posteriors in latent-variable models</em>: If the data-generation process has a <em>latent (hidden) variable</em> $h$ associated to each data point, such that $h$ has a <em>known</em> prior $p_\theta(h)$ and a <em>known</em> conditional $p_\theta(x \mid h)$, then again by Bayes’s Rule, we have</p>
\[p_\theta(h \mid x) = \frac{p_\theta(x \mid h)p_\theta(h)}{p_\theta(x)}\propto p_\theta(x \mid h)p_\theta(h).\]
<p>In typical latent-variable models, $p_\theta(x \mid h)$ and $p_\theta(h)$ have a simple parametric form, which makes it easy to evaluate $p_\theta(x \mid h)p_\theta(h)$ . Some examples of latent-variable models are mixture models (where $h$ encodes which component a sample came from), topic models (where $h$ denote the topic proportions in a document), and noisy-OR networks (and latent-variable Bayesian belief networks).</p>
<p><em>3. Sampling from energy models</em>: In energy models, the distribution of the data is parametrized as $p(x) \propto \exp(-E(x))$ for some <em>energy</em> function $E(x)$ which is smaller on points in the data distribution. Recent works by <a href="https://arxiv.org/abs/1907.05600">(Song, Ermon 2019)</a> and <a href="https://arxiv.org/abs/1903.08689">(Du, Mordatch 2019)</a> have scaled up the training of these models on images so that the visual quality of the samples they produce is comparable to that of more popular generative models like GANs and flow models.</p>
<p>The “exponential form” $e^{-f(x)}$ is also helpful in making an analogy to optimization. Namely, if we sample from $p(x)\propto e^{-f(x)}$, a particular point $x$ is more likely to be sampled if $f(x)$ is small. The key difference from optimization is that while in optimization we only want to reach the minimum, in sampling we want to pick points with the correct probabilities.</p>
<h1 id="comparison-with-optimization">Comparison with optimization</h1>
<p>The computational hardness landscape for our sampling problem parallels the one for black-box optimization, in which the goal is to find the minimum of a function $f$, given value/gradient oracle access. When $f$ is <em>convex</em>, there is a unique local minimum, so that local search algorithms like <em>gradient descent</em> are efficient. When $f$ is non-convex, gradient descent can get trapped in potentially poor local minima, and in the worst case, an exponential number of queries is needed.</p>
<p>Similarly, for sampling, when $p$ is <em>log-concave</em>, the distribution is unimodal and a Markov chain which is a close relative of gradient descent — <em>Langevin Monte Carlo</em> — is efficient. When $p$ is non-log-concave, Langevin Monte Carlo can get trapped in one of many modes, and in the worst case an exponential number of queries is needed.</p>
<blockquote>
<p>A distribution $p(x)\propto e^{-f(x)}$ is <strong>log-concave</strong> if $f(x) = -\log p(x)$ is convex. It is $\alpha$-strongly log-concave if $f(x)$ is $\alpha$-strongly convex.</p>
</blockquote>
<p>However, such worst-case hardness rarely stops practitioners from trying to solve the non-convex optimization or non-log-concave sampling problems which are ubiquitous in modern machine learning. Often they manage to do so with great success: for instance, in training deep neural networks, gradient descent and its relatives perform quite well. Similarly, Langevin Monte Carlo and its relatives can do quite well on non-log-concave problems, though they sometimes need to be aided by temperature heuristics and other tricks.</p>
<p>As theorists, we’d like to develop theory that will lead to a better understanding of why and when these heuristics work. Just like we’ve done for optimization, we need to be guided both by hardness results and relevant structure of real-world problems in this endeavour.</p>
<p>The following table summarizes the comparisons we have come up with:</p>
<p><img src="http://www.andrew.cmu.edu/user/aristesk/table_opt.jpg" alt="" /></p>
<p>Before we move on to non-log-concave distributions, though, we need to understand the basic algorithm for sampling and its guarantees for log-concave distributions.</p>
<h1 id="langevin-monte-carlo">Langevin Monte Carlo</h1>
<p>Just as gradient descent is the canonical algorithm for optimization, <em>Langevin Monte Carlo</em> (LMC) is the canonical algorithm for our sampling problem. In a nutshell, it is gradient descent that also injects Gaussian noise:</p>
\[\text{Gradient descent:}\quad
x_{t+\eta} = x_t - \eta \nabla f(x_t)\]
\[\text{Langevin Monte Carlo:}\quad
x_{t+\eta} = x_t - \eta \nabla f(x_t) + \sqrt{2\eta}\xi_t,\quad \xi_t\sim N(0,I)\]
<p>Both of these processes can be considered as discretizations of a continuous process. For gradient descent, the limit is an <em>ordinary differential equation</em>, and for Langevin Monte Carlo a <em>stochastic differential equation</em>:</p>
\[\text{Gradient flow:} \quad dx_t = -\nabla f(x_t) dt\]
\[\text{Langevin diffusion:} \quad dx_t = -\nabla f(x_t) dt + \sqrt{2} dB_t\]
<p>where $B_t$ denotes Brownian motion of the appropriate dimension.</p>
<p>The crucial property of the above stochastic differential equation is that under fairly mild assumptions on $f$, the stationary distribution is $p(x) \propto e^{-f(x)}$. (If you’re more comfortable with optimization, note that while gradient descent generally converges to (local) minima, the Gaussian noise term prevents LMC from converging to a single point - rather, it converges to a <em>stationary distribution</em>. See animation below.)</p>
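As a concrete illustration, here is a minimal LMC sampler for the 1D target $f(x) = x^2/2$, i.e. the standard Gaussian; the step size, chain length, and burn-in below are arbitrary choices for the demo, and the empirical moments match the target only up to discretization and Monte Carlo error:

```python
import numpy as np

def lmc(grad_f, x0, eta, steps, rng):
    """Langevin Monte Carlo: a gradient descent step plus Gaussian noise."""
    x = x0
    samples = []
    for _ in range(steps):
        x = x - eta * grad_f(x) + np.sqrt(2 * eta) * rng.standard_normal()
        samples.append(x)
    return np.array(samples)

# target p(x) ∝ e^{-f(x)} with f(x) = x^2 / 2, i.e. the standard Gaussian
rng = np.random.default_rng(0)
samples = lmc(lambda x: x, x0=3.0, eta=0.01, steps=200_000, rng=rng)[5_000:]
print(samples.mean(), samples.var())  # ≈ 0 and ≈ 1, up to discretization bias
```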
<p><img src="http://www.andrew.cmu.edu/user/aristesk/gd_ld_animated.gif" alt="" /></p>
<p>Langevin Monte Carlo fits in the <em>Markov Chain Monte Carlo</em> (MCMC) paradigm: design a random walk, so that the stationary distribution is the desired distribution. “Mixing” means getting close to the stationary distribution, and rapid mixing means this happens quickly.</p>
<p>Like in optimization, Langevin Monte Carlo is the most “basic” algorithm: for example, one can incorporate “acceleration” and obtain <em>underdamped</em> Langevin, or use the physics-inspired Hamiltonian Monte Carlo.</p>
<h1 id="tools-for-bounding-mixing-time-challenges-beyond-log-concavity">Tools for bounding mixing time, challenges beyond log-concavity</h1>
<p>To illustrate the difficulty in moving beyond log-concavity, we’ll describe the tools that are used to prove fast mixing for log-concave distributions, and where they fall short for non-log-concave distributions.</p>
<p>We will do this by an analogy to how we analyze random walks on graphs. One common way to prove rapid mixing of a random walk on a graph is to show the Laplacian has a spectral gap (equivalently, the transition matrix has a gap between the largest and next-to-largest eigenvalue). The analogue of this for Langevin diffusion is showing a <em>Poincaré inequality</em>. (A spectral gap of $1/C$ corresponds to Poincaré constant of $C$.)</p>
<blockquote>
<p>We say that $p(x)$ satisfies a <strong>Poincaré inequality</strong> with constant $C$ if for all functions $g$ on $\mathbb R^d$ (such that $g$ and $\nabla g$ are square-integrable with respect to $p$),</p>
<div> $$\text{Var}_p(g) \le C \int_{\mathbb R^d} ||\nabla g(x)||^2 p(x)\,dx.$$ </div>
</blockquote>
<p>A small constant $C$ implies fast mixing in $\chi^2$ divergence, which implies fast mixing in total variation distance. More precisely, the mixing time for Langevin diffusion is on the order of $C$. We note that other functional inequalities imply mixing with respect to other measures (such as log-Sobolev inequalities for KL divergence).</p>
<p>While it may not be obvious what the Poincaré inequality has to do with a spectral gap, it turns out that we can think of the right-hand side as a quadratic form involving the <em>infinitesimal generator</em> of Langevin process, which functions as the continuous analogue of a Laplacian for a graph random walk.</p>
<p>The following table shows the analogy: we can put the discrete and continuous processes on the same footing by defining a quadratic form called the Dirichlet form from the Laplacian or infinitesimal generator.</p>
<p><img src="http://www.andrew.cmu.edu/user/aristesk/table_mixing.jpg" alt="" /></p>
<p>To see how the Poincaré inequality represents a spectral gap in the discrete case, we write it in a more explicit form in a familiar special case: a lazy random walk (i.e. a random walk that with probability $1/2$ stays in the current vertex, and with probability $1/2$ goes to a random neighbor) on a regular graph with $n$ vertices. In this case, $p$ is the uniform distribution, and $v_1=\mathbf 1,\ldots, v_n$ are the eigenvectors of the transition matrix $A$ with eigenvalues $1=\lambda_1\ge \lambda_2\ge \cdots \ge \lambda_n\ge 0$; normalize $v_1,\ldots, v_n$ so they have unit norm with respect to $p$, i.e. $\Vert v_i\Vert_p^2=\frac 1n\sum_j v_{ij}^2=1$.</p>
<p>Writing $g= \sum_i a_i v_i$, since $v_2,\ldots, v_n$ are orthogonal to $v_1=\mathbf 1$, we have $\langle g, \mathbf 1\rangle_p = a_1$, so</p>
\[\text{Var}_p(g) = \frac{1}{n}(\sum_i g_i^2) - a_1^2 = \sum_{i=2}^n a_i^2\]
<p>Furthermore, we have</p>
\[\langle g, Lg \rangle_p = \langle \sum_i a_iv_i, (I- A)(\sum_i a_iv_i)\rangle_p= \sum_{i=2}^n a_i^2(1-\lambda_i)\]
<p>The factors $1-\lambda_i$ appearing here are all at least $1-\lambda_2$, i.e. the <em>spectral gap</em>, so</p>
\[\langle g, Lg \rangle_p \ge (1-\lambda_2)\text{Var}_p(g),\]
<p>which shows the Poincaré inequality with constant $(1-\lambda_2)^{-1}$.</p>
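The calculation above is easy to verify numerically; the following small check (our own illustration) builds the lazy random walk on an $n$-cycle, computes its spectral gap, and confirms the inequality $\langle g, Lg\rangle_p \ge (1-\lambda_2)\,\text{Var}_p(g)$ for random test functions $g$:

```python
import numpy as np

n = 8
# lazy random walk on the n-cycle: stay with prob. 1/2, else move to a random neighbor
A = np.zeros((n, n))
for i in range(n):
    A[i, i] = 0.5
    A[i, (i + 1) % n] = 0.25
    A[i, (i - 1) % n] = 0.25

eigs = np.sort(np.linalg.eigvalsh(A))[::-1]  # eigenvalues, largest first
gap = 1 - eigs[1]                            # spectral gap 1 - lambda_2
# for the lazy cycle the gap is known in closed form: (1 - cos(2*pi/n)) / 2
assert abs(gap - (0.5 - 0.5 * np.cos(2 * np.pi / n))) < 1e-10

# check the Poincaré inequality <g, Lg>_p >= gap * Var_p(g) for random g
L = np.eye(n) - A
rng = np.random.default_rng(0)
for _ in range(100):
    g = rng.standard_normal(n)
    var = g.var()              # Var_p(g) under the uniform distribution p
    dirichlet = g @ L @ g / n  # the Dirichlet form <g, Lg>_p, p uniform
    assert dirichlet >= gap * var - 1e-10
```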
<p>A classic theorem establishes a Poincaré inequality for (strongly) log-concave distributions.</p>
<blockquote>
<p><strong>Theorem (Bakry, Emery 1985)</strong>: If $p(x)$ is $\alpha$-strongly log-concave, then $p(x)$ satisfies a Poincaré inequality with constant $\frac1{\alpha}$.</p>
</blockquote>
<p>Hence, for strongly-log-concave distributions, Langevin diffusion mixes rapidly. To complete the picture, a line of recent works, starting with <a href="https://arxiv.org/abs/1412.7392">(Dalalyan 2014)</a> have established bounds for discretization error to obtain algorithmic guarantees for Langevin Monte Carlo.</p>
<p>However, guarantees break down when we don’t assume log-concavity. Generically, algorithms for sampling depend <em>exponentially</em> on the ambient dimension $d$, or on the “size” of the non-log-concave region (e.g., the distance between modes of the distribution). In terms of their dependence on $d$, they are not doing much better than if we split space into cells and sample each according to its probability, similar to “grid search” for optimization. This is unsurprising: we can’t hope for better guarantees without structural assumptions.</p>
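The mode-trapping phenomenon is easy to see numerically. The sketch below (a toy example of our own, not from the papers mentioned above) runs LMC on the bimodal density $p(x) \propto e^{-12(x^2-1)^2}$, which puts half its mass near $x=1$ and half near $x=-1$; a chain started in one mode essentially never crosses the barrier at $x=0$ within this run:

```python
import numpy as np

rng = np.random.default_rng(0)
# bimodal target p(x) ∝ e^{-f(x)} with f(x) = 12*(x^2 - 1)^2:
# modes near ±1, separated by a barrier of height 12 at x = 0
grad_f = lambda x: 48 * x * (x**2 - 1)

x, eta = 1.0, 1e-3  # start inside the right mode
samples = []
for _ in range(100_000):
    x = x - eta * grad_f(x) + np.sqrt(2 * eta) * rng.standard_normal()
    samples.append(x)
samples = np.array(samples)

# by symmetry the true distribution has half its mass at x < 0, but the
# empirical mass stays almost entirely in the starting mode
print((samples > 0).mean())  # close to 1.0 instead of 0.5
```

The expected escape time grows exponentially with the barrier height (a Kramers-type estimate), which is exactly the exponential dependence on the "size" of the non-log-concave region mentioned above.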
<p>Toward this end, in the next blog post we will consider two kinds of structure that allow efficient sampling:</p>
<ol>
<li>Simple multimodal distributions, such as a mixture of gaussians with equal variance.</li>
<li>Manifold structure, arising from symmetries in the level sets of the distribution.</li>
</ol>
Sat, 19 Sep 2020 07:00:00 -0700
http://offconvex.github.io/2020/09/19/beyondlogconvavesampling/
http://offconvex.github.io/2020/09/19/beyondlogconvavesampling/