Off the convex path: Algorithms off the convex path.
http://offconvex.github.io/
Can implicit regularization in deep learning be explained by norms?
<p>This post is based on my <a href="https://arxiv.org/pdf/2005.06398.pdf">recent paper</a> with <a href="https://noamrazin.github.io/">Noam Razin</a> (to appear at NeurIPS 2020), studying the question of whether norms can explain implicit regularization in deep learning.
TL;DR: we argue they cannot.</p>
<h2 id="implicit-regularization--norm-minimization">Implicit regularization = norm minimization?</h2>
<p>Understanding the implicit regularization induced by gradient-based optimization is possibly the biggest challenge facing theoretical deep learning these days.
In classical machine learning we typically regularize via norms, so it seems only natural to hope that in deep learning something similar is happening under the hood, i.e. the implicit regularization strives to find minimal norm solutions.
This is actually the case in the simple setting of overparameterized linear regression $-$ there, by a folklore analysis (cf. <a href="https://openreview.net/pdf?id=Sy8gdB9xx">Zhang et al. 2017</a>), gradient descent (and any other reasonable gradient-based optimizer) initialized at zero is known to converge to the minimal Euclidean norm solution.
A surge of recent works (see <a href="https://arxiv.org/pdf/2005.06398.pdf">our paper</a> for a thorough review) has shown that for various other models an analogous result holds, i.e. gradient descent (when initialized appropriately) converges to solutions that minimize a certain (model-dependent) norm.
On the other hand, as discussed last year in posts by <a href="http://www.offconvex.org/2019/06/03/trajectories/">Sanjeev</a> as well as <a href="http://www.offconvex.org/2019/07/10/trajectories-linear-nets/">Wei and myself</a>, mounting theoretical and empirical evidence suggests that it may not be possible to generally describe implicit regularization in deep learning as minimization of norms.
Which is it then?</p>
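To see the folklore linear-regression result concretely, here is a minimal numpy sketch (ours; the matrix and hyperparameters are arbitrary illustrative choices): gradient descent on an underdetermined least-squares problem, initialized at zero, converges to the minimal Euclidean norm solution, which coincides with the pseudo-inverse solution.

```python
import numpy as np

# Underdetermined least squares: more unknowns (d=3) than equations (n=2).
X = np.array([[1.0, 2.0, 3.0],
              [0.0, 1.0, 1.0]])
y = np.array([1.0, 2.0])

# Gradient descent on ||Xw - y||^2, initialized at zero.
# Updates stay in the row space of X, which forces the min-norm solution.
w = np.zeros(3)
lr = 0.01
for _ in range(50000):
    w -= lr * 2 * X.T @ (X @ w - y)

# The minimal Euclidean norm solution, via the pseudo-inverse.
w_min_norm = np.linalg.pinv(X) @ y
```

The two solutions agree to numerical precision, even though the problem has infinitely many global minima.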
<h2 id="a-standard-test-bed-matrix-factorization">A standard test-bed: matrix factorization</h2>
<p>A standard test-bed for theoretically studying implicit regularization in deep learning is <em>matrix factorization</em> $-$ matrix completion via linear neural networks.
Wei and I already presented this model in our <a href="http://www.offconvex.org/2019/07/10/trajectories-linear-nets/">previous post</a>, but for self-containedness I will do so again here.</p>
<p>In <em>matrix completion</em>, we are given entries $\{ M_{i, j} : (i, j) \in \Omega \}$ of an unknown matrix $M$, and our job is to recover the remaining entries.
This can be seen as a supervised learning (regression) problem, where the training examples are the observed entries of $M$, the model is a matrix $W$ trained with the loss:
[
\qquad \ell(W) = \sum\nolimits_{(i, j) \in \Omega} (W_{i, j} - M_{i, j})^2 ~, \qquad\qquad \color{purple}{\text{(1)}}
]
and generalization corresponds to how similar $W$ is to $M$ in the unobserved locations.
In order for the problem to be well-posed, we have to assume something about $M$ (otherwise the unobserved locations can hold any values, and guaranteeing generalization is impossible).
The standard assumption (which has many <a href="https://en.wikipedia.org/wiki/Matrix_completion#Applications">practical applications</a>) is that $M$ has low rank, meaning the goal is to find, among all global minima of the loss $\ell(W)$, one with minimal rank.
The classic algorithm for achieving this is <a href="https://en.wikipedia.org/wiki/Matrix_norm#Schatten_norms"><em>nuclear norm</em></a> minimization $-$ a convex program which, given enough observed entries and under certain technical assumptions (“incoherence”), recovers $M$ exactly (cf. <a href="https://statweb.stanford.edu/~candes/papers/MatrixCompletion.pdf">Candes and Recht</a>).</p>
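The two objects in play — the observed-entries loss of Equation $\color{purple}{\text(1)}$ and the nuclear norm — are easy to write down. Here is a small numpy sketch (with a hypothetical rank-1 ground truth and observation set of our choosing):

```python
import numpy as np

# Hypothetical 3x3 rank-1 ground truth, with a few observed entries.
M = np.outer([1.0, 2.0, 3.0], [1.0, 0.5, 2.0])
omega = [(0, 0), (0, 2), (1, 1), (2, 0), (2, 2)]   # observed locations

def loss(W):
    # Equation (1): squared error over observed entries only.
    return sum((W[i, j] - M[i, j]) ** 2 for i, j in omega)

def nuclear_norm(W):
    # Sum of singular values -- the norm minimized by the convex approach.
    return np.linalg.svd(W, compute_uv=False).sum()
```

The ground truth itself attains zero loss, and being rank 1, its nuclear norm equals its single nonzero singular value.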
<p>Matrix factorization represents an alternative, deep learning approach to matrix completion.
The idea is to use a <em>linear neural network</em> (fully-connected neural network with linear activation), and optimize the resulting objective via gradient descent (GD).
More specifically, rather than working with the loss $\ell(W)$ directly, we choose a depth $L \in \mathbb{N}$, and run GD on the <em>overparameterized objective</em>:
[
\phi ( W_1 , W_2 , \ldots , W_L ) := \ell ( W_L W_{L - 1} \cdots W_1) ~. ~~\qquad~ \color{purple}{\text{(2)}}
]
Our solution to the matrix completion problem is then:
[
\qquad\qquad W_{L : 1} := W_L W_{L - 1} \cdots W_1 ~, \qquad\qquad\qquad \color{purple}{\text{(3)}}
]
which we refer to as the <em>product matrix</em>.
While (for $L \geq 2$) it is possible to constrain the rank of $W_{L : 1}$ by limiting dimensions of the parameter matrices $\{ W_j \}_j$, from an implicit regularization standpoint, the case of interest is where rank is unconstrained (i.e. dimensions of $\{ W_j \}_j$ are large enough for $W_{L : 1}$ to take on any value).
In this case there is <em>no explicit regularization</em>, and the kind of solution GD will converge to is determined implicitly by the parameterization.
The degenerate case $L = 1$ is obviously uninteresting (nothing is learned in the unobserved locations), but what happens when depth is added ($L \geq 2$)?</p>
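The overparameterized objective of Equation $\color{purple}{\text(2)}$ is simple to optimize directly. Below is a minimal numpy sketch (ours; the toy task, depth, and hyperparameters are illustrative choices) running GD on a depth $L = 3$ factorization, with gradients computed by hand via the chain rule:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical task: rank-1 ground truth with a few unobserved entries.
M = np.outer([1.0, -2.0, 0.5], [1.0, 1.0, -1.0])
mask = np.ones((3, 3))
mask[0, 0] = mask[1, 2] = mask[2, 1] = 0.0   # 0 = unobserved

def loss(P):                      # Equation (1), via an observation mask
    return np.sum(mask * (P - M) ** 2)

def product(factors):             # Equation (3): W_L ... W_1 (factors[0] = W_1)
    P = np.eye(3)
    for W in factors:
        P = W @ P
    return P

L, lr = 3, 0.005
Ws = [rng.normal(scale=0.3, size=(3, 3)) for _ in range(L)]
loss_init = loss(product(Ws))

for _ in range(20000):            # GD on the overparameterized objective (2)
    G = 2 * mask * (product(Ws) - M)   # gradient of (1) at the product matrix
    # chain rule: d(phi)/dW_j = (W_L...W_{j+1})^T G (W_{j-1}...W_1)^T
    grads = [product(Ws[i + 1:]).T @ G @ product(Ws[:i]).T for i in range(L)]
    for W, g in zip(Ws, grads):
        W -= lr * g

loss_final = loss(product(Ws))
```

Note that the dimensions of the $\{ W_j \}_j$ are full, so rank is unconstrained and any regularization is implicit in the parameterization and the optimizer.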
<p>In their <a href="https://papers.nips.cc/paper/2017/file/58191d2a914c6dae66371c9dcdc91b41-Paper.pdf">NeurIPS 2017 paper</a>, Gunasekar et al. showed empirically that with depth $L = 2$, if GD is run with small learning rate starting from near-zero initialization, then the implicit regularization in matrix factorization tends to produce low-rank solutions (yielding good generalization under the standard assumption of $M$ having low rank).
They conjectured that behind the scenes, what takes place is the classic nuclear norm minimization algorithm:</p>
<blockquote>
<p><strong>Conjecture 1 (<a href="https://papers.nips.cc/paper/7195-implicit-regularization-in-matrix-factorization.pdf">Gunasekar et al. 2017</a>; informally stated):</strong>
GD (with small learning rate and near-zero initialization) over a depth $L = 2$ matrix factorization finds a solution with minimum nuclear norm.</p>
</blockquote>
<p>Moreover, they were able to prove the conjecture in a certain restricted setting, and others (e.g. <a href="http://proceedings.mlr.press/v75/li18a/li18a.pdf">Li et al. 2018</a>) later derived proofs for additional specific cases.</p>
<p>Two years after Conjecture 1 was made, in a <a href="https://papers.nips.cc/paper/2019/file/c0c783b5fc0d7d808f1d14a6e9c8280d-Paper.pdf">NeurIPS 2019 paper</a> with Sanjeev, Wei and Yuping Luo, we presented empirical and theoretical evidence (see <a href="http://www.offconvex.org/2019/07/10/trajectories-linear-nets/">previous blog post</a> for details) which led us to hypothesize the opposite, namely, that for any depth $L \geq 2$, the implicit regularization in matrix factorization can <em>not</em> be described as minimization of a norm:</p>
<blockquote>
<p><strong>Conjecture 2 (<a href="https://papers.nips.cc/paper/2019/file/c0c783b5fc0d7d808f1d14a6e9c8280d-Paper.pdf">Arora et al. 2019</a>; informally stated):</strong>
Given a depth $L \geq 2$ matrix factorization, for any norm $\|{\cdot}\|$, there exist matrix completion tasks on which GD (with small learning rate and near-zero initialization) finds a solution that does not minimize $\|{\cdot}\|$.</p>
</blockquote>
<p>Due to technical subtleties in their formal statements, Conjectures 1 and 2 do not necessarily contradict.
However, they represent opposite views on the question of whether or not norms can explain implicit regularization in matrix factorization.
The goal of my recent work with <a href="https://noamrazin.github.io/">Noam</a> was to resolve this open question.</p>
<h2 id="implicit-regularization-can-drive-all-norms-to-infinity">Implicit regularization can drive all norms to infinity</h2>
<p>The main result in our <a href="https://arxiv.org/pdf/2005.06398.pdf">paper</a> is a proof that there exist simple matrix completion settings where the implicit regularization in matrix factorization drives <strong><em>all norms towards infinity</em></strong>.
By this we affirm Conjecture 2, and in fact go beyond it in the following sense:
<em>(i)</em> not only is each norm disqualified by some setting, but there are actually settings that jointly disqualify all norms;
and
<em>(ii)</em> not only are norms not necessarily minimized, but they can grow towards infinity.</p>
<p>The idea behind our analysis is remarkably simple.
We prove the following:</p>
<blockquote>
<p><strong>Theorem (informally stated):</strong>
During GD over matrix factorization (i.e. over $\phi ( W_1 , W_2 , \ldots , W_L)$ defined by Equations $\color{purple}{\text(1)}$ and $\color{purple}{\text(2)}$), if the learning rate is sufficiently small and initialization sufficiently close to the origin, then the determinant of the product matrix $W_{L : 1}$ (Equation $\color{purple}{\text(3)}$) doesn’t change sign.</p>
</blockquote>
<p>A corollary is that if $\det ( W_{L : 1} )$ is positive at initialization (an event whose probability is $0.5$ under any reasonable initialization scheme), then it stays that way throughout.
This seemingly benign observation has far-reaching implications.
As a simple example, consider the following matrix completion problem ($*$ here stands for unobserved entry):
[
\qquad\qquad
\begin{pmatrix}
* & 1 \newline
1 & 0
\end{pmatrix}
~. \qquad\qquad \color{purple}{\text{(4)}}
]
Every solution to this problem, i.e. every matrix that agrees with its observations, must have determinant $-1$.
It is therefore only logical to expect that when solving the problem using matrix factorization, the determinant of the product matrix $W_{L : 1}$ will converge to $-1$.
On the other hand, we know that (with probability $0.5$ over initialization) $\det ( W_{L : 1} )$ is always positive, so what is going on?
This conundrum can only mean one thing $-$ as $W_{L : 1}$ fits the observations, its value in the unobserved location (i.e. $(W_{L : 1})_{11}$) diverges to infinity, which implies that <em>all norms grow to infinity!</em></p>
<p>The above idea goes way beyond the simple example given in Equation $\color{purple}{\text(4)}$.
We use it to prove that in a wide array of matrix completion settings, the implicit regularization in matrix factorization leads norms to <em>increase</em>.
We also demonstrate it empirically, showing that in such settings unobserved entries grow during optimization.
Here’s the result of an experiment with the setting of Equation $\color{purple}{\text(4)}$:</p>
<div style="text-align:center;">
<img style="width:300px;" src="http://www.offconvex.org/assets/reg_dl_not_norm_mf_exp.png" />
<br />
<i><b>Figure 1:</b>
Solving the matrix completion problem defined by Equation $\color{purple}{\text(4)}$ using matrix factorization leads the absolute value of the unobserved entry to increase (which in turn means all norms increase) as the loss decreases.
</i>
</div>
<h2 id="what-is-happening-then">What is happening then?</h2>
<p>If the implicit regularization in matrix factorization is not minimizing a norm, what is it doing?
While a complete theoretical characterization is still lacking, there are signs that a potentially useful interpretation is <strong><em>minimization of rank</em></strong>.
In our aforementioned <a href="https://papers.nips.cc/paper/2019/file/c0c783b5fc0d7d808f1d14a6e9c8280d-Paper.pdf">NeurIPS 2019 paper</a>, we derived a dynamical characterization (and showed supporting experiments) suggesting that matrix factorization is implicitly conducting some kind of greedy low-rank search (see <a href="http://www.offconvex.org/2019/07/10/trajectories-linear-nets/">previous blog post</a> for details).
This phenomenon actually facilitated a new autoencoding architecture suggested in a recent <a href="https://arxiv.org/pdf/2010.00679.pdf">empirical paper</a> (to appear at NeurIPS 2020) by Yann LeCun and his team at Facebook AI.
Going back to the example in Equation $\color{purple}{\text(4)}$, notice that in this matrix completion problem all solutions have rank $2$, but it is possible to essentially minimize rank to $1$ by taking (absolute value of) unobserved entry to infinity.
As we’ve seen, this is exactly what the implicit regularization in matrix factorization does!</p>
<p>Intrigued by the rank minimization viewpoint, <a href="https://noamrazin.github.io/">Noam</a> and I empirically explored an extension of matrix factorization to <em>tensor factorization</em>.
Tensors can be thought of as high dimensional arrays, and they admit natural factorizations similarly to matrices (two dimensional arrays).
We found that on the task of <em>tensor completion</em> (defined analogously to matrix completion $-$ see Equation $\color{purple}{\text(1)}$ and surrounding text), GD on a tensor factorization tends to produce solutions with low rank, where rank is defined in the context of tensors (for a formal definition, and a general intro to tensors and their factorizations, see this <a href="http://www.kolda.net/publication/TensorReview.pdf">excellent survey</a> by Kolda and Bader).
That is, just like in matrix factorization, the implicit regularization in tensor factorization also strives to minimize rank!
Here’s a representative result from one of our experiments:</p>
<div style="text-align:center;">
<img style="width:700px;" src="http://www.offconvex.org/assets/reg_dl_not_norm_tf_exp.png" />
<br />
<i><b>Figure 2:</b>
In analogy with matrix factorization, the implicit regularization of tensor factorization (high dimensional extension) strives to find a low (tensor) rank solution.
Plots show reconstruction error and (tensor) rank of final solution on multiple tensor completion problems differing in the number of observations.
GD over tensor factorization is compared against "linear" method $-$ GD over direct parameterization of tensor initialized at zero (this is equivalent to fitting observations while placing zeros in unobserved locations).
</i>
<br />
<br />
</div>
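To make the tensor completion task concrete, here is a small numpy sketch (our illustrative code, not the paper's): GD on a rank-1 CP factorization of a synthetic order-3 tensor, fitting only the observed entries, with gradients derived by hand from the einsum expression.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical ground truth: rank-1 order-3 tensor, ~70% entries observed.
a0, b0, c0 = rng.normal(size=3), rng.normal(size=3), rng.normal(size=3)
T = np.einsum('i,j,k->ijk', a0, b0, c0)
mask = (rng.random(T.shape) < 0.7).astype(float)   # 1 = observed

def obs_loss(a, b, c):
    # Squared error over observed entries, as in Equation (1).
    return np.sum(mask * (np.einsum('i,j,k->ijk', a, b, c) - T) ** 2)

# GD on a rank-1 CP factorization, from small random init.
a, b, c = [rng.normal(scale=0.3, size=3) for _ in range(3)]
loss_init, lr = obs_loss(a, b, c), 0.002
for _ in range(20000):
    R = 2 * mask * (np.einsum('i,j,k->ijk', a, b, c) - T)  # masked residual
    ga = np.einsum('ijk,j,k->i', R, b, c)   # gradient w.r.t. each factor
    gb = np.einsum('ijk,i,k->j', R, a, c)
    gc = np.einsum('ijk,i,j->k', R, a, b)
    a, b, c = a - lr * ga, b - lr * gb, c - lr * gc
loss_final = obs_loss(a, b, c)
```

Here the CP rank is fixed to 1 for simplicity; in the experiments above, the parameterized rank is large and the low (tensor) rank of the found solution is what the implicit regularization delivers.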
<p>So what can tensor factorizations tell us about deep learning?
It turns out that, similarly to how matrix factorizations correspond to prediction of matrix entries via linear neural networks, tensor factorizations can be seen as prediction of tensor entries with a certain type of <em>non-linear</em> neural networks, named <em>convolutional arithmetic circuits</em> (in my PhD I worked a lot on analyzing the expressive power of these models, as well as showing that they work well in practice $-$ see this <a href="https://arxiv.org/pdf/1705.02302.pdf">survey</a> for a soft overview).</p>
<div style="text-align:center;">
<img style="width:900px;" src="http://www.offconvex.org/assets/reg_dl_not_norm_mf_lnn_tf_cac.png" />
<br />
<i><b>Figure 3:</b>
The equivalence between matrix factorizations and linear neural networks extends to an equivalence between tensor factorizations and a certain type of non-linear neural networks named convolutional arithmetic circuits.
</i>
<br />
<br />
</div>
<p>Analogously to how the input-output mapping of a linear neural network can be thought of as a matrix, that of a convolutional
arithmetic circuit is naturally represented by a tensor.
The experiment reported in Figure 2 (and similar ones presented in <a href="https://arxiv.org/pdf/2005.06398.pdf">our paper</a>) thus provides a second example of a neural network architecture whose implicit regularization strives to lower a notion of rank for its input-output mapping.
This leads us to believe that implicit rank minimization may be a general phenomenon, and developing notions of rank for input-output mappings of contemporary models may be key to explaining generalization in deep learning.</p>
<p><a href="http://www.cohennadav.com/">Nadav Cohen</a></p>
Fri, 27 Nov 2020 01:00:00 -0800
http://offconvex.github.io/2020/11/27/reg_dl_not_norm/
http://offconvex.github.io/2020/11/27/reg_dl_not_norm/
How to allow deep learning on your data without revealing the data
<p>Today’s online world and the emerging internet of things is built around a Faustian bargain: consumers (and their internet of things) hand over their data, and in return get customization of the world to their needs. Is this exchange of privacy for convenience inherent? At first sight one sees no way around because, of course, to allow machine learning on our data we have to hand our data over to the training algorithm.</p>
<p>Similar issues arise in settings other than consumer devices. For instance, hospitals may wish to pool together their patient data to train a large deep model. But privacy laws such as HIPAA forbid them from sharing the data itself, so somehow they have to train a deep net on their data without revealing their data. Frameworks such as Federated Learning (<a href="https://arxiv.org/abs/1610.05492">Konečný et al., 2016</a>) have been proposed for this but it is known that sharing gradients in that environment leaks a lot of information about the data (<a href="https://arxiv.org/abs/1906.08935">Zhu et al., 2019</a>).</p>
<p>Methods to achieve this could completely change the privacy/utility tradeoffs implicit in today’s organization of the online world.</p>
<p>This blog post discusses the current set of solutions, how they don’t quite suffice for above questions, and the story of a new solution, <a href="http://arxiv.org/abs/2010.02772">InstaHide</a>, that we proposed, and takeaways from a recent attack on it by Carlini et al.</p>
<h2 id="existing-solutions-in-cryptography">Existing solutions in Cryptography</h2>
<p>Classic solutions in cryptography do allow you to in principle outsource any computation to the cloud without revealing your data. (A modern method is Fully Homomorphic Encryption.) Adapting these ideas to machine learning presents two major obstacles: (a) (serious issue) huge computational overhead, which essentially rules it out for today’s large scale deep models (b) (less serious issue) need for special setups —e.g., requiring every user to sign up for public-key encryption.</p>
<p>Significant research efforts are being made to try to overcome these obstacles and we won’t survey them here.</p>
<h2 id="differential-privacy-dp">Differential Privacy (DP)</h2>
<p>Differential privacy (<a href="https://www.iacr.org/archive/eurocrypt2006/40040493/40040493.pdf">Dwork et al., 2006</a>, <a href="https://www.cis.upenn.edu/~aaroth/Papers/privacybook.pdf">Dwork&Roth, 2014</a>) involves adding carefully calculated amounts of noise during training. This is a modern and rigorous version of classic <em>data anonymization</em> techniques whose canonical application is release of noised census data to protect privacy of individuals.</p>
<p>This notion was adapted to machine learning by positing that “privacy” in machine learning refers to trained classifiers not being dependent on the data of individuals. In other words, if the classifier is trained on data from $N$ individuals, its behavior should be essentially unchanged (statistically speaking) if we omit data from any individual. Note that this is a weak notion of privacy: it does not in any way hide the data from the company.</p>
<p>Many tech companies have adopted differential privacy in deployed systems but the following two caveats are important.</p>
<blockquote>
<p>(Caveat 1): In deep learning applications, DP’s provable guarantees are very weak.</p>
</blockquote>
<p>Applying DP to deep learning involves noticing that the gradient computation amounts to adding gradients of the loss corresponding to individual data points, and that adding noise to those individual gradients in calculated doses can help make the overall classifier limit its dependence on the individual’s datapoint.</p>
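The mechanism can be sketched in a few lines of numpy (a simplified illustration of per-example clipping plus Gaussian noise in the style of DP-SGD; the clip norm and noise multiplier here are arbitrary values of ours, not recommended settings):

```python
import numpy as np

def dp_sgd_gradient(per_example_grads, clip_norm, noise_mult, seed=0):
    """Average per-example gradients after clipping each one and adding
    Gaussian noise calibrated to the clip norm (DP-SGD style sketch)."""
    rng = np.random.default_rng(seed)
    clipped = [g * min(1.0, clip_norm / (np.linalg.norm(g) + 1e-12))
               for g in per_example_grads]
    total = np.sum(clipped, axis=0)
    total += rng.normal(scale=noise_mult * clip_norm, size=total.shape)
    return total / len(per_example_grads)

# Each entry is one example's gradient (toy 2-parameter model).
grads = [np.array([3.0, 4.0]), np.array([0.1, 0.0])]
g = dp_sgd_gradient(grads, clip_norm=1.0, noise_mult=0.5)
```

Clipping bounds any single example's influence on the update; the noise then masks whatever influence remains.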
<p>In practice provable bounds require adding so much gradient noise that accuracy of the trained classifier plummets. We do not know of any successful training that achieved accuracy > 75 percent on CIFAR10 (or any that achieved accuracy even 10 percent on ImageNet). Furthermore, achieving this level of accuracy involves <strong>pretraining</strong> the classifier model on a large set of <strong>public</strong> images and then using the private/protected images only to fine-tune the parameters.</p>
<p>Thus it is no surprise that firms today usually apply DP with a very low noise level, which gives essentially no guarantees. Which brings us to:</p>
<blockquote>
<p>(Caveat 2): DP’s guarantees (and even weaker guarantees applying to deployment scenarios) possibly act as a fig leaf that allows firms to not address the kinds of privacy violations that the person on the street actually worries about.</p>
</blockquote>
<p>DP’s provable guarantee (which as noted, does not hold in deployed systems due to the low noise level used) would only ensure that a deployed ML software that was trained with data from tens of millions of users will not change its behavior depending upon private information of any single user.</p>
<p>But that threat model would seem remote to the person on the street. The privacy issue they worry about more is that copious amounts of their data are continuously collected/stored/mined/sold, often by entities they do not even know about. While lax regulation is primarily to blame, there is also the technical hurdle that there is no <strong>practical way</strong> for consumers to hide their data while at the same time benefiting from customized ML solutions that improve their lives.</p>
<p>Which brings us to the question we started with: <em>Could consumers allow machine learning to be done on their data without revealing their data?</em></p>
<h2 id="a-proposed-solution-instahide">A proposed solution: InstaHide</h2>
<p>InstaHide is a new concept: it hides or “encrypts” images to protect them somewhat, while still allowing standard deep learning pipelines to be applied on them. The deep model is trained entirely on encrypted images.</p>
<ul>
<li>
<p>The training speed and accuracy is only slightly worse than vanilla training: one can achieve a test accuracy of ~ 90 percent on CIFAR10 using encrypted images with a computation overhead $< 5$ percent.</p>
</li>
<li>
<p>When it comes to privacy, like every other form of cryptography, its security is based upon conjectured difficulty of the underlying computational problem.
(But we don’t expect breaking it to be as difficult as say breaking RSA.)</p>
</li>
</ul>
<h3 id="how-instahide-encryption-works">How InstaHide encryption works</h3>
<p>Here are some details. InstaHide belongs to the class of subset-sum type encryptions (<a href="https://www.cs.cmu.edu/afs/cs/user/dwoodruf/www/biwx.pdf">Bhattacharyya et al., 2011</a>), and was inspired by a data augmentation technique called Mixup (<a href="https://arxiv.org/abs/1710.09412">Zhang et al., 2018</a>). It views images as vectors of pixel values. With vectors you can take linear combinations. The figure below shows the result of a typical Mixup: adding 0.6 times the bird image to 0.4 times the airplane image. The image labels can also be treated as one-hot vectors, and they are mixed using the same coefficients as the images.</p>
<p style="text-align:center;">
<img src="/assets/mixup.png" width="60%" />
</p>
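In code, Mixup is a one-liner applied to both images and labels (a minimal sketch; the stand-in "bird" and "airplane" arrays and the class indices are hypothetical):

```python
import numpy as np

def mixup(x1, y1, x2, y2, lam):
    """Mix two images (pixel vectors) and their one-hot labels with weight lam."""
    return lam * x1 + (1 - lam) * x2, lam * y1 + (1 - lam) * y2

rng = np.random.default_rng(0)
bird, airplane = rng.random(32 * 32 * 3), rng.random(32 * 32 * 3)  # stand-ins
y_bird, y_airplane = np.eye(10)[2], np.eye(10)[0]   # hypothetical class ids
x_mix, y_mix = mixup(bird, y_bird, airplane, y_airplane, lam=0.6)
```

The mixed label carries weight 0.6 on the bird class and 0.4 on the airplane class, matching the pixel mixture.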
<p>To encrypt the bird image, InstaHide does mixup (i.e., combination with nonnegative coefficients) with one other randomly chosen training image, and with two other images chosen randomly from a large public dataset like ImageNet. The coefficients (0.6, 0.4, etc. in the figure) are also chosen at random. Then it takes this composite image and, for every pixel value, randomly flips the sign. With that, we get the encrypted images and labels. All random choices made in this encryption act as a one-time key that is never re-used to encrypt other images.</p>
<p>InstaHide has a parameter $k$ denoting how many images are mixed; in the picture, we have $k=4$. The figure below shows this encryption mechanism.</p>
<p style="text-align:center;">
<img src="/assets/instahide.png" width="80%" />
</p>
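Here is a simplified sketch of the encryption step in numpy (our own illustrative code, not the authors' implementation; sampling coefficients from a Dirichlet distribution and the tiny stand-in datasets are simplifying assumptions of ours):

```python
import numpy as np

def instahide_encrypt(x, private_pool, public_pool, k=4, seed=None):
    """One-time InstaHide-style encryption of image x (a pixel vector).

    Mixes x with one random private image and k-2 random public images
    using random nonnegative coefficients, then flips the sign of each
    pixel independently. Labels (not shown) are mixed with the same
    coefficients of the private images; all random choices form a
    one-time key that is never reused.
    """
    rng = np.random.default_rng(seed)
    others = [private_pool[rng.integers(len(private_pool))]]
    others += [public_pool[rng.integers(len(public_pool))] for _ in range(k - 2)]
    lam = rng.dirichlet(np.ones(k))          # random coefficients, sum to 1
    mixed = lam[0] * x + sum(l * im for l, im in zip(lam[1:], others))
    signs = rng.choice([-1.0, 1.0], size=x.shape)   # one-time sign mask
    return signs * mixed

rng = np.random.default_rng(0)
private = rng.random((10, 8))     # tiny stand-in datasets
public = rng.random((1000, 8))
enc = instahide_encrypt(private[0], private[1:], public, k=4, seed=1)
```

Because the mixture is convex and the inputs lie in $[0, 1)$, each encrypted pixel lies in $(-1, 1)$; a fresh key (coefficients, partner images, sign mask) is drawn for every encryption.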
<p>When plugged into standard deep learning with a private dataset of $n$ images, in each epoch of training (say $T$ epochs in total), InstaHide re-encrypts each image in the dataset using a random one-time key. This gives $n\times T$ encrypted images in total.</p>
<h3 id="the-security-argument">The security argument</h3>
<p>We conjectured, based upon intuitions from the computational complexity of the k-vector-subset-sum problem (citations), that extracting information about the images could take time $N^{k-2}$. Here $N$, the size of the public dataset, can be tens or hundreds of millions, so the attack might be infeasible for real-life attackers.</p>
<p>We also released a <a href="https://github.com/Hazelsuko07/InstaHide_Challenge">challenge dataset</a> with $k=6, n=100, T=50$ to enable further investigation of InstaHide’s security.</p>
<h2 id="carlini-et-als-recent-attack-on-instahide">Carlini et al.’s recent attack on InstaHide</h2>
<p>Recently, Carlini et al. have shared with us a manuscript with a two-step reconstruction attack (<a href="https://arxiv.org/pdf/2011.05315.pdf">Carlini et al., 2020</a>) against InstaHide.</p>
<p><strong><em>TL;DR: They used 11 hours on Google’s best GPUs to get partial recovery of our 100 challenge encryptions and 120 CPU hours to break the encryption completely. Furthermore, the latter was possible entirely because we used an insecure random number generator, and they used exhaustive search over random seeds.</em></strong></p>
<p>Now the details.</p>
<p>The attack takes $n\times T$ InstaHide-encrypted images as the input, ($n$ is the size of the private dataset, $T$ is the number of training epochs), and returns a reconstruction of the private dataset. It goes as follows.</p>
<ul>
<li>
<p>Map the $n \times T$ encryptions to the $n$ private images, by clustering encryptions of the same private image into a group. This is achieved by first building a graph representing pairwise similarity between encrypted images, and then assigning each encryption to a private image. In their implementation, they train a neural network to annotate pairwise similarity between encryptions.</p>
</li>
<li>
<p>Then, given the encrypted images and the mapping, they solve a nonlinear optimization problem via gradient descent to recover an approximation of the original private dataset.</p>
</li>
</ul>
<p>Using Google’s powerful GPUs, it took them 10 hours to train the neural network for similarity annotation, and about another hour to get an approximation of our challenge set of $100$ images with $k=6, n=100, T=50$. This gave them vaguely correct images, with significant unclear areas and color shifts.</p>
<p>They also proposed a different strategy which abuses a vulnerability of NumPy and PyTorch’s random number generator (<em>Aargh; we didn’t use a secure random number generator.</em>) They did a brute-force search over the $2^{32}$ possible initial random seeds, which allowed them to reproduce the randomness used during encryption and thus perform a pixel-perfect reconstruction. As they reported, this attack takes 120 CPU hours (parallelized across 100 cores, it yields the solution in a little over an hour). We will fix this implementation flaw in an updated version.</p>
<h3 id="thoughts-on-this-attack">Thoughts on this attack</h3>
<p>Though the attack is clever and impressive, we feel that the long-term take-away is still unclear for several reasons.</p>
<blockquote>
<p>Variants of InstaHide seem to evade the attack.</p>
</blockquote>
<p>The challenge set contained 50 encryptions each of 100 images. This corresponds to using encrypted images for 50 epochs. But as done in existing settings that use DP, one can pretrain the deep model using non-private images and then fine-tune it with fewer epochs on the private images. Using a pipeline similar to DPSGD (<a href="https://arxiv.org/abs/1607.00133">Abadi et al., 2016</a>), pretraining a ResNet-18 on CIFAR100 (the public dataset) and fine-tuning for $10$ epochs on CIFAR10 (the private dataset) gives an accuracy of 83 percent, still far better than any provable guarantee using DP on this dataset. The Carlini et al. team conceded that their attack probably would not work in this setting.</p>
<p>Similarly, using InstaHide purely at inference time (i.e., using ML rather than training ML) should still be completely secure, since only one encryption of each image is released. The new attack cannot work here at all.</p>
<blockquote>
<p>InstaHide was never intended to be a mission-critical encryption like RSA (which by the way also has no provable guarantees).</p>
</blockquote>
<p>InstaHide is designed to give users and the internet of things a <em>light-weight</em> encryption method that allows them to use machine learning without giving eavesdroppers or servers access to their raw data. There is no other cost-effective alternative to InstaHide for this application. If it takes Google’s powerful computers a few hours to break our challenge set of 100 images, this is not yet a cost-effective attack in the intended settings.</p>
<p>More important, the challenge dataset corresponded to an ambitious form of security, where the encrypted images themselves are released to the world. The more typical application is a Federated Learning (<a href="https://arxiv.org/abs/1610.05492">Konečný et al., 2016</a>) scenario: the adversary observes shared gradients that are computed using encrypted images (he also has access to the trained model). The attacks in this paper do not currently apply to that scenario. This is also the idea in <a href="https://arxiv.org/abs/2010.06053"><strong>TextHide</strong></a>, an adaptation of InstaHide to text data.</p>
<h2 id="takeways">Takeways</h2>
<p>Users need lightweight encryptions that can be applied in real time to large amounts of data, and yet allow them to benefit from machine learning on the cloud. Methods to do so could completely change the privacy/utility tradeoffs implicitly assumed in today’s tech world.</p>
<p>InstaHide is the only such tool right now, and we now know that it provides moderate security that may be enough for many applications.</p>
Wed, 11 Nov 2020 02:00:00 -0800
http://offconvex.github.io/2020/11/11/instahide/
http://offconvex.github.io/2020/11/11/instahide/
Mismatches between Traditional Optimization Analyses and Modern Deep Learning
<p>You may remember our <a href="http://www.offconvex.org/2020/04/24/ExpLR1/">previous blog post</a> showing that it is possible to do state-of-the-art deep learning with a learning rate that increases exponentially during training. It was meant to be a dramatic illustration that what we learned in optimization classes and books isn’t always a good fit for modern deep learning, specifically <em>normalized nets</em>, which is our term for nets that use any one of the popular normalization schemes, e.g. <a href="https://arxiv.org/abs/1502.03167">BatchNorm (BN)</a>, <a href="https://arxiv.org/abs/1803.08494">GroupNorm (GN)</a>, <a href="https://arxiv.org/abs/1602.07868">WeightNorm (WN)</a>. Today’s post (based upon <a href="https://arxiv.org/abs/2010.02916">our paper</a> with Kaifeng Lyu at NeurIPS20) identifies other surprising incompatibilities between normalized nets and traditional analyses. We hope this will change the way you teach and think about deep learning!</p>
<p>Before diving into the results, we recall that normalized nets are typically trained with weight decay (aka $\ell_2$ regularization). Thus the $t$th iteration of Stochastic Gradient Descent (SGD) is:</p>
\[w_{t+1} \gets (1-\eta_t\lambda)w_t - \eta_t \nabla \mathcal{L}(w_t; \mathcal{B}_t),\]
<p>where $\lambda$ is the weight decay (WD) factor (or $\ell_2$-regularization coefficient), $\eta_t$ the learning rate, $\mathcal{B}_t$ the batch, and $\nabla \mathcal{L}(w_t; \mathcal{B}_t)$ the batch gradient.</p>
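To make the update concrete, here is a minimal numpy sketch of one such step. The quadratic batch loss below is a stand-in of our own choosing for an actual network's loss, not the paper's setup:

```python
import numpy as np

def sgd_wd_step(w, grad_fn, batch, lr, wd):
    """One SGD step with weight decay:
    w <- (1 - lr*wd) * w - lr * grad L(w; batch)."""
    return (1.0 - lr * wd) * w - lr * grad_fn(w, batch)

# Stand-in batch loss: L(w; B) = mean over b in B of 0.5*||w - b||^2,
# whose batch gradient is simply w - mean(B).
def grad_fn(w, batch):
    return w - batch.mean(axis=0)

w = np.array([1.0, -2.0])
batch = np.array([[0.5, 0.5], [1.5, -0.5]])  # batch mean = [1.0, 0.0]
w_new = sgd_wd_step(w, grad_fn, batch, lr=0.1, wd=0.01)
```

Note the two distinct multiplicative roles of the learning rate here (on the WD term and on the gradient term), which will matter later in the post.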
<p>As sketched in our previous blog post, under fairly mild assumptions (namely, fixing the top layer during random initialization —which empirically does not hurt final accuracy) the loss function for training such normalized nets is <em>scale invariant</em>, which means $\mathcal{L}(w _ t; \mathcal{B}_ t)=\mathcal{L}(cw _ t; \mathcal{B} _ t)$, $\forall c>0$.</p>
<p>A consequence of scale invariance is that $ \nabla _ w \mathcal{L} \vert _ {w = w _ 0} = c \nabla _ w \mathcal{L}\vert _ {w = cw _ 0}$ and $\nabla ^ 2 _ w \mathcal{L} \vert _ {w = w _ 0} = c ^ 2 \nabla ^ 2 _ w \mathcal{L} \vert _ {w = cw _ 0}$, for any $c>0$.</p>
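Both invariance and the gradient identity are easy to check numerically on a toy scale-invariant loss, e.g. $L(x,y) = x^2/(x^2+y^2)$ (the same toy loss that appears in Figure 1 below); a minimal sketch:

```python
import numpy as np

def loss(w):
    # Scale-invariant toy loss L(x, y) = x^2 / (x^2 + y^2).
    x, y = w
    return x**2 / (x**2 + y**2)

def grad(w):
    # Analytic gradient of the toy loss.
    x, y = w
    d = (x**2 + y**2)**2
    return np.array([2 * x * y**2 / d, -2 * x**2 * y / d])

w = np.array([0.7, -1.3])
c = 3.0
# Scale invariance:     L(w) = L(c w)
# Gradient identity:    grad L at w  =  c * (grad L at c w)
```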
<h2 id="some-conventional-wisdoms-cws">Some Conventional Wisdoms (CWs)</h2>
<p>Now we briefly describe some conventional wisdoms. Needless to say, by the end of this post these will turn out to be very suspect! Possibly they were OK in the earlier days of deep learning, with shallower nets.</p>
<blockquote>
<p>CW 1) As we reduce the LR to zero, the optimization dynamics converge to a deterministic path (Gradient Flow) along which the training loss strictly decreases.</p>
</blockquote>
<p>Recall that in the traditional analysis of (deterministic) gradient descent, if the LR is smaller than roughly the inverse of the smoothness of the loss function, then each step reduces the loss. SGD, being stochastic, has a distribution over possible paths. But at very tiny LR it can be thought of as full-batch Gradient Descent (GD), which in the limit of infinitesimal step size approaches Gradient Flow (GF).</p>
<p>The above reasoning shows that a very small LR is guaranteed to decrease the loss at least as well as any higher LR can. Of course, in deep learning we care not only about optimization but also about generalization, and here a small LR is believed to hurt.</p>
<blockquote>
<p>CW 2) To achieve the best generalization the LR must be large initially for quite a few epochs.</p>
</blockquote>
<p>This is primarily an empirical finding: using too-small learning rates or too-large batch sizes from the start (all other hyper-parameters being fixed) is known to lead to worse generalization (<a href="https://arxiv.org/pdf/1206.5533.pdf">Bengio, 2012</a>; <a href="https://arxiv.org/abs/1609.04836">Keskar et al., 2017</a>).</p>
<p>A popular explanation for this phenomenon is that the noise in gradient estimation during SGD is beneficial for generalization. (As noted, this noise tends to average out when the LR is very small.) Many authors have suggested that the noise helps because it keeps the trajectory away from sharp minima, which are believed to generalize worse, although there is some difference of opinion here (<a href="http://www.bioinf.jku.at/publications/older/3304.pdf">Hochreiter&Schmidhuber, 1997</a>; <a href="https://arxiv.org/abs/1609.04836">Keskar et al., 2017</a>; <a href="https://arxiv.org/abs/1712.09913">Li et al., 2018</a>; <a href="https://arxiv.org/abs/1803.05407">Izmailov et al., 2018</a>; <a href="https://arxiv.org/pdf/1902.00744.pdf">He et al., 2019</a>). <a href="https://arxiv.org/abs/1907.04595">Li et al., 2019</a> also gave an example (a simple two-layer net) where this observation of worse generalization due to small LR is mathematically proved and experimentally verified.</p>
<blockquote>
<p>CW 3) SGD should be modeled via a Stochastic Differential Equation (SDE) in the continuous-time limit, with fixed Gaussian noise. Namely, think of SGD as a diffusion process that <strong>mixes</strong> to some Gibbs-like distribution on trained nets.</p>
</blockquote>
<p>This is the usual approach to formal understanding of CW 2 (<a href="https://arxiv.org/abs/1710.06451">Smith&Le, 2018</a>; <a href="https://arxiv.org/abs/1710.11029">Chaudhari&Soatto, 2018</a>; <a href="https://arxiv.org/abs/2004.06977">Shi et al., 2020</a>). The idea is that SGD is gradient descent with a noise term, which has a continuous-time approximation as a diffusion process described as</p>
\[dW_t = - \eta_t \lambda W_t dt - \eta_t \nabla \mathcal{L}(W_t) dt + \eta_t \Sigma_{W_t}^{1/2} dB_t,\]
<p>where $\Sigma_{W_t}$ is the covariance of the stochastic gradient $ \nabla \mathcal{L}(w_t; \mathcal{B}_t)$, and $B_t$ denotes Brownian motion of the appropriate dimension. Several works have adopted this SDE view and given some rigorous analysis of the effect of noise.</p>
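To see where the noise term comes from, here is a rough one-step discretization sketch of the SDE above (informal; we gloss over time-rescaling conventions):

```latex
% One Euler--Maruyama step of the SDE with unit time step (dt = 1):
W_{t+1} - W_t \;\approx\;
  -\eta_t \lambda W_t \;-\; \eta_t \nabla \mathcal{L}(W_t)
  \;+\; \eta_t \Sigma_{W_t}^{1/2} \xi_t,
\qquad \xi_t \sim N(0, I).
%
% Compare with SGD: writing the batch gradient as the full gradient plus
% zero-mean noise z_t with covariance \Sigma_{W_t},
%     \nabla \mathcal{L}(W_t; \mathcal{B}_t) = \nabla \mathcal{L}(W_t) + z_t,
% the SGD update  W_{t+1} = (1 - \eta_t\lambda) W_t
%                           - \eta_t \nabla \mathcal{L}(W_t; \mathcal{B}_t)
% matches the step above once z_t is approximated by the
% Gaussian \Sigma_{W_t}^{1/2} \xi_t.
```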
<p>In this story, SGD turns into a geometric random walk in the landscape, which can in principle explore the landscape more thoroughly, for instance by occasionally making loss-increasing steps. While an appealing view, rigorous analysis is difficult because we lack a mathematical description of the loss landscape. Various papers assume the noise in the SDE is isotropic Gaussian, and then derive an expression for the stationary distribution of the random walk in terms of the familiar Gibbs distribution. This view gives an intuitively appealing explanation of some deep learning phenomena, since the magnitude of the noise (which is related to LR and batch size) controls the convergence speed and other properties. For instance, this SDE approximation implies the well-known <em>linear scaling rule</em> (<a href="https://arxiv.org/pdf/1706.02677.pdf">Goyal et al., 2017</a>).</p>
<p>Which raises the question: <em>does SGD really behave like a diffusion process that mixes in the loss landscape?</em></p>
<h2 id="conventional-wisdom-challenged">Conventional Wisdom challenged</h2>
<p>We now describe the actual discoveries for normalized nets, which show that the above CWs are quite off.</p>
<blockquote>
<p>(Against CW 1): Full-batch gradient descent $\neq$ gradient flow.</p>
</blockquote>
<p>It’s well known that if the LR is smaller than the inverse of the smoothness, then the trajectory of gradient descent stays close to that of gradient flow. But for normalized networks, the loss function is scale-invariant and thus provably non-smooth (i.e., the smoothness becomes unbounded) around the origin (<a href="https://arxiv.org/abs/1910.07454">Li&Arora, 2019</a>). We show that this non-smoothness issue is very real: it makes training with full-batch gradient descent unstable and even chaotic at any nonzero learning rate. This occurs empirically, and provably so for some toy losses.</p>
<div style="text-align:center;">
<img style="width:60%;" src="https://www.cs.princeton.edu/~zl4/small_lr_blog_images/additional_blog_image/gd_not_gf.png" />
</div>
<p><strong>Figure 1.</strong> WD makes GD on a scale-invariant loss unstable and chaotic.
(a) Toy model with scale-invariant loss $L(x,y) = \frac{x^2}{x^2+y^2}$. (b)(c) Convergence never truly happens for a ResNet trained with full-batch GD (without momentum) on sub-sampled CIFAR10 containing 1000 images. The ResNet easily reaches 100% training accuracy but then veers off. When WD is turned off at epoch 30000, it converges.</p>
<p>Note that WD plays a crucial role in this effect: without WD the parameter norm increases monotonically
(<a href="https://arxiv.org/abs/1812.03981">Arora et al., 2018</a>), which implies SGD moves away from the origin at all times.</p>
<p>Savvy readers might wonder whether using a smaller LR could fix this issue. Unfortunately, getting close to the origin is unavoidable: once the gradient gets small, WD dominates the dynamics and decreases the norm at a geometric rate, causing the gradient to rise again due to scale invariance! (This happens so long as the gradient gets arbitrarily small but never actually reaches zero, as is the case in practice.)</p>
<p>In fact, this is an excellent (and rare) place where early stopping is necessary even for correct optimization of the loss.</p>
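This feedback loop can be seen on the toy scale-invariant loss from Figure 1: while the gradient is small, weight decay shrinks the parameter norm roughly like $(1-\eta\lambda)^t$, and by scale invariance the gradient norm scales as $1/\|w\|$, so it must eventually rise. A minimal sketch (toy loss and hyperparameters chosen purely for illustration):

```python
import numpy as np

def grad(w):
    # Gradient of the scale-invariant toy loss L(x, y) = x^2 / (x^2 + y^2).
    x, y = w
    d = (x**2 + y**2)**2
    return np.array([2 * x * y**2 / d, -2 * x**2 * y / d])

eta, lam = 0.01, 1.0
w = np.array([1e-3, 1.0])        # near the minimum at x = 0: tiny gradient
for _ in range(100):
    # GD + WD: w <- (1 - eta*lam) w - eta * grad L(w).
    # With a tiny gradient, WD shrinks the norm by ~(1 - eta*lam) per step.
    w = (1.0 - eta * lam) * w - eta * grad(w)

norm_w = np.linalg.norm(w)       # roughly 0.99**100 ~ 0.37
# Scale invariance makes grad (-1)-homogeneous, so the gradient norm at w
# equals the gradient norm at the normalized point w/||w||, divided by ||w||:
g_actual = np.linalg.norm(grad(w))
g_scaled = np.linalg.norm(grad(w / norm_w)) / norm_w
```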
<blockquote>
<p>(Against CW 2) Small LR can generalize as well as large LR.</p>
</blockquote>
<p>This actually was a prediction of the new theoretical analysis we came up with. We ran extensive experiments to test this prediction and found that initial large LR is <strong>not necessary</strong> to match the best performance, even when <em>all the other hyperparameters are fixed</em>. See Figure 2.</p>
<div style="text-align:center;">
<img style="width:300px;" src="https://www.cs.princeton.edu/~zl4/small_lr_blog_images/additional_blog_image/blog_sgd_8000_test_acc.png" />
<img style="width:300px;" src="https://www.cs.princeton.edu/~zl4/small_lr_blog_images/additional_blog_image/blog_sgd_8000_train_acc.png" />
</div>
<p><strong>Figure 2</strong>. ResNet trained on CIFAR10 with SGD, using a normal LR schedule (baseline) as well as a schedule with a 100 times smaller initial LR. The latter matches the performance of the baseline after one more LR decay! Note that it needs about 5000 epochs, roughly 10x more than usual. See our paper for details. (Batch size is 128, WD is 0.0005, and LR is divided by 10 at each decay.)</p>
<p>Note the surprise here is that generalization was not hurt from drastically smaller LR even <em>when no other hyperparameter changes</em>. It is known empirically as well as rigorously (Lemma 2.4 in <a href="https://arxiv.org/abs/1910.07454">Li&Arora, 2019</a>) that it is possible to compensate for small LR by other hyperparameter changes.</p>
<blockquote>
<p>(Against CW 3) The random walk/SDE view of SGD is way off: there is no evidence of mixing as traditionally understood, at least within normal training times.</p>
</blockquote>
<p>Actually, evidence against global mixing already exists via the phenomenon of Stochastic Weight Averaging (SWA) (<a href="https://arxiv.org/abs/1803.05407">Izmailov et al., 2018</a>). Along the trajectory of SGD, if the network parameters from two different epochs are averaged, then the average has lower test loss than either. Improvement via averaging continues to work for run times 10x longer than usual, as shown in Figure 3. However, the accuracy improvement does not happen when SWA is applied to two solutions obtained from different initializations. Thus checking whether SWA helps distinguishes between pairs of solutions drawn from the same trajectory and pairs drawn from different trajectories, which shows the diffusion process hasn’t mixed to its stationary distribution within normal training times. (This is not surprising, since the theoretical analysis of mixing does not suggest it happens rapidly at all.)</p>
<div style="text-align:center;">
<img style="width:300px;" src="https://www.cs.princeton.edu/~zl4/small_lr_blog_images/additional_blog_image/swa_sgd_test_acc.png" />
<img style="width:300px;" src="https://www.cs.princeton.edu/~zl4/small_lr_blog_images/additional_blog_image/swa_sgd_dist.png" />
</div>
<p><strong>Figure 3</strong>. Stochastic Weight Averaging improves the test accuracy of ResNet trained with
SGD on CIFAR10. <strong>Left:</strong> Test accuracy. <strong>Right:</strong> Pairwise distance between parameters from different epochs.</p>
<p>Actually <a href="https://arxiv.org/abs/1803.05407">Izmailov et al., 2018</a> already noticed that SWA rules out the view of SGD as a diffusion process mixing to a unique global equilibrium. They suggested instead that the trajectory of SGD might be well-approximated by a multivariate Ornstein-Uhlenbeck (OU) process around a <em>local minimizer</em> $W^ * $, assuming the loss surface is locally strongly convex. As the corresponding stationary distribution is a multivariate Gaussian $N(W^ *, \Sigma)$ centered at the local minimizer, this explains why SWA helps to reduce the training loss.</p>
<p>However, we note that (<a href="https://arxiv.org/abs/1803.05407">Izmailov et al., 2018</a>)’s suggestion is also refuted by the observation that the $\ell_2$ distance between weights from epochs $T$ and $T+\Delta$ increases monotonically in $\Delta$ for every $T$ (see Figure 3), whereas in the OU process $ \mathbf{E} [ \Vert W_ T-W_ {T+\Delta} \Vert^2]$ should converge to the constant $2\mathrm{Tr}[\Sigma]$ as $T, \Delta \to +\infty$. This suggests that all these weights are correlated, unlike in the hypothesized OU process.</p>
<h2 id="so-whats-really-going-on">So what’s really going on?</h2>
<p>We develop a new theory (some parts rigorously proved and others supported by experiments) suggesting that <strong>LR doesn’t play the role assumed in most discussions.</strong></p>
<p>It’s widely believed that the LR $\eta$ controls the convergence rate of SGD and affects generalization by changing the magnitude of the noise, since $\eta$ adjusts the magnitude of the gradient update per step.
However, for normalized networks trained with SGD + WD, the effect of LR is more subtle, as it now plays two roles: (1) the multiplier before the gradient of the loss; (2) the multiplier before WD. Intuitively, one imagines the WD part is useless since the loss function is scale-invariant, and thus the first role must be the more important one. Surprisingly, this intuition is completely wrong: it turns out the second role matters far more than the first.
Further analysis shows that a better measure of the speed of learning is $\eta \lambda$, which we call the <em>intrinsic learning rate</em> or <em>intrinsic LR</em>, denoted $\lambda_e$.</p>
<p>While previous papers have noticed qualitatively that LR and WD interact closely, our ExpLR paper (<a href="https://arxiv.org/abs/1910.07454">Li&Arora, 2019</a>) gave a mathematical proof that <em>if the product WD × LR, i.e., $\lambda\eta$, is fixed, then the effect of changing LR on the dynamics is equivalent to rescaling the initial parameters</em>. As far as we can tell, the performance of SGD on modern architectures is quite robust to (indeed usually independent of) the scale of the initialization, so the effect of changing the initial LR while keeping the intrinsic LR fixed is also negligible.</p>
<p>Our paper gives insight into the role of intrinsic LR $\lambda_e$ by giving a new SDE-style analysis of SGD for normalized nets, leading to the following conclusion (which rests in part on experiments):</p>
<blockquote>
<p>In normalized nets SGD does indeed lead to rapid mixing, but in <strong>function space</strong> (i.e., input-output behavior of the net). Mixing happens after $O(1/\lambda_e)$ iterations, in contrast to the exponentially slow mixing guaranteed in the parameter space by traditional analysis of diffusion walks.</p>
</blockquote>
<p>To explain the meaning of mixing in function space, let’s view SGD (carried out for a fixed number of steps) as a way to sample a trained net from a distribution over trained nets. Thus the end result of SGD from a fixed initialization can be viewed as a probabilistic classifier whose output on any datapoint is the $K$-dimensional vector whose $i$th coordinate is the probability of outputting label $i$. (Here $K$ is the total number of labels.) Now if two different initializations both cause SGD to produce classifiers with error $5$ percent on held-out datapoints, then <em>a priori</em> one would imagine that on a given held-out datapoint the classifier from the first distribution <strong>disagrees</strong> with the classifier from the second distribution with roughly $2 * 5 = 10$ percent probability. (More precisely, $2 * 5 * (1-0.05) = 9.5$ percent.) However, convergence to an equilibrium distribution in function space means that the probability of disagreement is almost $0$, i.e., the distribution is almost the same regardless of the initialization! This is indeed what we experimentally find, to our big surprise. Our theory is built around this new phenomenon.</p>
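The a-priori disagreement estimate in this back-of-envelope argument can be checked with a quick simulation: two independent classifiers that each err with probability $0.05$ (assuming, for simplicity, that an error always lands on the same single wrong label) disagree exactly when one of them errs and the other does not:

```python
import numpy as np

rng = np.random.default_rng(0)
p, n = 0.05, 1_000_000
# Each classifier errs independently with probability p; under the
# simplifying assumption that an error always produces the same wrong
# label, the two classifiers disagree iff exactly one of them errs.
err1 = rng.random(n) < p
err2 = rng.random(n) < p
disagree = np.mean(err1 != err2)
analytic = 2 * p * (1 - p)   # = 0.095, the "9.5 percent" in the text
```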
<div style="text-align:center;">
<img style="width:500px;" src="https://www.cs.princeton.edu/~zl4/small_lr_blog_images/additional_blog_image/conjecture.png" />
</div>
<p><strong>Figure 4</strong>: A simple 4-layer normalized CNN trained on MNIST with three schedules converges to the same equilibrium after the intrinsic LRs become equal at epoch 81. We use Monte Carlo estimation ($500$ trials) to estimate the $\ell_1$ distances between distributions.</p>
<p>In the next post, we will explain our new theory and the partial new analysis of SDEs arising from SGD in normalized nets.</p>
Wed, 21 Oct 2020 15:00:00 -0700
http://offconvex.github.io/2020/10/21/intrinsicLR/
http://offconvex.github.io/2020/10/21/intrinsicLR/Beyond log-concave sampling<p>As the growing number of posts on this blog would suggest, recent years have seen a lot of progress in understanding optimization beyond convexity. However, optimization is only one of the basic algorithmic primitives in machine learning — it’s used by most forms of risk minimization and model fitting. Another important primitive is sampling, which is used by most forms of inference (i.e. answering probabilistic queries of a learned model).</p>
<p>It turns out that there is a natural analogue of convexity for sampling — <em>log-concavity</em>. Paralleling the state of affairs in optimization, we have a variety of (provably efficient) algorithms for sampling from log-concave distributions, under a variety of access models to the distribution. Log-concavity, however, is very restrictive and cannot model common properties of distributions we frequently wish to sample from in machine learning applications, for example multi-modality and manifold structure in the level sets, which is what we’ll focus on in this and the upcoming post.</p>
<p>Unlike non-convex optimization, the field of sampling beyond log-concavity is very nascent. In this post, we will survey the basic tools and difficulties for sampling beyond log-concavity. In the next post, we will survey recent progress in this direction, in particular with respect to handling multi-modality and manifold structure in the level sets, covering the papers <a href="https://arxiv.org/abs/1812.00793">Simulated tempering Langevin Monte Carlo</a> by Rong Ge, Holden Lee, and Andrej Risteski and <a href="https://arxiv.org/abs/2002.05576">Fast convergence for Langevin diffusion with matrix manifold structure</a> by Ankur Moitra and Andrej Risteski.</p>
<h1 id="formalizing-the-sampling-problem">Formalizing the sampling problem</h1>
<p>The formulation of the sampling problem we will consider is as follows:</p>
<blockquote>
<p><strong>Problem</strong>: Sample from a distribution $p(x) \propto e^{-f(x)}$ given black-box access to $f$ and $\nabla f$.</p>
</blockquote>
<p>This formalization subsumes a lot of inference tasks involving different kinds of probabilistic models. We give several common examples:</p>
<p><em>1. Posterior inference</em>: Suppose our data is generated from a model with <em>unknown</em> parameters $\theta$, such that the data-generation process is given by $p(x \mid \theta)$ and we have a prior $p(\theta)$ over the model parameters. Then the <em>posterior distribution</em> $p(\theta \mid x)$, by Bayes’s rule, is given by</p>
\[p(\theta \mid x) = \frac{p(x \mid \theta)p(\theta)}{p(x)}\propto p(x \mid \theta)p(\theta).\]
<p>A canonical example of this is a <em>noisy inference task</em>, where a signal (parametrized by $\theta$) is perturbed by noise (as specified by $p(x \mid \theta)$).</p>
<p><em>2. Posteriors in latent-variable models</em>: If the data-generation process has a <em>latent (hidden) variable</em> $h$ associated to each data point, such that $h$ has a <em>known</em> prior $p_\theta(h)$ and a <em>known</em> conditional $p_\theta(x \mid h)$, then again by Bayes’s rule, we have</p>
\[p_\theta(h \mid x) = \frac{p_\theta(x \mid h)p_\theta(h)}{p_\theta(x)}\propto p_\theta(x \mid h)p_\theta(h).\]
<p>In typical latent-variable models, $p_\theta(x \mid h)$ and $p_\theta(h)$ have a simple parametric form, which makes it easy to evaluate $p_\theta(x \mid h)p_\theta(h)$. Some examples of latent-variable models are mixture models (where $h$ encodes which component a sample came from), topic models (where $h$ denotes the topic proportions in a document), and noisy-OR networks (and, more generally, latent-variable Bayesian belief networks).</p>
<p><em>3. Sampling from energy models</em>: In energy models, the distribution of the data is parametrized as $p(x) \propto \exp(-E(x))$ for some <em>energy</em> function $E(x)$ which is smaller on points in the data distribution. Recent works by <a href="https://arxiv.org/abs/1907.05600">(Song, Ermon 2019)</a> and <a href="https://arxiv.org/abs/1903.08689">(Du, Mordatch 2019)</a> have scaled up the training of these models on images so that the visual quality of the samples they produce is comparable to that of more popular generative models like GANs and flow models.</p>
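Example 1 above (posterior inference) can be made concrete with a hypothetical toy instance: a scalar signal with prior $\theta \sim N(0,1)$ observed through Gaussian noise, $x \mid \theta \sim N(\theta, \sigma^2)$. The posterior has the closed form $N\big(\frac{x}{1+\sigma^2}, \frac{\sigma^2}{1+\sigma^2}\big)$, and we can check that $p(x\mid\theta)p(\theta)$ matches it up to the normalizing constant:

```python
import numpy as np

sigma2, x = 0.5, 1.2            # noise variance and one observed data point

def log_unnorm_posterior(theta):
    # log p(x | theta) + log p(theta), dropping theta-independent constants.
    return -0.5 * (x - theta) ** 2 / sigma2 - 0.5 * theta ** 2

# Closed-form posterior: N(x / (1 + sigma2), sigma2 / (1 + sigma2)).
mu_post = x / (1 + sigma2)
var_post = sigma2 / (1 + sigma2)

def log_closed_form(theta):
    return -0.5 * (theta - mu_post) ** 2 / var_post

thetas = np.linspace(-3, 3, 7)
# If the two agree up to normalization, this difference is constant in theta.
diff = log_unnorm_posterior(thetas) - log_closed_form(thetas)
```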
<p>The “exponential form” $e^{-f(x)}$ is also helpful in making an analogy to optimization. Namely, if we sample from $p(x)\propto e^{-f(x)}$, a particular point $x$ is more likely to be sampled if $f(x)$ is small. The key difference from optimization is that while in optimization we only want to reach the minimum, in sampling we want to pick points with the correct probabilities.</p>
<h1 id="comparison-with-optimization">Comparison with optimization</h1>
<p>The computational hardness landscape for our sampling problem parallels the one for black-box optimization, in which the goal is to find the minimum of a function $f$, given value/gradient oracle access. When $f$ is <em>convex</em>, there is a unique local minimum, so that local search algorithms like <em>gradient descent</em> are efficient. When $f$ is non-convex, gradient descent can get trapped in potentially poor local minima, and in the worst case, an exponential number of queries is needed.</p>
<p>Similarly, for sampling, when $p$ is <em>log-concave</em>, the distribution is unimodal and a Markov chain which is a close relative of gradient descent — <em>Langevin Monte Carlo</em> — is efficient. When $p$ is non-log-concave, Langevin Monte Carlo can get trapped in one of many modes, and an exponential number of queries may be needed in the worst case.</p>
<blockquote>
<p>A distribution $p(x)\propto e^{-f(x)}$ is <strong>log-concave</strong> if $f(x) = -\log p(x)$ is convex. It is $\alpha$-strongly log-concave if $f(x)$ is $\alpha$-strongly convex.</p>
</blockquote>
<p>However, such worst-case hardness rarely stops practitioners from tackling the non-convex optimization and non-log-concave sampling problems which are ubiquitous in modern machine learning. Often they manage to do so with great success - for instance, in training deep neural networks, gradient descent and its relatives perform quite well. Similarly, Langevin Monte Carlo and its relatives can do quite well on non-log-concave problems, though they sometimes need to be aided by temperature heuristics and other tricks.</p>
<p>As theorists, we’d like to develop theory that will lead to a better understanding of why and when these heuristics work. Just like we’ve done for optimization, we need to be guided both by hardness results and relevant structure of real-world problems in this endeavour.</p>
<p>The following table summarizes the comparisons we have come up with:</p>
<p><img src="http://www.andrew.cmu.edu/user/aristesk/table_opt.jpg" alt="" /></p>
<p>Before we move on to non-log-concave distributions, though, we need to understand the basic algorithm for sampling and its guarantees for log-concave distributions.</p>
<h1 id="langevin-monte-carlo">Langevin Monte Carlo</h1>
<p>Just as gradient descent is the canonical algorithm for optimization, <em>Langevin Monte Carlo</em> (LMC) is the canonical algorithm for our sampling problem. In a nutshell, it is gradient descent that also injects Gaussian noise:</p>
\[\text{Gradient descent:}\quad
x_{t+\eta} = x_t - \eta \nabla f(x_t)\]
\[\text{Langevin Monte Carlo:}\quad
x_{t+\eta} = x_t - \eta \nabla f(x_t) + \sqrt{2\eta}\xi_t,\quad \xi_t\sim N(0,I)\]
<p>Both of these processes can be considered as discretizations of a continuous process. For gradient descent, the limit is an <em>ordinary differential equation</em>, and for Langevin Monte Carlo a <em>stochastic differential equation</em>:</p>
\[\text{Gradient flow:} \quad dx_t = -\nabla f(x_t) dt\]
\[\text{Langevin diffusion:} \quad dx_t = -\nabla f(x_t) dt + \sqrt{2} dB_t\]
<p>where $B_t$ denotes Brownian motion of the appropriate dimension.</p>
<p>The crucial property of the above stochastic differential equation is that under fairly mild assumptions on $f$, the stationary distribution is $p(x) \propto e^{-f(x)}$. (If you’re more comfortable with optimization, note that while gradient descent generally converges to (local) minima, the Gaussian noise term prevents LMC from converging to a single point - rather, it converges to a <em>stationary distribution</em>. See animation below.)</p>
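Here is a minimal LMC sketch for a 1D standard Gaussian target: take $f(x) = x^2/2$, so $p(x) \propto e^{-x^2/2}$ is $N(0,1)$, and the long-run samples should have mean close to $0$ and variance close to $1$ (up to discretization error). Step size and run length are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)

def grad_f(x):
    return x                     # f(x) = x^2 / 2, so the target is N(0, 1)

eta, n_steps = 0.01, 200_000
x = 0.0
samples = np.empty(n_steps)
for t in range(n_steps):
    # LMC: a gradient step plus injected Gaussian noise of scale sqrt(2*eta).
    x = x - eta * grad_f(x) + np.sqrt(2 * eta) * rng.standard_normal()
    samples[t] = x

burn = n_steps // 10             # discard an initial burn-in segment
mean, var = samples[burn:].mean(), samples[burn:].var()
```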
<p><img src="http://www.andrew.cmu.edu/user/aristesk/gd_ld_animated.gif" alt="" /></p>
<p>Langevin Monte Carlo fits in the <em>Markov Chain Monte Carlo</em> (MCMC) paradigm: design a random walk, so that the stationary distribution is the desired distribution. “Mixing” means getting close to the stationary distribution, and rapid mixing means this happens quickly.</p>
<p>Like in optimization, Langevin Monte Carlo is the most “basic” algorithm: for example, one can incorporate “acceleration” and obtain <em>underdamped</em> Langevin, or use the physics-inspired Hamiltonian Monte Carlo.</p>
<h1 id="tools-for-bounding-mixing-time-challenges-beyond-log-concavity">Tools for bounding mixing time, challenges beyond log-concavity</h1>
<p>To illustrate the difficulty in moving beyond log-concavity, we’ll describe the tools that are used to prove fast mixing for log-concave distributions, and where they fall short for non-log-concave distributions.</p>
<p>We will do this by an analogy to how we analyze random walks on graphs. One common way to prove rapid mixing of a random walk on a graph is to show the Laplacian has a spectral gap (equivalently, the transition matrix has a gap between the largest and next-to-largest eigenvalue). The analogue of this for Langevin diffusion is showing a <em>Poincaré inequality</em>. (A spectral gap of $1/C$ corresponds to Poincaré constant of $C$.)</p>
<blockquote>
<p>We say that $p(x)$ satisfies a <strong>Poincaré inequality</strong> with constant $C$ if for all functions $g$ on $\mathbb R^d$ (such that $g$ and $\nabla g$ are square-integrable with respect to $p$),</p>
<div> $$\text{Var}_p(g) \le C \int_{\mathbb R^d} ||\nabla g(x)||^2 p(x)\,dx.$$ </div>
</blockquote>
<p>A small constant $C$ implies fast mixing in $\chi^2$ divergence, which implies fast mixing in total variation distance. More precisely, the mixing time for Langevin diffusion is on the order of $C$. We note that other functional inequalities imply mixing with respect to other measures (such as log-Sobolev inequalities for KL divergence).</p>
<p>While it may not be obvious what the Poincaré inequality has to do with a spectral gap, it turns out that we can think of the right-hand side as a quadratic form involving the <em>infinitesimal generator</em> of Langevin process, which functions as the continuous analogue of a Laplacian for a graph random walk.</p>
<p>The following table shows the analogy: we can put the discrete and continuous processes on the same footing by defining a quadratic form called the Dirichlet form from the Laplacian or infinitesimal generator.</p>
<p><img src="http://www.andrew.cmu.edu/user/aristesk/table_mixing.jpg" alt="" /></p>
<p>To see how the Poincaré inequality represents a spectral gap in the discrete case, we write it in a more explicit form in a familiar special case: a lazy random walk (i.e. a random walk that with probability $1/2$ stays in the current vertex, and with probability $1/2$ goes to a random neighbor) on a regular graph with $n$ vertices. In this case, $p$ is the uniform distribution, and $v_1=\mathbf 1,\ldots, v_n$ are the eigenvectors of $A$ with eigenvalues $1=\lambda_1\ge \lambda_2\ge \cdots \ge \lambda_n\ge 0$; normalize $v_1,\ldots, v_n$ so they have unit norm with respect to $p$, i.e. $\Vert v_i\Vert_p^2=\frac 1n\sum_j v_{ij}^2=1$.</p>
<p>Writing $g= \sum_i a_i v_i$, since $v_2,\ldots, v_n$ are orthogonal to $v_1=\mathbf 1$, we have $\langle g, \mathbf 1\rangle_p = a_1$, so</p>
\[\text{Var}_p(g) = \frac{1}{n}(\sum_i g_i^2) - a_1^2 = \sum_{i=2}^n a_i^2\]
<p>Furthermore, we have</p>
\[\langle g, Lg \rangle_p = \langle \sum_i a_iv_i, (I- A)(\sum_i a_iv_i)\rangle_p= \sum_{i=2}^n a_i^2(1-\lambda_i)\]
<p>These coefficients $1-\lambda_i$ are all at least $1-\lambda_2$, i.e. the <em>spectral gap</em>, so</p>
\[\langle g, Lg \rangle_p \ge (1-\lambda_2)\text{Var}_p(g),\]
<p>which shows the Poincaré inequality with constant $(1-\lambda_2)^{-1}$.</p>
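This calculation is easy to verify numerically; a minimal sketch for the lazy random walk on the $n$-cycle, checking the Poincaré inequality against the spectral gap for a random test function $g$:

```python
import numpy as np

n = 10
# Lazy random walk on the n-cycle: stay put w.p. 1/2, else move to a
# uniformly random neighbor.
A = np.zeros((n, n))
for i in range(n):
    A[i, i] = 0.5
    A[i, (i - 1) % n] += 0.25
    A[i, (i + 1) % n] += 0.25

eigvals = np.sort(np.linalg.eigvalsh(A))[::-1]   # A is symmetric
gap = 1.0 - eigvals[1]                            # spectral gap 1 - lambda_2

# Poincare inequality  <g, Lg>_p >= gap * Var_p(g)  for a random g;
# here p is uniform on the n vertices and L = I - A.
rng = np.random.default_rng(0)
g = rng.standard_normal(n)
L = np.eye(n) - A
dirichlet = g @ L @ g / n                         # <g, Lg>_p
var = g.var()                                     # Var_p(g) under uniform p
```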
<p>A classic theorem establishes a Poincaré inequality for (strongly) log-concave distributions.</p>
<blockquote>
<p><strong>Theorem (Bakry, Emery 1985)</strong>: If $p(x)$ is $\alpha$-strongly log-concave, then $p(x)$ satisfies a Poincaré inequality with constant $\frac1{\alpha}$.</p>
</blockquote>
<p>Hence, for strongly-log-concave distributions, Langevin diffusion mixes rapidly. To complete the picture, a line of recent works, starting with <a href="https://arxiv.org/abs/1412.7392">(Dalalyan 2014)</a> have established bounds for discretization error to obtain algorithmic guarantees for Langevin Monte Carlo.</p>
<p>However, guarantees break down when we don’t assume log-concavity. Generically, algorithms for sampling depend <em>exponentially</em> on the ambient dimension $d$, or on the “size” of the non-log-concave region (e.g., the distance between modes of the distribution). In terms of their dependence on $d$, they are not doing much better than if we split space into cells and sample each according to its probability, similar to “grid search” for optimization. This is unsurprising: we can’t hope for better guarantees without structural assumptions.</p>
<p>Toward this end, in the next blog post we will consider two kinds of structure that allow efficient sampling:</p>
<ol>
<li>Simple multimodal distributions, such as a mixture of gaussians with equal variance.</li>
<li>Manifold structure, arising from symmetries in the level sets of the distribution.</li>
</ol>
Sat, 19 Sep 2020 07:00:00 -0700
http://offconvex.github.io/2020/09/19/beyondlogconvavesampling/
http://offconvex.github.io/2020/09/19/beyondlogconvavesampling/Training GANs - From Theory to Practice<p>GANs, originally introduced in the context of unsupervised learning, have had far-reaching implications for science, engineering, and society. However, training GANs remains challenging (in part) due to the lack of convergent algorithms for nonconvex-nonconcave min-max optimization. In this post, we present a <a href="https://arxiv.org/abs/2006.12376">new first-order algorithm</a> for min-max optimization which is particularly suited to GANs. This algorithm is guaranteed to converge to an equilibrium, is competitive in terms of time and memory with gradient descent-ascent and, most importantly, GANs trained using it seem to be stable.</p>
<h2 id="gans-and-min-max-optimization">GANs and min-max optimization</h2>
<p>Starting with the work of <a href="http://papers.nips.cc/paper/5423-generative-adversarial-nets">Goodfellow et al.</a>, Generative Adversarial Nets (GANs) have become a critical component in various ML systems; for prior posts on GANs, see <a href="https://www.offconvex.org/2018/03/12/bigan/">here</a> for a post on GAN architecture, and <a href="https://www.offconvex.org/2017/03/15/GANs/">here</a> and <a href="https://www.offconvex.org/2017/07/06/GANs3/">here</a> for posts which discuss some of the many difficulties arising when training GANs.</p>
<p>Mathematically, a GAN consists of a generator neural network $\mathcal{G}$ and a discriminator neural network $\mathcal{D}$ that are competing against each other in a way that, together, they learn the unknown distribution from which a given dataset arises. The generator takes a random “noise” vector as input and maps this vector to a sample; for instance, an image. The discriminator takes samples – “fake” ones produced by the generator and “real” ones from the given dataset – as inputs. The discriminator then tries to classify these samples as “real” or “fake”. As a designer, we would like the generated samples to be indistinguishable from those of the dataset. Thus, our goal is to choose weights $x$ for the generator network that allow it to generate samples which are difficult for <em>any</em> discriminator to tell apart from real samples. This leads to a min-max optimization problem where we look for weights $x$ which <em>minimize</em> the rate (measured by a loss function $f$) at which any discriminator correctly classifies the real and fake samples. And, we seek weights $y$ for the discriminator network which <em>maximize</em> this rate.</p>
<blockquote>
<p><strong>Min-max formulation of GANs</strong> <br /> <br /></p>
\[\min_x \max_y f(x,y),\]
\[f(x,y) := \mathbb{E}[ f_{\zeta, \xi}(x,y)],\]
<p>where $\zeta$ is a random sample from the dataset, and $\xi \sim N(0,I_d)$ is a noise vector which the generator maps to a “fake” sample. $f_{\zeta, \xi}$ measures how accurately the discriminator $\mathcal{D}(y;\cdot)$ distinguishes $\zeta$ from $\mathcal{G}(x;\xi)$ produced by the generator using the input noise $\xi$.</p>
</blockquote>
<p>In this formulation, there are several choices that we have to make as a GAN designer, and an important one is that of a loss function. One concrete choice is from the paper of Goodfellow et al.: the cross-entropy loss function:</p>
\[f_{\zeta, \xi}(x,y) := \log(\mathcal{D}(y;\zeta)) + \log(1-\mathcal{D}(y;\mathcal{G}(x;\xi)))\]
<p>See <a href="https://machinelearningmastery.com/generative-adversarial-network-loss-functions/">here</a> for a summary and comparison of different loss functions.</p>
<p>Once we fix the loss function (and the architecture of the generator and discriminator), we can compute unbiased estimates of the value of $f$ and its gradients $\nabla_x f$ and $\nabla_y f$ using batches consisting of random Gaussian noise vectors $\xi_1,\ldots, \xi_n \sim N(0,I_d)$ and random samples from the dataset $\zeta_1, \ldots, \zeta_n$. For example, the stochastic batch gradient</p>
\[\frac{1}{n} \sum_{i=1}^n \nabla_x f_{\zeta_i, \xi_i}(x,y)\]
<p>gives us an unbiased estimate for $\nabla_x f(x,y)$.</p>
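<p>As a concrete sketch of this estimator (with <code>grad_fn</code> a hypothetical stand-in for a routine returning the per-sample gradient $\nabla_x f_{\zeta,\xi}(x,y)$; names and batch size are illustrative):</p>

```python
import numpy as np

def batch_grad_x(grad_fn, x, y, data, rng, n=64, noise_dim=None):
    """Unbiased estimate of grad_x f(x, y): average the per-sample
    gradients over a batch of (data sample, Gaussian noise) pairs.

    `grad_fn(x, y, zeta, xi)` is assumed to return grad_x f_{zeta,xi}(x, y).
    """
    if noise_dim is None:
        noise_dim = x.shape[0]
    idx = rng.integers(0, len(data), size=n)    # random dataset samples zeta_1..zeta_n
    xis = rng.standard_normal((n, noise_dim))   # noise vectors xi_1..xi_n
    grads = [grad_fn(x, y, data[i], xi) for i, xi in zip(idx, xis)]
    return np.mean(grads, axis=0)
```

The same batch, with $\nabla_y f_{\zeta_i,\xi_i}$ in place of $\nabla_x f_{\zeta_i,\xi_i}$, gives the estimate for $\nabla_y f(x,y)$.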
<blockquote>
<p>But how do we solve the min-max optimization problem above using such a first-order access to $f$?</p>
</blockquote>
<h2 id="gradient-descent-ascent-and-variants">Gradient descent-ascent and variants</h2>
<p>Perhaps the simplest algorithm we can try for min-max optimization is gradient descent-ascent (GDA). As the generator wants to minimize with respect to $x$ and the discriminator wants to maximize with respect to $y$, the idea is to do descent steps for $x$ and ascent steps for $y$. How exactly to do this is not clear, and one strategy is to let the generator and discriminator alternate:</p>
\[x_{i+1} = x_i -\nabla_x f(x_i,y_i),\]
\[y_{i+1} = y_i +\nabla_y f(x_i,y_i).\]
<p>Other variants include, for instance, <a href="https://arxiv.org/abs/1311.1869">optimistic mirror descent</a> (OMD) (see also <a href="https://arxiv.org/abs/1807.02629">here</a> and <a href="https://arxiv.org/abs/1711.00141">here</a> for applications of OMD to GANs, and <a href="https://arxiv.org/abs/1901.08511">here</a> for an analysis of OMD and related methods)</p>
\[x_{i+1} = x_i -2\nabla_x f(x_i,y_i) + \nabla_x f(x_{i-1},y_{i-1})\]
\[y_{i+1} = y_i +2\nabla_y f(x_i,y_i)- \nabla_y f(x_{i-1},y_{i-1}).\]
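<p>In code, the two update rules look as follows (the displayed equations above use an implicit unit step size; the sketch below adds an explicit learning rate, which is how these methods are run in practice):</p>

```python
import numpy as np

def gda_step(x, y, grad_x, grad_y, lr=0.1):
    """One step of simultaneous gradient descent-ascent."""
    return x - lr * grad_x(x, y), y + lr * grad_y(x, y)

def omd_step(x, y, x_prev, y_prev, grad_x, grad_y, lr=0.1):
    """One step of optimistic mirror descent: a gradient step plus a
    correction term using the previous iterate's gradient."""
    new_x = x - lr * (2 * grad_x(x, y) - grad_x(x_prev, y_prev))
    new_y = y + lr * (2 * grad_y(x, y) - grad_y(x_prev, y_prev))
    return new_x, new_y
```

On the bilinear objective $f(x,y)=xy$ discussed below, GDA run with these updates spirals away from the origin, while OMD contracts toward it.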
<p>The advantage of such algorithms is that they are quite practical. The problem, as we discuss next, is that they are not always guaranteed to converge. Most of these guarantees only hold for special classes of loss functions $f$ that satisfy properties such as concavity (see <a href="https://papers.nips.cc/paper/9430-efficient-algorithms-for-smooth-minimax-optimization.pdf">here</a> and <a href="https://arxiv.org/abs/1906.00331">here</a>) or <a href="https://papers.nips.cc/paper/9631-solving-a-class-of-non-convex-min-max-games-using-iterative-first-order-methods.pdf">monotonicity</a>, or under the assumptions that these algorithms are provided with special starting points (see <a href="https://arxiv.org/abs/1706.08500">here</a>, <a href="https://arxiv.org/abs/1910.07512">here</a>).</p>
<h2 id="convergence-problems-with-current-algorithms">Convergence problems with current algorithms</h2>
<p>Unfortunately there are simple functions for which some min-max optimization algorithms may never converge to <em>any</em> point. For instance GDA may not converge on $f(x,y) = xy$ (see Figure 1, and our <a href="https://www.offconvex.org/2020/06/24/equilibrium-min-max/">previous post</a> for a more detailed discussion).</p>
<div>
<img src="/assets/GDA_spiral_2.gif" alt="" />
<br />
<b>Figure 1.</b> GDA on $f(x,y) = xy, \, \, \, \, x,y \in [-5,5]$ (the red line is the set of global min-max points). GDA is non-convergent from almost every initial point.
</div>
<p><br /></p>
<p>As for examples relevant to ML, when using GDA to train a GAN on a dataset consisting of points sampled from a mixture of four Gaussians in $\mathbb{R}^2$, we observe that GDA tends to cause the generator to cycle between different modes corresponding to the four Gaussians. We also used GDA to train a GAN on the subset of the MNIST digits which have “0” or “1” as their label, which we refer to as the 0-1 MNIST dataset. We observed a cycling behavior for this dataset as well: After learning how to generate images of $0$’s, the GAN trained by GDA then forgets how to generate $0$’s for a long time and only generates $1$’s.</p>
<div>
<img style="width:400px;" src="/assets/GDA_Gaussian.gif" alt="" />
<img style="width:400px;" src="/assets/GDA_MNIST.gif" alt="" />
<br />
<b>Figure 2.</b> Mode oscillation when GDA is used to train GANs on the four Gaussian mixture dataset (left) and the 0-1 MNIST dataset (right).
</div>
<p><br /></p>
<p>In algorithms such as GDA where the discriminator only makes local updates, cycling can happen for the following reason: Once the discriminator learns to identify one of the modes (say mode “A”), the generator can update $x$ in a way that greatly decreases $f$, by (at least temporarily) “fooling” the discriminator. The generator does this by learning to generate samples from a different mode (say mode “B”) which the discriminator has not yet learned to identify, and stops generating samples from mode A. However, after many iterations, the discriminator “catches up” to the generator and learns how to identify mode B. Since the generator is no longer generating samples from mode A, the discriminator may then “forget” how to identify samples from this mode. And this can cause the generator to switch back to generating only mode A.</p>
<h2 id="our-first-order-algorithm">Our first-order algorithm</h2>
<p>To solve the min-max optimization problem, at any point $(x,y)$, we should ideally allow the discriminator to find the global maximum, $\max_z f(x,z)$. However, this may be hard for nonconcave $f$. But we could still let the discriminator run a convergent algorithm (such as gradient ascent) until it reaches a <strong>first-order stationary point</strong>, allowing it to compute an approximation $h$ for the global max function. (Note that even though $\max_z f(x,z)$ is only a function of $x$, since $h$ is a “local’’ approximation it could also depend on the initial point $y$ where we start gradient ascent.) And we also empower the generator to simulate the discriminator’s update by running gradient ascent (see <a href="https://arxiv.org/abs/2006.12376">our paper</a> for discriminators with access to a more general class of first-order algorithms).</p>
<blockquote>
<p><strong>Idea 1: Use a local approximation to global max</strong>
<br /><br />
Starting at the point $(x,y)$, update $y$ by computing multiple gradient ascent steps for $y$ until a point $w$ is reached where \(\|\nabla_y f(x,w)\|\) is close to zero and define $h(x,y) := f(x,w)$.</p>
</blockquote>
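<p>A minimal sketch of Idea 1 in numpy (step size, tolerance, and iteration cap are placeholder choices, not the paper's): run gradient ascent on $f(x,\cdot)$ from $y$ until the gradient is small, and return the value there as $h(x,y)$ along with the stationary point $w$.</p>

```python
import numpy as np

def local_max_value(f, grad_y, x, y, lr=0.05, tol=1e-4, max_iter=10000):
    """Approximate h(x, y): gradient ascent on f(x, .) starting from y,
    stopping once ||grad_y f(x, w)|| is close to zero."""
    w = y.copy()
    for _ in range(max_iter):
        g = grad_y(x, w)
        if np.linalg.norm(g) <= tol:
            break
        w = w + lr * g
    return f(x, w), w
```

Note that the returned value depends on the starting point $y$, which is exactly why $h$ is a "local" approximation of the global max function.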
<p>We would like the generator to minimize $h(\cdot,y)$. To minimize $h$, we would ideally like to update $x$ in the direction $-\nabla_x h$. However, $h$ may be discontinuous in $x$ (see our <a href="https://www.offconvex.org/2020/06/24/equilibrium-min-max/">previous post</a> for why this can happen). Moreover, even at points where $h$ is differentiable, computing the gradient of $h$ can take a long time and requires a large amount of memory.</p>
<p>Thus, realistically, we only have access to the value of $h$. A naive approach to minimizing $h$ would be to propose a random update to $x$, for instance an update sampled from a standard Gaussian, and then only accept this update if it causes the value of $h$ to decrease. Unfortunately, this does not lead to fast algorithms as even at points where $h$ is differentiable, in high dimensions, a random Gaussian step will be almost orthogonal to the steepest descent direction $-\nabla_x h(x,y)$, making the progress slow.</p>
<p>Another idea is to have the generator propose at each iteration an update in the direction of the gradient $-\nabla_x f(x,y)$, and to then have the discriminator update $y$ using gradient ascent. To see why this may be a reasonable thing to do, notice that once the generator proposes an update $v$ to $x$, the discriminator will only make updates which increase the value of $f$, so $h(x+v,y) \geq f(x+v,y)$. And, since $y$ is a first-order stationary point for $f(x, \cdot)$ (because $y$ was computed using gradient ascent in the <em>previous</em> iteration), we also have that $h(x,y)=f(x,y)$. Hence,</p>
\[f(x+v,y) \leq h(x+v,y) < h(x,y) = f(x,y).\]
<p><em>This means that decreasing $h$ requires us to decrease $f$ (the converse is not true). So it indeed makes sense to move in the direction $-\nabla_x f(x,y)$!</em></p>
<p>While making updates using $-\nabla_x f(x,y)$ may allow the generator to decrease $h$ more quickly than updating in a random direction, it is not always the case that updating in the direction of $-\nabla_x f$ will lead to a decrease in $h$ (and doing so may even lead to an increase in $h$!). Instead, our algorithm has the generator perform a random search by proposing an update in the direction of a batch gradient with mean $-\nabla_x f$, and accepts this move only if the value of $h$ (the local approximation) decreases. The accept-reject step prevents our algorithm from cycling between modes, and using the batch gradient for the random search allows our algorithm to be competitive with prior first-order methods in terms of running time.</p>
<blockquote>
<p><strong>Idea 2: Use zeroth-order optimization with batch gradients</strong>
<br /><br />
Sample a batch gradient $v$ with mean $-\nabla_x f(x,y)$.
<br />
If $h(x+ v, y) < h(x,y) $ accept the step $x+v$; otherwise reject it.</p>
</blockquote>
<p>A final issue, which applies even in the special case of minimization, is that converging to a <em>local</em> minimum point does not mean that point is desirable from an application standpoint. The same is true in the more general setting of min-max optimization. To help our algorithm escape undesirable local min-max equilibria, we use a randomized accept-reject rule inspired by <a href="https://towardsdatascience.com/optimization-techniques-simulated-annealing-d6a4785a1de7">simulated annealing</a>. Simulated annealing algorithms seek to minimize a function via a randomized search, while gradually decreasing the acceptance probability of this search; in some cases this allows one to reach the global minimum of a nonconvex function (see for instance <a href="https://arxiv.org/abs/1711.02621">this paper</a>). These three ideas lead us to our algorithm.</p>
<blockquote>
<p><strong>Our algorithm</strong>
<br /><br />
<em>Input</em>: Initial point $(x,y)$, $f: \mathbb{R}^d \times \mathbb{R}^d\rightarrow \mathbb{R}$
<br />
<em>Output:</em> A local min-max equilibrium $(x,y)$</p>
<p><br /> <br /></p>
<p>For $i = 1,2, \ldots$ <br />
<br />
<strong>Step 1:</strong> Generate a batch gradient $v$ with mean $-\nabla_x f(x,y)$ and propose the generator update $x+v$.
<br /><br />
<strong>Step 2:</strong> Compute $h(x+v, y) = f(x+v, w)$, by simulating a discriminator update $w$ via gradient ascent on $f(x+v, \cdot)$ starting at $y$.
<br /><br />
<strong>Step 3:</strong> If $h(x+v, y)$ is less than $h(x,y) = f(x,y)$, accept both updates: $(x,y) = (x+v, w)$. Else, accept both updates with some small probability.</p>
</blockquote>
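<p>The three steps above can be sketched in a few lines of numpy. This is a toy illustration under stated assumptions, not the paper's implementation: step sizes, iteration counts, the noise scale, and the fixed annealing acceptance probability are all placeholder choices.</p>

```python
import numpy as np

def train_minmax(f, grad_x, grad_y, x, y, rng, iters=200,
                 lr=0.05, noise=0.01, accept_prob=0.05):
    """Sketch of the algorithm: propose a noisy gradient step for the
    generator, simulate the discriminator's gradient-ascent response to
    evaluate h, and accept only if h decreases (else accept with a small
    simulated-annealing probability)."""

    def ascend(x_, w):                 # discriminator response: gradient ascent on f(x_, .)
        for _ in range(200):
            g = grad_y(x_, w)
            if np.linalg.norm(g) < 1e-6:
                break
            w = w + lr * g
        return w

    y = ascend(x, y)                   # start y at a stationary point of f(x, .)
    for _ in range(iters):
        # Step 1: stochastic proposal with mean -lr * grad_x f(x, y)
        v = -lr * grad_x(x, y) + noise * rng.standard_normal(x.shape)
        # Step 2: simulate the discriminator's update to get h(x + v, y) = f(x + v, w)
        w = ascend(x + v, y)
        # Step 3: accept-reject, with occasional annealing-style acceptance
        if f(x + v, w) < f(x, y) or rng.random() < accept_prob:
            x, y = x + v, w
    return x, y
```

On a simple objective such as $f(x,y) = \|x\|^2 - \|y\|^2$, this loop drives both players toward the equilibrium at the origin.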
<p>In our paper, we show that our algorithm is guaranteed to converge to a type of local min-max equilibrium in $\mathrm{poly}(\frac{1}{\varepsilon},d, b, L)$ time whenever $f$ is bounded by some $b>0$ and has $L$-Lipschitz gradients. Our algorithm does not require any special starting points, or any additional assumptions on $f$ such as convexity or monotonicity. (See Definition 3.2 and Theorem 3.3 in our paper.)</p>
<div>
<img style="width:400px;" src="/assets/GDA_spiral_2.gif" alt="" />
<img style="width:400px;" src="/assets/OurAlgorithm_surface_run1.gif" alt="" />
<br />
<b>Figure 3.</b> GDA (left) and a version of our algorithm (right) on $f(x,y) = xy, \, \, \, \, x,y \in [-5,5]$. While GDA is non-convergent from almost every initial point, our algorithm converges to the set of global min-max points (the red line). To ensure it converges to a (local) equilibrium, our algorithm's generator proposes multiple updates, simulates the discriminator's response, and rejects updates which do not lead to a net decrease in $f$. It only stops if it can't find such an update after many attempts. (To stay inside $[-5,5]\times [-5,5]$ this version of our algorithm uses <i>projected</i> gradients.)
</div>
<p><br /></p>
<h2 id="so-how-does-our-algorithm-perform-in-practice">So, how does our algorithm perform in practice?</h2>
<p>When training a GAN on the mixture of four Gaussians dataset, we found that our algorithm avoids the cycling behavior observed with GDA. We ran each algorithm multiple times and evaluated the results visually. By the 1500’th iteration, GDA had learned only one mode in 100% of the runs, and tended to cycle between two or more modes. In contrast, our algorithm learned all four modes in 68% of the runs, and three modes in 26% of the runs.</p>
<div>
<img src="/assets/Both_algorithms_Gaussian.gif" alt="" />
<br />
<b>Figure 4.</b> GAN trained using GDA and our algorithm on a four Gaussian mixture dataset. While GDA cycles between the Gaussian modes (red dots), our algorithm learns all four modes.
</div>
<p><br /></p>
<p>When training on the 0-1 MNIST dataset, we found that GDA tends to briefly generate shapes that look like a combination of $0$’s and $1$’s, then switches to generating only $1$’s, and then re-learns how to generate $0$’s. In contrast, our algorithm seems to learn how to generate both $0$’s and $1$’s early on and does not stop generating either digit. We repeated this simulation multiple times for both algorithms, and visually inspected the images at the 1000’th iteration. GANs trained using our algorithm generated both digits by the 1000’th iteration in 86% of the runs, while those trained using GDA only did so in 23% of the runs.</p>
<div>
<img src="/assets/MNIST_bothAlgorithms.gif" alt="" />
<br />
<b>Figure 5.</b> We trained a GAN with GDA and our algorithm on the
0-1 MNIST dataset. During the first 1000 iterations, GDA (left)
forgets how to generate $0$'s, while our algorithm (right) learns how to
generate both $0$'s and $1$'s early on and does not stop generating either digit.
</div>
<p><br /></p>
<p>While here we have focused on comparing our algorithm to GDA, in our paper we also include a comparison to <a href="https://arxiv.org/abs/1611.02163">Unrolled GANs</a>, which exhibits cycling between modes. We also present results for CIFAR-10 (see Figures 3 and 7 in our paper), where we compute FID scores to track the progress of our algorithm. See our paper for more details; the code is available on <a href="https://github.com/mangoubi/Min-max-optimization-algorithm-for-training-GANs">GitHub</a>.</p>
<h2 id="conclusion">Conclusion</h2>
<p>In this post we have shown how to develop a practical and convergent first-order algorithm for training GANs. Our algorithm synthesizes an approximation to the global max function based on first-order algorithms, random search using batch gradients, and simulated annealing. Our simulations show that a version of this algorithm can lead to more stable training of GANs. And yet the amount of memory and time required by each iteration of our algorithm is competitive with GDA. This post, together with the <a href="https://www.offconvex.org/2020/06/24/equilibrium-min-max/">previous post</a>, show that different local approximations to the global max function $\max_z f(x,z)$ can lead to different types of convergent algorithms for min-max optimization. We believe that this idea should be useful in other applications of min-max optimization.</p>
Mon, 06 Jul 2020 02:00:00 -0700
http://offconvex.github.io/2020/07/06/GAN-min-max/
http://offconvex.github.io/2020/07/06/GAN-min-max/
An equilibrium in nonconvex-nonconcave min-max optimization<p>While there has been incredible progress in convex and nonconvex minimization, a multitude of problems in ML today are in need of efficient algorithms to solve min-max optimization problems.
Unlike minimization, where algorithms can always be shown to converge to some local minimum, there is no notion of a local equilibrium in min-max optimization that exists for general nonconvex-nonconcave functions.
In two recent papers, we give two notions of local equilibria that are guaranteed to exist and efficient algorithms to compute them.
In this post we present the key ideas behind a second-order notion of local min-max equilibrium from <a href="https://arxiv.org/abs/2006.12363">this paper</a> and in the next we will talk about a different notion along with the algorithm and show its implications to GANs from <a href="https://arxiv.org/abs/2006.12376">this paper</a>.</p>
<h2 id="min-max-optimization">Min-max optimization</h2>
<p>Min-max optimization of an objective function $f:\mathbb{R}^d \times \mathbb{R}^d \rightarrow \mathbb{R}$</p>
\[\min_x \max_y f(x,y)\]
<p>is a powerful framework in optimization, economics, and ML as it allows one to model learning in the presence of multiple agents with competing objectives.
In ML applications, such as <a href="https://arxiv.org/abs/1406.2661">GANs</a> and <a href="https://adversarial-ml-tutorial.org">adversarial robustness</a>, the min-max objective function may be nonconvex-nonconcave.
Since min-max optimization is at least as hard as minimization, we cannot hope to find a globally optimal solution to min-max problems for general functions.</p>
<h2 id="approximate-local-minima-for-minimization">Approximate local minima for minimization</h2>
<p>Let us first revisit the special case of minimization, where there is a natural notion of an approximate second-order local minimum.</p>
<blockquote>
<p>$x$ is a second-order $\varepsilon$-local minimum of $\mathcal{L}:\mathbb{R}^d\rightarrow \mathbb{R}$ if
\(\|\nabla \mathcal{L}(x)\| \leq \varepsilon \ \ \mathrm{and} \ \ \nabla^2 \mathcal{L}(x) \succeq -\sqrt{\varepsilon}.\)</p>
</blockquote>
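<p>Given the gradient and Hessian at a point, this condition is straightforward to check numerically; a small sketch (the function names are ours, not from the paper):</p>

```python
import numpy as np

def is_eps_local_min(grad, hess, eps):
    """Second-order eps-local-minimum check: the gradient norm is at most
    eps, and every Hessian eigenvalue is at least -sqrt(eps)."""
    small_grad = np.linalg.norm(grad) <= eps
    lam_min = np.linalg.eigvalsh(hess).min()   # smallest eigenvalue (symmetric Hessian)
    return small_grad and lam_min >= -np.sqrt(eps)
```

For example, the origin of $\mathcal{L}(x) = \|x\|^2$ passes the check, while a strict saddle (Hessian with a negative eigenvalue of order one) fails it.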
<p>Now suppose we just wanted to minimize a function $\mathcal{L}$, and we start from any point which is <em>not</em> at an $\varepsilon$-local minimum of $\mathcal{L}$.
Then we can always find a direction to travel in along which either $\mathcal{L}$ decreases rapidly, or the second derivative of $\mathcal{L}$ is large.
By searching in such a direction we can easily find a new point which has a smaller value of $\mathcal{L}$ using only local information about the gradient and Hessian of $\mathcal{L}$.
This means that we can keep decreasing $\mathcal{L}$ until we reach an $\varepsilon$-local minimum (see <a href="https://www.researchgate.net/profile/Boris_Polyak2/publication/220589612_Cubic_regularization_of_Newton_method_and_its_global_performance/links/09e4150dd2f0320879000000/Cubic-regularization-of-Newton-method-and-its-global-performance.pdf">Nesterov and Polyak</a>, <a href="https://dl.acm.org/doi/10.1145/3055399.3055464">here</a>, <a href="http://proceedings.mlr.press/v40/Ge15.pdf">here</a>, and also an earlier <a href="https://www.offconvex.org/2016/03/22/saddlepoints">blog post</a> for how to do this with only access to gradients of $\mathcal{L}$).
If $\mathcal{L}$ is Lipschitz smooth and bounded, we will reach an $\varepsilon$-local minimum in polynomial time from any starting point.</p>
<blockquote>
<p>Is there an analogous definition with similar properties for min-max optimization?</p>
</blockquote>
<h2 id="problems-with-current-local-optimality-notions">Problems with current local optimality notions</h2>
<p>There has been much recent work on extending theoretical results in nonconvex minimization to min-max optimization (see <a href="https://arxiv.org/abs/1906.00331">here</a>, <a href="https://papers.nips.cc/paper/9430-efficient-algorithms-for-smooth-minimax-optimization">here</a>, <a href="https://arxiv.org/pdf/1807.02629.pdf">here</a>, <a href="https://papers.nips.cc/paper/9631-solving-a-class-of-non-convex-min-max-games-using-iterative-first-order-methods.pdf">here</a>, <a href="https://arxiv.org/abs/1910.07512">here</a>).
One way to extend the notion of local minimum to the min-max setting is to seek a solution point called a “local saddle”: a point $(x,y)$ where 1) $y$ is a local maximum for $f(x, \cdot)$, and 2) $x$ is a local minimum for $f(\cdot, y)$.</p>
<p>For instance,
this is used <a href="https://arxiv.org/abs/1706.08500">here</a>, <a href="https://arxiv.org/pdf/1901.00838.pdf">here</a>, <a href="https://arxiv.org/pdf/1705.10461.pdf">here</a>, and <a href="http://proceedings.mlr.press/v89/adolphs19a.html">here</a>.
But, there are very simple examples of two-dimensional bounded functions where a local saddle does not exist.</p>
<blockquote>
<p>For instance, consider $f(x,y) = \sin(x+y)$ from <a href="https://arxiv.org/abs/1902.00618">here</a>. Check that no point of this function is simultaneously a local minimum in $x$ and a local maximum in $y$.</p>
</blockquote>
<p>The fact that no local saddle exists may be surprising, since an $\varepsilon$-global solution to a min-max optimization problem <em>is</em> guaranteed to exist as long as the objective function is uniformly bounded.
Roughly, this is because, in a global min-max setting, the max-player is empowered to globally maximize the function $f(x,\cdot)$, and the min-player is empowered to minimize the “global max” function $\max_y(f(\cdot, y))$.</p>
<p>The ability to compute the global max allows the min-player to predict the max-player’s response.
If $x$ is a global minimum of $\max_y(f(\cdot, y))$, the min-player is aware of this fact and will have no incentive to update $x$.
On the other hand, if the min-player can only simulate the max-player’s updates locally (as in local saddle),
then the min-player may try to update her strategy even when it leads to a net increase in $f$.
This can happen because the min-player is not powerful enough to accurately simulate the max-player’s response. (See a <a href="https://arxiv.org/abs/1902.00618">related notion</a> of local optimality with similar issues due to vanishingly small updates.)</p>
<p>The fact that players who can only make local predictions are
unable to predict their opponents’ responses can lead to convergence problems in many popular algorithms such as<br />
gradient descent-ascent (GDA). This non-convergence behavior can occur if the function has no local saddle point (e.g. the function $\sin(x+y)$ mentioned above), and can even happen for some functions, like $f(x,y) = xy$, which do have a local saddle point.</p>
<div style="text-align:center;">
<img src="/assets/GDA_spiral_fast.gif" alt="" />
<br />
<b>Figure 1.</b> GDA spirals off to infinity from almost every starting point on the objective function $f(x,y) = xy$.
</div>
<p><br /></p>
<h2 id="greedy-max-a-computationally-tractable-alternative-to-global-max">Greedy max: a computationally tractable alternative to global max</h2>
<p>To allow for a more stable min-player, and a more stable notion of local optimality, we would like to empower the min-player to more effectively simulate the max-player’s response.
While the notion of global min-max does exactly this by having the min-player compute the global max function $\max_y(f(\cdot,y))$, computing the global maximum may be intractable.</p>
<p>Instead, we replace the global max function $\max_y (f(\cdot ,y))$ with a computationally tractable alternative.
Towards this end, we restrict the max-player’s response, and the min-player’s simulation of this response, to updates which can be computed using any algorithm from a class of second-order optimization algorithms.
More specifically, we restrict the max-player to updating $y$ by traveling along continuous paths which start at the current value of $y$ and along which either $f$ is increasing or the second derivative of $f$ is positive. We refer to such paths as greedy paths since they model a class of second-order “greedy” optimization algorithms.</p>
<blockquote>
<p><strong>Greedy path:</strong> A unit-speed path $\varphi:[0,\tau] \rightarrow \mathbb{R}^d$ is greedy if $f$ is non-decreasing over this path, and for every $t\in[0,\tau]$
\(\frac{\mathrm{d}}{\mathrm{d}t} f(x, \varphi(t)) > \varepsilon \ \ \textrm{or} \ \ \frac{\mathrm{d}^2}{\mathrm{d}t^2} f(x, \varphi(t)) > \sqrt{\varepsilon}.\)</p>
</blockquote>
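<p>A rough numerical intuition for this condition: along a candidate unit direction, either the first directional derivative must exceed $\varepsilon$, or the second must exceed $\sqrt{\varepsilon}$. The finite-difference sketch below (our own illustration, with an illustrative step size) checks this pointwise condition for a single direction; it does not verify the monotonicity of $f$ along an entire path.</p>

```python
import numpy as np

def greedy_direction_ok(f_y, y, u, eps, delta=1e-4):
    """Check, by central finite differences, whether moving from y in unit
    direction u satisfies the pointwise greedy-path condition: the first
    directional derivative of f(x, .) exceeds eps, or the second exceeds
    sqrt(eps). `f_y(y)` is f(x, y) for the fixed x under consideration."""
    f0, fp, fm = f_y(y), f_y(y + delta * u), f_y(y - delta * u)
    d1 = (fp - fm) / (2 * delta)            # first directional derivative
    d2 = (fp - 2 * f0 + fm) / delta ** 2    # second directional derivative
    return d1 > eps or d2 > np.sqrt(eps)
```

For $f(x,\cdot) = \|y\|^2$ the condition holds even at the origin (via the curvature term), while for $-\|y\|^2$ it fails there, matching the picture in Figure 2.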
<p>Roughly speaking, when restricted to updates obtained from greedy paths, the max-player will always be able to reach a point which is an approximate local maximum for $f(x,\cdot)$, although there may not be a greedy path which leads the max-player to a global maximum.</p>
<div style="text-align:center;">
<img style="width:400px;" src="/assets/greedy_region_omega_t.png" alt="" /> <img style="width:400px;" src="/assets/global_max_path_no_axes_t.png" alt="" />
<br />
<b>Figure 2.</b> <i>Left:</i> The light-colored region $\Omega$ is reachable from the initial point $A$ by a greedy path; the dark region is not reachable. <i>Right:</i> There is always a greedy path from any point $A$ to a local maximum ($B$), but a global maximum ($C$) may not be reachable by any greedy path.
</div>
<p><br /></p>
<p>To define an alternative to $\max_y(f(\cdot,y))$, we consider the local maximum point with the largest value of $f(x,\cdot)$ attainable from a given starting point $y$ by any greedy path.
We refer to the value of $f$ at this point as the <em>greedy max function</em>, and denote this value by $g(x,y)$.</p>
<blockquote>
<p><strong>Greedy max function:</strong>
$g(x,y) = \max_{z \in \Omega} f(x,z),$
where $\Omega$ is the set of points reachable from $y$ by a greedy path.</p>
</blockquote>
<h2 id="our-greedy-min-max-equilibrium">Our greedy min-max equilibrium</h2>
<p>We use the greedy max function to define a new second-order notion of local optimality for min-max optimization, which we refer to as a greedy min-max equilibrium.
Roughly speaking, we say that $(x,y)$ is a greedy min-max equilibrium if
1) $y$ is a local maximum for $f(x,\cdot)$ (and hence the endpoint of a greedy path), and
2) $x$ is a local minimum of the greedy max function $g(\cdot,y)$.</p>
<p>In other words, $x$ is a local minimum of $\max_y f(\cdot, y)$ under the constraint that the maximum is computed only over the set of greedy paths starting at $y$.
Unfortunately, even if $f$ is smooth, the greedy max function may not be differentiable with respect to $x$ and may even be discontinuous.</p>
<div style="text-align:center;">
<img src="/assets/discontinuity2_grid_t.png" width="400" alt="" /> <img src="/assets/discontinuity2g_grid_t.png" width="400" alt="" />
<br />
<b>Figure 3.</b> <i>Left:</i> If we change $x$ from one value $x$ to a very close value $\hat{x}$, the largest value of $f$ reachable by greedy path undergoes a discontinuous change. <i>Right:</i> This means the greedy max function $g(x,y)$ is discontinuous in $x$.</div>
<p><br /></p>
<p>This creates a problem, since the definition of $\varepsilon$-local minimum only applies to smooth functions.</p>
<p>To solve this problem we would ideally like to smooth $g$ by convolution with a Gaussian.
Unfortunately, convolution can cause the local minima of a function to “shift”: a point which is a local minimum for $g$ may no longer be a local minimum for the convolved version of $g$ (to see why, try convolving the function $f(x) = x - 3x I(x\leq 0) + I(x \leq 0)$ with a Gaussian $N(0,\sigma^2)$ for any $\sigma>0$).
To avoid this, we instead consider a “truncated” version of $g$, and then convolve this function in the $x$ variable with a Gaussian to obtain our smoothed version of $g$.</p>
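<p>The smoothing itself is a Gaussian convolution in the $x$ variable, which can be approximated by Monte Carlo sampling; a hedged sketch (the smoothing width $\sigma$ and sample count are illustrative, and we omit the truncation step):</p>

```python
import numpy as np

def smooth_in_x(g, x, y, sigma=0.1, n=4000, rng=None):
    """Monte Carlo estimate of the Gaussian smoothing of g in x:
    E[g(x + sigma * Z, y)] with Z ~ N(0, I)."""
    if rng is None:
        rng = np.random.default_rng(0)
    zs = rng.standard_normal((n,) + x.shape)
    return np.mean([g(x + sigma * z, y) for z in zs])
```

As a sanity check, for $g(x,y)=\|x\|^2$ the smoothed value is $\|x\|^2 + \sigma^2 d$, up to sampling error.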
<p>This allows us to define a notion of greedy min-max equilibrium. We say that a point $(x^\star, y^\star)$ is a greedy min-max equilibrium if $y^\star$ is an approximate local maximum of $f(x^\star, \cdot)$, and $x^\star$ is an $\varepsilon$-local minimum of this smoothed version of $g(\cdot, y^\star)$.</p>
<blockquote>
<p><b>Greedy min-max equilibrium:</b>
$(x^{\star}, y^{\star})$ is an $\varepsilon$-greedy min-max equilibrium if
<br />
\(\|\nabla_y f(x^\star,y^\star)\| \leq \varepsilon, \qquad \nabla^2_y f(x^\star,y^\star) \preceq \sqrt{\varepsilon},\)
<br />
\(\|\nabla_x S(x^{\star},y^{\star})\| \leq \varepsilon \qquad \nabla^2_x S(x^{\star},y^{\star}) \succeq -\sqrt{\varepsilon}, \\\)
<br />
where $S(x,y):= \mathrm{smooth}_x(\mathrm{truncate}(g(x, y)))$.</p>
</blockquote>
<p>Any point which is a local saddle point (discussed earlier) also satisfies our equilibrium conditions. The converse, however, does not hold, since a local saddle point may not always exist. Further, for compactly supported convex-concave functions a point is a greedy min-max equilibrium (in an appropriate sense) if and only if it is a global min-max point. (See Section 7 and Appendix A respectively in <a href="https://arxiv.org/abs/2006.12363">our paper</a>.)</p>
<h2 id="greedy-min-max-equilibria-always-exist-and-can-be-found-efficiently">Greedy min-max equilibria always exist! (And can be found efficiently)</h2>
<p>In <a href="https://arxiv.org/abs/2006.12363">this paper</a> we show: A greedy min-max equilibrium is always guaranteed to exist provided that $f$ is uniformly bounded with Lipschitz Hessian. We do so by providing an algorithm which converges to a greedy min-max equilibrium, and, moreover, we show that it is able to do this in polynomial time from any initial point:</p>
<blockquote>
<p><b>Main theorem:</b> Suppose that we are given access to a smooth function $f:\mathbb{R}^d \times \mathbb{R}^d \rightarrow \mathbb{R}$ and to its gradient and Hessian. And suppose that $f$ is uniformly bounded by $b>0$ and has $L$-Lipschitz Hessian.
Then given any initial point, our algorithm returns an $\varepsilon$-greedy min-max equilibrium $(x^\star,y^\star)$ of $f$ in $\mathrm{poly}(b, L, d, \frac{1}{\varepsilon})$ time.</p>
</blockquote>
<p>There are a number of difficulties that our algorithm and proof must overcome:
One difficulty in designing an algorithm is that the greedy max function may be discontinuous.
To find an approximate local minimum of a discontinuous function, our algorithm combines a Monte-Carlo hill climbing algorithm with a <a href="https://arxiv.org/abs/cs/0408007">zeroth-order optimization version</a> of stochastic gradient descent.
Another difficulty is that, while one can easily compute a greedy path from any starting point, there may be many different greedy paths which end up at different local maxima.
Searching for the greedy path which leads to the local maximum point with the largest value of $f$ may be infeasible.
In other words the greedy max function $g$ may be intractable to compute.</p>
<div style="text-align:center;">
<img src="/assets/greedy_paths_no_axes_t.png" width="400" alt="" />
<br />
<b>Figure 4.</b> There are many different greedy paths that start at the same point $A$. They can end up at different local maxima ($B$, $D$), with different values of $f$. In many cases it may be intractable to search over all these paths to compute the greedy max function.
</div>
<p><br /></p>
<p>To get around this problem, rather than computing the exact value of $g(x,y)$, we instead compute a lower bound $h(x,y)$ for the greedy max function. Since we are able to obtain this lower bound by computing only a <em>single</em> greedy path, it is much easier to compute than the greedy max function.</p>
<p>In our paper, we prove that if 1) $x^\star$ is an approximate local minimum for this lower bound $h(\cdot, y^\star)$, and 2) $y^\star$ is an approximate local maximum for $f(x^\star, \cdot)$, then $x^\star$ is also an approximate local minimum for the greedy max function $g(\cdot, y^\star)$.
This allows us to design an algorithm which obtains a greedy min-max point by minimizing the computationally tractable lower bound $h$, instead of the greedy max function which may be intractable to compute.</p>
<h2 id="to-conclude">To conclude</h2>
<p>In this post we have shown how to extend a notion of second-order equilibrium from minimization to min-max optimization; this equilibrium is guaranteed to exist for any bounded Lipschitz function with Lipschitz gradient and Hessian.
We have also shown that our algorithm is able to find this equilibrium in polynomial time from any initial point.</p>
<blockquote>
<p>Our results do not require any additional assumptions such as convexity, monotonicity, or sufficient bilinearity.</p>
</blockquote>
<p>In an upcoming blog post we will show how one can use some of the ideas from here to obtain a new min-max optimization algorithm with applications to stably training GANs.</p>
Wed, 24 Jun 2020 03:00:00 -0700
http://offconvex.github.io/2020/06/24/equilibrium-min-max/
Exponential Learning Rate Schedules for Deep Learning (Part 1)<p>This blog post concerns our <a href="https://arxiv.org/pdf/1910.07454.pdf">ICLR20 paper</a> on a surprising discovery about learning rate (LR), the most basic hyperparameter in deep learning.</p>
<p>As illustrated in many online blogs, setting LR too small might slow down the optimization, and setting it too large might make the network overshoot the area of low losses. The standard mathematical analysis for the right choice of LR relates it to <a href="https://en.wikipedia.org/wiki/Smoothness">smoothness</a> of the loss function.</p>
<p>Many practitioners use a ‘step decay’ LR schedule, which systematically drops the LR after specific training epochs. One often hears the intuition—with some mathematical justification if one treats SGD as a random walk in the loss landscape—that large learning rates are useful in the initial (“exploration”) phase of training whereas lower rates in later epochs allow a slow settling down to a local minimum in the landscape. Intriguingly, this intuition is called into question by the success of exotic learning rate schedules such as <a href="https://arxiv.org/abs/1608.03983">cosine</a> (Loshchilov &amp; Hutter, 2016), and <a href="https://arxiv.org/abs/1506.01186">triangular</a> (Smith, 2015), featuring an oscillatory LR. These divergent approaches suggest that LR, the most basic and intuitive hyperparameter in deep learning, has not revealed all its mysteries yet.</p>
<div style="text-align:center;">
<img style="width:450px;" src="http://www.offconvex.org/assets/lr_schedules.png" />
<br />
<b>Figure 1.</b> Examples of Step Decay, Triangular and Cosine LR schedules.
</div>
<p><br /></p>
<h1 id="surprise-exponentially-increasing-lr">Surprise: Exponentially increasing LR</h1>
<p>We report experiments that state-of-the-art networks for image recognition tasks can be trained with an exponentially increasing LR (ExpLR): in each iteration it is multiplied by $(1+\alpha)$ for some $\alpha > 0$. (The $\alpha$ can be varied over epochs.) Here $\alpha$ is not too small in our experiments, so as you would imagine, the LR hits astronomical values in no time. To the best of our knowledge, this is the first time such a rate schedule has been successfully used, let alone for highly successful architectures. In fact, as we will see below, the reason we even did this bizarre experiment was that we already had a mathematical proof that it would work. Specifically, we could show that such ExpLR schedules are at least as powerful as the standard step-decay ones, by which we mean that ExpLR can let us achieve (in function space) all the nets obtainable via the currently popular step-decay schedules.</p>
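To get a feel for how quickly the LR blows up, here is the schedule written out for illustrative (not tuned) values of $\eta_0$ and $\alpha$:

```python
# ExpLR: the LR is multiplied by (1 + alpha) at every iteration,
# so after t iterations it equals eta_0 * (1 + alpha)**t.
eta0, alpha = 0.1, 0.01          # illustrative values only

def explr(t):
    return eta0 * (1 + alpha) ** t

for t in [0, 1000, 10000, 50000]:
    print(t, explr(t))
# By t = 10000 the LR already exceeds 1e41 -- "astronomical" indeed.
```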
<h2 id="so-why-does-this-work">So why does this work?</h2>
<p>One key property of state-of-the-art nets we rely on is that they all use some normalization of parameters within layers, usually Batch Norm (BN), which has been shown to give benefits in optimization and generalization across architectures. Our result also holds for other normalizations, including Group Normalization (Wu & He, 2018), Layer Normalization (Ba et al., 2016), Instance Norm (Ulyanov et al., 2016), etc.</p>
<p>The second key property of current training practice is the use of weight decay (aka $\ell_2$ regularizer). When combined with BN, this implies strange dynamics in parameter space, and the experimental papers (<a href="https://arxiv.org/abs/1706.05350">van Laarhoven, 2017</a>, <a href="https://arxiv.org/abs/1803.01814">Hoffer et al., 2018a</a> and <a href="https://openreview.net/forum?id=B1lz-3Rct7">Zhang et al., 2019</a>) noticed that combining BN and weight decay can be viewed as increasing the LR.</p>
<p>Our paper gives a rigorous proof of the power of ExpLR by showing the following about the end-to-end function being computed (see Main Thm in the paper):</p>
<blockquote>
<p>(Informal Theorem) For commonly used values of the parameters, every net produced by <em>Weight Decay + Constant LR + BN + Momentum</em> can also be produced (in function space) via <em>ExpLR + BN + Momentum</em>.</p>
</blockquote>
<p>NB: If the LR is not fixed but decaying in discrete steps, then the equivalent ExpLR training decays the exponent. (See our paper for details.)</p>
<p>At first sight such a claim may seem difficult (if not impossible) to prove given that we lack any mathematical characterization of nets produced by training (note that the theorem makes no mention of the dataset!). The equivalence is shown by reasoning about the <em>trajectory</em> of optimization, instead of the usual “landscape view” of stationary points, gradient norms, Hessian norms, smoothness, etc. This is an example of the importance of trajectory analysis, as argued in an <a href="http://www.offconvex.org/2019/06/03/trajectories/">earlier blog post of Sanjeev’s</a>, because optimization and generalization are deeply intertwined for deep learning. Conventional wisdom says LR controls optimization, and the regularizer controls generalization. Our result shows that the effect of weight decay can, under fairly normal conditions, be <em>exactly</em> realized by the ExpLR rate schedule.</p>
<div style="text-align:center;">
<img style="width:550px;" src="http://www.offconvex.org/assets/exp_lr.png" />
</div>
<p><strong>Figure 2.</strong> Training PreResNet32 on CIFAR10 with fixed LR $0.1$, momentum $0.9$ and other standard hyperparameters. The trajectory was unchanged when WD was turned off and the LR at iteration $t$ was $\tilde{\eta}_ t = 0.1\times1.481^t$. (The constant $1.481$ is predicted by our theory given the original hyperparameters.) The plot on the right shows the norm of the weights $\pmb{w}$ of the first convolutional layer in the second residual block. It grows exponentially, as one would expect, satisfying $|\pmb{w}_ t|_ 2^2/\tilde{\eta}_ t = $ constant.</p>
<h2 id="scale-invariance-and-equivalence">Scale Invariance and Equivalence</h2>
<p>The formal proof holds for any training loss satisfying
what we call <em>Scale Invariance</em>:</p>
\[L (c\cdot \pmb{\theta}) = L(\pmb{\theta}), \quad \forall \pmb{\theta}, \forall c >0.\]
<p>BN and other normalization schemes result in a Scale-Invariant Loss for the popular deep architectures (Convnet, Resnet, DenseNet etc.) if the output layer – where normally no normalization is used – is fixed throughout training. Empirically, <a href="https://openreview.net/forum?id=S1Dh8Tg0-">Hoffer et al. (2018b)</a> found that randomly fixing the output layer at the start does not harm the final accuracy.
(Appendix C of our paper demonstrates scale invariance for various architectures; it is somewhat nontrivial.)</p>
<p>For batch ${\mathcal{B}} = \{ x_ i \} _ {i=1}^B$, network parameter ${\pmb{\theta}}$, we denote the network by $f_ {\pmb{\theta}}$ and the loss function at iteration $t$ by $L_ t(f_ {\pmb{\theta}}) = L(f_ {\pmb{\theta}}, {\mathcal{B}}_ t)$ . We also use $L_ t({\pmb{\theta}})$ for convenience. We say the network $f_ {\pmb{\theta}}$ is <em>scale invariant</em> if $\forall c>0$, $f_ {c{\pmb{\theta}}} = f_ {\pmb{\theta}}$, which implies the loss $L_ t$ is also scale invariant, i.e., $L_ t(c{\pmb{\theta}}_ t)=L_ t({\pmb{\theta}}_ t)$, $\forall c>0$. A key source of intuition is the following lemma provable via chain rule:</p>
<blockquote>
<p><strong>Lemma 1</strong>. A scale-invariant loss $L$ satisfies
(1). $\langle\nabla_ {\pmb{\theta}} L, {\pmb{\theta}} \rangle=0$ ;<br />
(2). $\left.\nabla_ {\pmb{\theta}} L \right|_ {\pmb{\theta} = \pmb{\theta}_ 0} = c \left.\nabla_ {\pmb{\theta}} L\right|_ {\pmb{\theta} = c\pmb{\theta}_ 0}$, for any $c>0$.</p>
</blockquote>
<p>The first property immediately implies, via the Pythagorean theorem, that $|{\pmb{\theta}}_ t|$ is monotonically increasing for SGD when WD is turned off. Building on this, <a href="https://arxiv.org/pdf/1812.03981.pdf">our previous work</a> with Kaifeng Lyu shows that GD with any fixed learning rate reaches an $\varepsilon$-approximate stationary point of a scale-invariant objective in $O(1/\varepsilon^2)$ iterations.</p>
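Both properties of Lemma 1 are easy to verify numerically. Below is a minimal sketch on a toy scale-invariant loss; the normalized linear model is just an illustrative stand-in for a batch-normalized network:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=4)

def loss(theta):
    # Depends only on the direction of theta, hence scale-invariant:
    # loss(c * theta) == loss(theta) for all c > 0.
    return np.tanh((theta @ x) / np.linalg.norm(theta))

def num_grad(f, theta, eps=1e-6):
    # Central-difference numerical gradient.
    g = np.zeros_like(theta)
    for i in range(theta.size):
        e = np.zeros_like(theta)
        e[i] = eps
        g[i] = (f(theta + e) - f(theta - e)) / (2 * eps)
    return g

theta = rng.normal(size=4)
g = num_grad(loss, theta)
print(abs(g @ theta))                                     # property (1): ~0
print(np.max(np.abs(num_grad(loss, 3 * theta) - g / 3)))  # property (2): ~0
```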
<div style="text-align:center;">
<img style="width:360px;" src="http://www.offconvex.org/assets/inv_lemma.png" />
<br />
<b>Figure 3.</b> Illustration of Lemma 1.
</div>
<p><br /></p>
<p>Below is the main result of the paper. We will explain the proof idea (using scale-invariance) in a later post.</p>
<blockquote>
<p><strong>Theorem 1 (Main, Informal).</strong> SGD on a scale-invariant objective with initial learning rate $\eta$, weight decay factor $\lambda$, and momentum factor $\gamma$ is equivalent to SGD without weight decay ($\tilde{\lambda} = 0$) and with momentum factor $\gamma$, where the ExpLR at iteration $t$ is $\tilde{\eta}_ t = \alpha^{-2t-1} \eta$, and $\alpha$ is a non-zero root of the equation
\(x^2-(1+\gamma-\lambda\eta)x + \gamma = 0.\)</p>
</blockquote>
<blockquote>
<p>Specifically, when momentum $\gamma=0$, the above schedule can be simplified as $\tilde{\eta}_ t = (1-\lambda\eta)^{-2t-1} \eta$.</p>
</blockquote>
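For the $\gamma = 0$ case the equivalence can be checked directly on any scale-invariant toy objective: SGD with WD and constant LR, and plain SGD with the ExpLR $\tilde{\eta}_ t = (1-\lambda\eta)^{-2t-1} \eta$, trace out the same sequence of directions $\pmb{\theta}_ t / \|\pmb{\theta}_ t\|$, hence the same functions. A minimal numpy sketch (the loss below is an arbitrary illustrative choice):

```python
import numpy as np

v = np.array([0.6, 0.8])

def grad(theta):
    # Gradient of the scale-invariant loss L(theta) = -<theta, v> / ||theta||.
    n = np.linalg.norm(theta)
    return -v / n + (theta @ v) * theta / n**3

eta, lam, T = 0.1, 5e-4, 200
th_wd = np.array([1.0, 0.2])     # run 1: weight decay + constant LR
th_exp = th_wd.copy()            # run 2: no weight decay + ExpLR

for t in range(T):
    th_wd = (1 - lam * eta) * th_wd - eta * grad(th_wd)
    eta_t = eta * (1 - lam * eta) ** (-2 * t - 1)   # ExpLR from Theorem 1, gamma = 0
    th_exp = th_exp - eta_t * grad(th_exp)

d1 = th_wd / np.linalg.norm(th_wd)
d2 = th_exp / np.linalg.norm(th_exp)
print(np.max(np.abs(d1 - d2)))   # ~0: identical in function space
```

The two parameter vectors differ only by the scaling $(1-\lambda\eta)^{-t}$, which a scale-invariant loss cannot see.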
<h3 id="sota-performance-with-exponential-lr">SOTA performance with exponential LR</h3>
<p>As mentioned, reaching state-of-the-art accuracy requires reducing the learning rate a few times. Suppose the training has $K$ phases, and the learning rate is divided by some constant $C_I>1$ when entering phase $I$. To realize the same effect with an exponentially increasing LR, we have:</p>
<blockquote>
<p><strong>Theorem 2:</strong> ExpLR with the modification below generates the same sequence of networks as Step Decay with momentum factor $\gamma$ and WD $\lambda$. We call it the <em>Tapered Exponential LR schedule</em> (TEXP).<br />
<strong>Modification when entering a new phase $I$</strong>: (1) switching to a smaller exponential growth rate; (2) dividing the current LR by $C_I$.</p>
</blockquote>
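Concretely, here is a sketch of how such a TEXP schedule can be assembled for the $\gamma = 0$ case, using the per-step growth factor $(1-\lambda\eta_I)^{-2}$ from Theorem 1 within each phase. The phase values are hypothetical, and the exact construction (especially with momentum) is in the paper:

```python
def texp_schedule(phases, lam):
    """Sketch of a Tapered-Exponential LR schedule (gamma = 0 case).
    phases: list of (eta_I, T_I) pairs from the Step Decay schedule being emulated;
    lam: weight-decay coefficient. Within a phase the LR grows by (1 - lam*eta_I)**-2
    per step; entering a new phase it is divided by C_I = eta_{I-1} / eta_I."""
    lrs, cur, prev_eta = [], None, None
    for eta_I, T_I in phases:
        if cur is None:
            cur = eta_I / (1 - lam * eta_I)      # tilde-eta_0 = (1 - lam*eta)^-1 * eta
        else:
            cur = cur / (prev_eta / eta_I)       # modification (2): divide by C_I
        growth = (1 - lam * eta_I) ** -2         # modification (1): smaller growth rate
        for _ in range(T_I):
            lrs.append(cur)
            cur *= growth
        prev_eta = eta_I
    return lrs

# Illustrative step-decay: LR 0.1 for 100 steps, then 0.01, then 0.001.
lrs = texp_schedule([(0.1, 100), (0.01, 100), (0.001, 100)], lam=5e-4)
```

Note that the schedule grows within each phase and drops at phase boundaries, which is exactly the shape visible in Figure 5.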
<div style="text-align:center;">
<img style="width:235px;" src="http://www.offconvex.org/assets/texp_lr.png" />
<img style="width:500px;" src="http://www.offconvex.org/assets/TEXP.png" />
</div>
<p><strong>Figure 5.</strong> PreResNet32 trained with Step Decay (as in Figure 1) and its corresponding TEXP schedule. As predicted by Theorem 2, they have similar trajectories and performances.</p>
<h2 id="conclusion">Conclusion</h2>
<p>We hope that this bit of theory and supporting experiments have changed your outlook on learning rates for deep learning.</p>
<p>A follow-up post will present the proof idea and give more insight into why ExpLR suggests a rethinking of the “landscape view” of optimization in deep learning.</p>
Fri, 24 Apr 2020 03:00:00 -0700
http://offconvex.github.io/2020/04/24/ExpLR1/
Ultra-Wide Deep Nets and Neural Tangent Kernel (NTK)<p>(Crossposted <a href="https://blog.ml.cmu.edu/2019/10/03/ultra-wide-deep-nets-and-the-neural-tangent-kernel-ntk/">at CMU ML</a>.)</p>
<p>Traditional wisdom in machine learning holds that there is a careful trade-off between training error and generalization gap. There is a “sweet spot” for the model complexity such that the model (i) is big enough to achieve reasonably good training error, and (ii) is small enough so that the generalization gap - the difference between test error and training error - can be controlled. A smaller model would give a larger training error, while making the model bigger would result in a larger generalization gap, both leading to larger test errors. This is described by the classical U-shaped curve for the test error when the model complexity varies (see Figure 1(a)).</p>
<p>However, it is common nowadays to use highly complex over-parameterized models like deep neural networks. These models are usually trained to achieve near zero error on the training data, and yet they still have remarkable performance on test data. <a href="https://arxiv.org/abs/1812.11118">Belkin et al. (2018)</a> characterized this phenomenon by a “double descent” curve which extends the classical U-shaped curve. It was observed that, as one increases the model complexity past the point where it can perfectly fit the training data (i.e., the <em>interpolation</em> regime is reached), the test error continues to drop! Interestingly, the best test error is often achieved by the largest model, which goes against the classical intuition about the “sweet spot.” The following figure from <a href="https://arxiv.org/abs/1812.11118">Belkin et al. (2018)</a> illustrates this phenomenon.</p>
<div style="text-align:center;">
<img style="width:700px;" src="http://www.offconvex.org/assets/belkinfig.jpg" />
<br />
<b>Figure 1.</b> Effect of increased model complexity on generalization: traditional belief vs actual practice.
</div>
<p><br /></p>
<p>Consequently one suspects that the training algorithms used in deep learning - (stochastic) gradient descent and its variants - somehow implicitly constrain the complexity of trained networks (i.e., “true number” of parameters), thus leading to a small generalization gap.</p>
<p>Since larger models often give better performance in practice, one may naturally wonder:</p>
<blockquote>
<p>How does an infinitely wide net perform?</p>
</blockquote>
<p>The answer to this question corresponds to the right end of Figure 1(b). This blog post is about a model that has attracted a lot of attention in the past year: deep learning in the regime where the width - namely, the number of channels in convolutional filters, or the number of neurons in fully-connected internal layers - goes to infinity. At first glance this approach may seem hopeless for both practitioners and theorists: all the computing power in the world is insufficient to train an infinite network, and theorists already have their hands full trying to figure out finite ones. But in math/physics there is a tradition of deriving insights into questions by studying them in the infinite limit, and indeed here too the infinite limit becomes easier for theory.</p>
<p>Experts may recall the connection between infinitely wide neural networks and kernel methods from 25 years ago by <a href="https://www.cs.toronto.edu/~radford/pin.abstract.html">Neal (1994)</a> as well as the recent extensions by <a href="https://openreview.net/forum?id=B1EA-M-0Z">Lee et al. (2018)</a> and <a href="https://arxiv.org/abs/1804.11271">Matthews et al. (2018)</a>. These kernels correspond to infinitely wide deep networks all of whose parameters are chosen randomly, and <em>only the top (classification) layer is trained</em> by gradient descent. Specifically, if $f(\theta,x)$ denotes the output of the network on input $x$ where $\theta$ denotes the parameters in the network, and $\mathcal{W}$ is an initialization distribution over $\theta$ (usually Gaussian with proper scaling), then the corresponding kernel is
\(\mathrm{ker} \left(x,x'\right) = \mathbb{E}_{\theta \sim \mathcal{W}}[f\left(\theta,x\right)\cdot f\left(\theta,x'\right)]\)
where $x,x’$ are two inputs.</p>
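For intuition, this expectation can be estimated by Monte Carlo at finite width. Below is a sketch for a one-hidden-layer ReLU net with illustrative NTK-style $1/\sqrt{m}$ output scaling and $w_r \sim \mathcal{N}(0, I/d)$; with these choices the diagonal value has the simple closed form $\mathrm{ker}(x,x) = \|x\|^2/(2d)$, since $w \cdot x \sim \mathcal{N}(0, \|x\|^2/d)$ and $\mathbb{E}[\mathrm{relu}(z)^2] = \sigma^2/2$ for $z \sim \mathcal{N}(0,\sigma^2)$:

```python
import numpy as np

rng = np.random.default_rng(0)
d, m, n_samples = 3, 64, 20000
x = rng.normal(size=d)

def f(W, a, z):
    # One-hidden-layer ReLU net with 1/sqrt(m) output scaling.
    return (a @ np.maximum(W @ z, 0.0)) / np.sqrt(m)

# Monte Carlo estimate of ker(x, x) = E_theta[f(theta, x)^2]
# over random initializations theta = (W, a).
vals = []
for _ in range(n_samples):
    W = rng.normal(size=(m, d)) / np.sqrt(d)   # w_r ~ N(0, I/d)
    a = rng.normal(size=m)
    vals.append(f(W, a, x) ** 2)

print(np.mean(vals), x @ x / (2 * d))          # the two numbers should be close
```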
<p>What about the more usual scenario when <em>all layers are trained</em>? Recently, <a href="https://arxiv.org/pdf/1806.07572.pdf">Jacot et al. (2018)</a> first observed that this is also related to a kernel named <em>neural tangent kernel (NTK)</em>, which has the form
\(\mathrm{ker} \left(x,x'\right) = \mathbb{E}_{\theta \sim \mathcal{W}}\left[\left\langle \frac{\partial f\left(\theta,x\right)}{\partial \theta}, \frac{\partial f\left(\theta,x'\right)}{\partial \theta}\right\rangle\right].\)</p>
<p>The key difference between the NTK and previously proposed kernels is that the NTK is defined through the inner product between the gradients of the network outputs with respect to the network parameters. This gradient arises from the use of the gradient descent algorithm. Roughly speaking, the following conclusion can be made for a sufficiently wide deep neural network trained by gradient descent:</p>
<blockquote>
<p>A properly randomly initialized <strong>sufficiently wide</strong> deep neural network <strong>trained by gradient descent</strong> with infinitesimal step size (a.k.a. gradient flow) is <strong>equivalent to a kernel regression predictor</strong> with a <strong>deterministic</strong> kernel called <em>neural tangent kernel (NTK)</em>.</p>
</blockquote>
<p>This was more or less established in the original paper of <a href="https://arxiv.org/pdf/1806.07572.pdf">Jacot et al. (2018)</a>, but they required the width of every layer to go to infinity in a sequential order. In <a href="https://arxiv.org/abs/1904.11955">our recent paper</a> with Sanjeev Arora, Zhiyuan Li, Ruslan Salakhutdinov and Ruosong Wang, we improve this result to the non-asymptotic setting where the width of every layer only needs to be greater than a certain finite threshold.</p>
<p>In the rest of this post we will first explain how NTK arises and the idea behind the proof of the equivalence between wide neural networks and NTKs. Then we will present experimental results showing how well infinitely wide neural networks perform in practice.</p>
<h2 id="how-does-neural-tangent-kernel-arise">How Does Neural Tangent Kernel Arise?</h2>
<p>Now we describe how training an ultra-wide fully-connected neural network leads to kernel regression with respect to the NTK. A more detailed treatment is given in <a href="https://arxiv.org/abs/1904.11955">our paper</a>. We first specify our setup. We consider the standard supervised learning setting, in which we are given $n$ training data points ${(x_i,y_i)}_{i=1}^n \subset \mathbb{R}^{d}\times\mathbb{R}$ drawn from some underlying distribution and wish to find a function that, given the input $x$, predicts the label $y$ well on the data distribution. We consider a fully-connected neural network defined by $f(\theta, x)$, where $\theta$ is the collection of all the parameters in the network and $x$ is the input. For simplicity we only consider a neural network with a single output, i.e., $f(\theta, x) \in \mathbb{R}$, but the generalization to multiple outputs is straightforward.</p>
<p>We consider training the neural network by minimizing the quadratic loss over training data:
\(\ell(\theta) = \frac{1}{2}\sum_{i=1}^n (f(\theta,x_i)-y_i)^2.\)
Gradient descent with infinitesimally small learning rate (a.k.a. gradient flow) is applied on this loss function $\ell(\theta)$: \(\frac{d \theta(t)}{dt} = - \nabla \ell(\theta(t)),\)
where $\theta(t)$ denotes the parameters at time $t$.</p>
<p>Let us define some useful notation. Denote $u_i = f(\theta, x_i)$, which is the network’s output on $x_i$. We let $u=(u_1, \ldots, u_n)^\top \in \mathbb{R}^n$ be the collection of the network outputs on all training inputs. We use the time index $t$ for all variables that depend on time, e.g. $u_i(t), u(t)$, etc. With this notation the training objective can be conveniently written as $\ell(\theta) = \frac12 |u-y|_2^2$.</p>
<p>Using simple differentiation, one can obtain the dynamics of $u(t)$ as follows: (see <a href="https://arxiv.org/abs/1904.11955">our paper</a> for a proof)
\(\frac{du(t)}{dt} = -H(t)\cdot(u(t)-y),\)
where $H(t)$ is an $n\times n$ positive semidefinite matrix whose $(i, j)$-th entry is $\left\langle \frac{\partial f(\theta(t), x_i)}{\partial\theta}, \frac{\partial f(\theta(t), x_j)}{\partial\theta} \right\rangle$.</p>
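As a sanity check of this formula in the simplest possible case: for a linear model $f(\theta, x) = \theta^\top x$ the feature map is $x$ itself, $H$ is the constant Gram matrix $X X^\top$, and a discrete GD step on $\theta$ moves the output vector $u$ by exactly $-\eta H (u - y)$:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, eta = 4, 6, 1e-3
X = rng.normal(size=(n, d))      # training inputs as rows
y = rng.normal(size=n)
theta = rng.normal(size=d)

H = X @ X.T                      # H_ij = <df/dtheta at x_i, df/dtheta at x_j> = x_i . x_j
u = X @ theta                    # model outputs on the training set
grad = X.T @ (u - y)             # gradient of (1/2) ||u - y||^2 w.r.t. theta
u_next = X @ (theta - eta * grad)

print(np.max(np.abs((u_next - u) + eta * H @ (u - y))))   # 0 up to float precision
```

For a nonlinear network the same relation holds in the gradient-flow limit, with $H(t)$ varying over time.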
<p>Note that $H(t)$ is the <em>kernel matrix</em> of the following (time-varying) kernel $ker_t(\cdot,\cdot)$ evaluated on the training data:
\(ker_t(x,x') = \left\langle \frac{\partial f(\theta(t), x)}{\partial\theta}, \frac{\partial f(\theta(t), x')}{\partial\theta} \right\rangle, \quad \forall x, x' \in \mathbb{R}^{d}.\)
In this kernel an input $x$ is mapped to a feature vector $\phi_t(x) = \frac{\partial f(\theta(t), x)}{\partial\theta}$ defined through the gradient of the network output with respect to the parameters at time $t$.</p>
<h3 id="the-large-width-limit">The Large Width Limit</h3>
<p>Up to this point we haven’t used the property that the neural network is very wide. The formula for the evolution of $u(t)$ is valid in general. In the large width limit, it turns out that the time-varying kernel $ker_t(\cdot,\cdot)$ is (with high probability) always close to a <em>deterministic</em> fixed kernel $ker_{\mathsf{NTK}}(\cdot,\cdot)$, which is the <strong>neural tangent kernel (NTK)</strong>. This property is proved in two steps, both requiring the large width assumption:</p>
<ol>
<li>
<p><strong>Step 1: Convergence to the NTK at random initialization.</strong> Suppose that the network parameters at initialization ($t=0$), $\theta(0)$, are i.i.d. Gaussian. Then under proper scaling, for any pair of inputs $x, x’$, it can be shown that the random variable $ker_0(x,x’)$, which depends on the random initialization $\theta(0)$, converges in probability to the deterministic value $ker_{\mathsf{NTK}}(x,x’)$, in the large width limit.</p>
<p>(Technically speaking, there is a subtlety about how to define the large width limit. <a href="https://arxiv.org/pdf/1806.07572.pdf">Jacot et al. (2018)</a> gave a proof for the sequential limit where the width of every layer goes to infinity one by one. Later <a href="https://arxiv.org/abs/1902.04760">Yang (2019)</a> considered a setting where all widths go to infinity at the same rate. <a href="https://arxiv.org/abs/1904.11955">Our paper</a> improves them to the non-asymptotic setting, where we only require all layer widths to be larger than a finite threshold, which is the weakest notion of limit.)</p>
</li>
<li>
<p><strong>Step 2: Stability of the kernel during training.</strong> Furthermore, the kernel <em>barely changes</em> during training, i.e., $ker_t(x,x’) \approx ker_0(x,x’)$ for all $t$. The reason behind this is that the weights do not move much during training, namely $\frac{|\theta(t) - \theta(0)|}{|\theta(0)|} \to 0$ as width $\to\infty$. Intuitively, when the network is sufficiently wide, each individual weight only needs to move a tiny amount in order to have a non-negligible change in the network output. This turns out to be true when the network is trained by gradient descent.</p>
</li>
</ol>
<p>Combining the above two steps, we conclude that for any two inputs $x, x’$, with high probability we have
\(ker_t(x,x') \approx ker_0(x,x') \approx ker_{\mathsf{NTK}}(x,x'), \quad \forall t.\)
As we have seen, the dynamics of gradient descent is closely related to the time-varying kernel $ker_t(\cdot,\cdot)$. Now that we know that $ker_t(\cdot,\cdot)$ is essentially the same as the NTK, with a few more steps, we can eventually establish the equivalence between trained neural network and NTK: the final learned neural network at time $t=\infty$, denoted by $f_{\mathsf{NN}}(x) = f(\theta(\infty), x)$, is equivalent to the <em>kernel regression</em> solution with respect to the NTK. Namely, for any input $x$ we have
\(f_{\mathsf{NN}}(x) \approx f_{\mathsf{NTK}}(x) = ker_{\mathsf{NTK}}(x, X)^\top \cdot ker_{\mathsf{NTK}}(X, X)^{-1} \cdot y,\)
where $ker_{\mathsf{NTK}}(x, X) = (ker_{\mathsf{NTK}}(x, x_1), \ldots, ker_{\mathsf{NTK}}(x, x_n))^\top \in \mathbb{R}^n$, and $ker_{\mathsf{NTK}}(X, X) $ is an $n\times n$ matrix whose $(i, j)$-th entry is $ker_{\mathsf{NTK}}(x_i, x_j)$.</p>
<p>(In order to not have a bias term in the kernel regression solution we also assume that the network output at initialization is small: $f(\theta(0), x)\approx0$; this can be ensured by e.g. scaling down the initialization magnitude by a large constant, or replicating a network with opposite signs on the top layer at initialization.)</p>
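The whole pipeline can be sketched at finite width with explicit gradient features for a one-hidden-layer ReLU net; the $1/\sqrt{m}$ scaling and $\pm 1$ top layer below are illustrative choices, not the paper's exact setup:

```python
import numpy as np

rng = np.random.default_rng(0)
d, m, n = 3, 1024, 5
W = rng.normal(size=(m, d))                  # hidden weights at initialization
a = rng.choice([-1.0, 1.0], size=m)          # top layer

def ntk_features(x):
    # Gradient of f(theta, x) = (1/sqrt(m)) * sum_r a_r * relu(w_r . x)
    # with respect to theta = (W, a), flattened into one feature vector,
    # so that ker_0(x, x') = phi(x) . phi(x').
    z = W @ x
    dW = (a * (z > 0))[:, None] * x[None, :] / np.sqrt(m)
    da = np.maximum(z, 0.0) / np.sqrt(m)
    return np.concatenate([dW.ravel(), da])

X = rng.normal(size=(n, d))                  # toy training set
y = rng.normal(size=n)
Phi = np.stack([ntk_features(x) for x in X])
K = Phi @ Phi.T                              # empirical NTK matrix at init

# Kernel regression predictor: ker(x, X)^T K^{-1} y.
x_test = rng.normal(size=d)
pred = ntk_features(x_test) @ Phi.T @ np.linalg.solve(K, y)
print(pred)
```

Note that the predictor interpolates the training data, as kernel regression with an invertible kernel matrix always does.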
<h2 id="how-well-do-infinitely-wide-neural-networks-perform-in-practice">How Well Do Infinitely Wide Neural Networks Perform in Practice?</h2>
<p>Having established this equivalence, we can now address the question of how well infinitely wide neural networks perform in practice — we can just evaluate the kernel regression predictors using the NTKs! We test NTKs on a standard image classification dataset, CIFAR-10. Note that for image datasets, one needs to use convolutional neural networks (CNNs) to achieve good performance. Therefore, we derive an extension of NTK, <em>convolutional neural tangent kernels (CNTKs)</em> and test their performance on CIFAR-10. In the table below, we report the classification accuracies of different CNNs and CNTKs:</p>
<div style="text-align:center;">
<img style="width:700px;" src="http://www.offconvex.org/assets/cntk_acc.jpeg" />
<br />
</div>
<p>Here CNN-Vs are vanilla practically-wide CNNs (without pooling), and CNTK-Vs are their NTK counterparts. We also test CNNs with global average pooling (GAP), denoted above as CNN-GAPs, and their NTK counterparts, CNTK-GAPs. For all experiments, we turn off batch normalization, data augmentation, etc., and only use SGD to train CNNs (for CNTKs, we use the closed-form formula of kernel regression).</p>
<p>We find that CNTKs are actually very powerful kernels. The best kernel we find, the 11-layer CNTK with GAP, achieves 77.43% classification accuracy on CIFAR-10. This sets a significant new benchmark for the performance of a pure kernel-based method on CIFAR-10, being 10% higher than the methods reported by <a href="https://openreview.net/forum?id=B1g30j0qF7">Novak et al. (2019)</a>. The CNTKs also perform similarly to their CNN counterparts. This means that ultra-wide CNNs can achieve reasonable test performance on CIFAR-10.</p>
<p>It is also interesting to see that the global average pooling operation can significantly increase the classification accuracy for both CNNs and CNTKs. From this observation, we suspect that many techniques that improve the performance of neural networks are in some sense universal, i.e., these techniques might benefit kernel methods as well.</p>
<h2 id="concluding-thoughts">Concluding Thoughts</h2>
<p>Understanding the surprisingly good performance of over-parameterized deep neural networks is definitely a challenging theoretical question. Now, at least we have a better understanding of a class of ultra-wide neural networks: they are captured by neural tangent kernels! A hurdle that remains is that the classic generalization theory for kernels is still incapable of giving realistic bounds for generalization. But at least we now know that better understanding of kernels can lead to better understanding of deep nets.</p>
<p>Another fruitful direction is to “translate” different architectures/tricks of neural networks to kernels and to check their practical performance. We have found that global average pooling can significantly boost the performance of kernels, so we hope other tricks like batch normalization, dropout, max-pooling, etc. can also benefit kernels. Similarly, one can try to translate other architectures like recurrent neural networks, graph neural networks, and transformers, to kernels as well.</p>
<p>Our study also shows that there is a performance gap between infinitely wide networks and finite ones. How to explain this gap is an important theoretical question.</p>
Thu, 03 Oct 2019 03:00:00 -0700
http://offconvex.github.io/2019/10/03/NTK/
Understanding implicit regularization in deep learning by analyzing trajectories of gradient descent<p>Sanjeev’s <a href="http://www.offconvex.org/2019/06/03/trajectories/">recent blog post</a> suggested that the conventional view of optimization is insufficient for understanding deep learning, as the value of the training objective does not reliably capture generalization.
He argued that instead, we need to consider the <em>trajectories</em> of optimization.
One of the illustrative examples given was our <a href="https://arxiv.org/abs/1905.13655">new paper with Sanjeev Arora and Yuping Luo</a>, which studies the use of deep linear neural networks for solving <a href="https://en.wikipedia.org/wiki/Matrix_completion"><em>matrix completion</em></a> more accurately than the classic convex programming approach.
The current post provides more details on this result.</p>
<p>Recall that in matrix completion we are given some entries $\{ M_{i, j} : (i, j) \in \Omega \}$ of an unknown <em>ground truth</em> matrix $M$, and our goal is to recover the remaining entries.
This can be thought of as a supervised learning (regression) problem, where the training examples are the observed entries of $M$, the model is a matrix $W$ trained with the loss:
[
L(W) = \sum\nolimits_{(i, j) \in \Omega} (W_{i, j} - M_{i, j})^2 ~,
]
and generalization corresponds to how similar $W$ is to $M$ in the unobserved locations.
Obviously the problem is ill-posed if we assume nothing about $M$ $-$ the loss $L(W)$ is underdetermined, i.e. has multiple optima, and it would be impossible to tell (without access to unobserved entries) if one solution is better than another.
The standard assumption (which has many <a href="https://en.wikipedia.org/wiki/Matrix_completion#Applications">practical applications</a>) is that the ground truth matrix $M$ is low-rank, and thus the goal is to find, from among all global minima of the loss $L(W)$, one with minimal rank.
The classic algorithm for achieving this is to find the matrix with minimum <a href="https://en.wikipedia.org/wiki/Matrix_norm#Schatten_norms"><em>nuclear norm</em></a>.
This is a convex program, which <em>given enough observed entries</em> (and under mild technical assumptions $-$ “incoherence”) recovers the ground truth exactly (cf. <a href="https://statweb.stanford.edu/~candes/papers/MatrixCompletion.pdf">Candes and Recht</a>).
We’re interested in the regime where the number of revealed entries is too small for the classic algorithm to succeed.
There it can be beaten by a simple deep learning approach, as described next.</p>
<h2 id="linear-neural-networks-lnn">Linear Neural Networks (LNN)</h2>
<p>A linear neural network (LNN) is a fully-connected neural network with linear activation (i.e. no non-linearity).
If $W_j$ is the weight matrix in layer $j$ of a depth $N$ network, the <em>end-to-end matrix</em> is given by $W = W_N W_{N-1} \cdots W_1$.
Our method for solving matrix completion involves minimizing the loss $L(W)$ by running gradient descent (GD) on this (over-)parameterization, with depth $N \geq 2$ and hidden dimensions that do not constrain rank.
This can be viewed as a deep learning problem with $\ell_2$ loss, and GD can be implemented through the chain rule as usual.
Note that the training objective does not include any regularization term controlling the individual layer matrices $\{ W_j \}_j$.</p>
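A minimal numpy sketch of the procedure (toy sizes, observation pattern, and hyperparameters chosen for illustration, not those from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, N = 10, 1, 3                       # matrix size, ground-truth rank, depth
M = rng.normal(size=(d, r)) @ rng.normal(size=(r, d))    # low-rank ground truth
mask = rng.random((d, d)) < 0.5          # observed entries Omega

# Depth-N linear net with full hidden dimensions (rank is not constrained),
# initialized near zero; no explicit regularization anywhere.
Ws = [0.1 * rng.normal(size=(d, d)) for _ in range(N)]

def end_to_end(Ws):
    W = Ws[0]
    for Wj in Ws[1:]:
        W = Wj @ W
    return W

def observed_loss(W):
    return np.sum((mask * (W - M)) ** 2)

init_loss = observed_loss(end_to_end(Ws))
lr = 0.01
for step in range(3000):
    W = end_to_end(Ws)
    G = 2 * mask * (W - M)               # dL/dW for the end-to-end matrix
    # Chain rule: dL/dW_j = (W_N ... W_{j+1})^T G (W_{j-1} ... W_1)^T
    new_Ws = []
    for j in range(N):
        left = np.eye(d)
        for Wk in Ws[j + 1:]:
            left = Wk @ left
        right = np.eye(d)
        for Wk in Ws[:j]:
            right = Wk @ right
        new_Ws.append(Ws[j] - lr * left.T @ G @ right.T)
    Ws = new_Ws

print(observed_loss(end_to_end(Ws)))     # should be far below the initial loss
```

Which of the many global minima of the observed loss this procedure reaches — and how well it matches $M$ off the observed entries — is exactly the implicit-regularization question studied in the paper.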
<p>At first glance our algorithm seems naive, since parameterization by an LNN (that does not constrain rank) is equivalent to parameterization by a single matrix $W$, and obviously running GD on $L(W)$ directly with no regularization is not a good approach (nothing will be learned in the unobserved locations).
However, since matrix completion is an underdetermined problem (has multiple optima), the optimum reached by GD can vary depending on the chosen parameterization.
Our setup isolates the role of over-parameterization in implicitly biasing GD towards certain optima (that hopefully generalize well).</p>
<p>Note that in the special case of depth $N = 2$ our method reduces to a traditional approach for matrix completion, known as <em>matrix factorization</em>.
By analogy, we refer to the case $N \geq 3$ as <em>deep matrix factorization</em>.
The table below shows reconstruction errors (generalization) on a matrix completion task where the number of observed entries is too small for nuclear norm minimization to succeed.
As can be seen, it is outperformed by matrix factorization, which itself is outperformed by deep matrix factorization.</p>
<div style="text-align:center;">
<img style="width:700px;" src="http://www.offconvex.org/assets/trajectories-linear-nets-exp-reconst-errs.png" />
<br />
<b>Table 1:</b> Results for matrix completion with a small number of observations.
</div>
<p><br />
The main focus of our paper is on developing a theoretical understanding of this phenomenon.</p>
<h2 id="trajectory-analysis-implicit-regularization-towards-low-rank">Trajectory Analysis: Implicit Regularization Towards Low Rank</h2>
<p>We are interested in understanding what end-to-end matrix $W$ emerges when we run GD on an LNN to minimize a general convex loss $L(W)$, and in particular the matrix completion loss given above.
Note that $L(W)$ is convex, but the objective obtained by over-parameterizing with an LNN is not.
We analyze the trajectories of $W$, and specifically the dynamics of its singular value decomposition.
Denote the singular values by $\{ \sigma_r \}_r$, and the corresponding left and right singular vectors by $\{ \mathbf{u}_r \}_r$ and $\{ \mathbf{v}_r \}_r$ respectively.</p>
<p>We start by considering GD applied to $L(W)$ directly (no over-parameterization).</p>
<blockquote>
<p><strong>Known result:</strong>
Minimizing $L(W)$ directly by GD (with small learning rate $\eta$) leads the singular values of $W$ to evolve by:
\[
\sigma_r(t + 1) \leftarrow \sigma_r(t) - \eta \cdot \langle \nabla L(W(t)) , \mathbf{u}_r(t) \mathbf{v}_r^\top(t) \rangle ~.
\qquad (1)
\]</p>
</blockquote>
<p>This statement implies that the movement of a singular value is proportional to the projection of the gradient onto the corresponding singular component.</p>
<p>Now suppose that we parameterize $W$ with an $N$-layer LNN, i.e. as $W = W_N W_{N-1} \cdots W_1$.
In previous work (described in <a href="http://www.offconvex.org/2018/03/02/acceleration-overparameterization/">Nadav’s earlier blog post</a>) we have shown that running GD on the LNN, with small learning rate $\eta$ and initialization close to the origin, leads the end-to-end matrix $W$ to evolve by:</p>
\[W(t+1) \leftarrow W(t) - \eta \cdot \sum\nolimits_{j=1}^{N} \left[ W(t) W^\top(t) \right]^\frac{j-1}{N} \nabla{L}(W(t)) \left[ W^\top(t) W(t) \right]^\frac{N-j}{N} ~.\]
<p>In the new paper we rely on this result to prove the following:</p>
<blockquote>
<p><strong>Theorem:</strong>
Minimizing $L(W)$ by running GD (with small learning rate $\eta$ and initialization close to the origin) on an $N$-layer LNN leads the singular values of $W$ to evolve by:
\[
\sigma_r(t + 1) \leftarrow \sigma_r(t) - \eta \cdot \langle \nabla L(W(t)) , \mathbf{u}_r(t) \mathbf{v}_r^\top(t) \rangle \cdot \color{purple}{N \cdot (\sigma_r(t))^{2 - 2 / N}} ~.
\]</p>
</blockquote>
<p>Comparing this to Equation $(1)$, we see that over-parameterizing the loss $L(W)$ with an $N$-layer LNN introduces the multiplicative factors $\color{purple}{N \cdot (\sigma_r(t))^{2 - 2 / N}}$ to the evolution of singular values.
While the constant $N$ does not change the relative dynamics (it can be absorbed into the learning rate $\eta$), the terms $(\sigma_r(t))^{2 - 2 / N}$ do $-$ they enhance the movement of large singular values and, on the other hand, attenuate that of small ones.
Moreover, the enhancement/attenuation becomes more significant as $N$ (network depth) grows.</p>
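<p>The effect of these factors is easy to see numerically. The following sketch (with illustrative values only) evaluates $N \cdot \sigma^{2 - 2/N}$ for a few singular values and depths $-$ note that depth $N = 1$ recovers the unit factor of Equation $(1)$:</p>

```python
import numpy as np

def depth_factor(sigma, N):
    """Multiplicative factor on the movement of a singular value
    of size sigma, for an N-layer LNN."""
    return N * sigma ** (2.0 - 2.0 / N)

for sigma in (0.01, 1.0, 10.0):
    print(sigma, [depth_factor(sigma, N) for N in (1, 2, 3)])
```

<p>For small singular values the factor shrinks as depth grows (slower movement), while for large ones it grows (faster movement).</p>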
<div style="text-align:center;">
<img style="width:900px;" src="http://www.offconvex.org/assets/trajectories-linear-nets-thm-dynamics.png" />
<br />
<b>Figure 1:</b> Over-parameterizing with LNN modifies dynamics of singular values.
</div>
<p><br /></p>
<p>The enhancement/attenuation effect induced by an LNN (factors $\color{purple}{N \cdot (\sigma_r(t))^{2 - 2 / N}}$) leads each singular value to progress very slowly after initialization, when close to zero, and then, upon reaching a certain threshold, to move rapidly, with the transition from slow to rapid movement being sharper for deeper networks (larger $N$).
If the loss $L(W)$ is underdetermined (has multiple optima), these dynamics promote solutions that have a few large singular values and many small ones (which have yet to reach the transition from slow to rapid movement), with a gap that is more extreme the deeper the network is.
This is an implicit regularization towards low rank, which intensifies with depth.
In the paper we support the intuition with empirical evaluations and theoretical illustrations, demonstrating how adding depth to an LNN can lead GD to produce solutions closer to low-rank.
For example, the following plots, taken from a matrix completion task, show the evolution of singular values throughout the training of networks of varying depths $-$ as can be seen, adding layers indeed yields a final solution whose spectrum is closer to low-rank, thereby improving generalization.</p>
<div style="text-align:center;">
<img style="width:900px;" src="http://www.offconvex.org/assets/trajectories-linear-nets-exp-dynamics.png" />
<br />
<b>Figure 2:</b> Dynamics of singular values in training matrix factorizations (LNN).
</div>
<h2 id="do-the-trajectories-minimize-some-regularized-objective">Do the Trajectories Minimize Some Regularized Objective?</h2>
<p>In recent years, researchers have come to realize the importance of implicit regularization induced by the choice of optimization algorithm.
The strong gravitational pull of the conventional view of optimization (see <a href="http://www.offconvex.org/2019/06/03/trajectories/">Sanjeev’s post</a>) has led most works in this line to try to capture the effect in the language of regularized objectives.
For example, it is known that over linear models, i.e. depth $1$ networks, GD finds the solution with minimal Frobenius norm (cf. Section 5 in <a href="https://openreview.net/pdf?id=Sy8gdB9xx">Zhang et al.</a>), and a common hypothesis is that this persists over more elaborate neural networks, with Frobenius norm potentially replaced by some other norm (or quasi-norm) that depends on network architecture.
<a href="https://papers.nips.cc/paper/7195-implicit-regularization-in-matrix-factorization.pdf">Gunasekar et al.</a> explicitly conjectured:</p>
<blockquote>
<p><strong>Conjecture (by <a href="https://papers.nips.cc/paper/7195-implicit-regularization-in-matrix-factorization.pdf">Gunasekar et al.</a>, informally stated):</strong>
GD (with small learning rate and near-zero initialization) training a matrix factorization finds a solution with minimum <a href="https://en.wikipedia.org/wiki/Matrix_norm#Schatten_norms">nuclear norm</a>.</p>
</blockquote>
<p>This conjecture essentially states that matrix factorization (i.e. $2$-layer LNN) trained by GD is equivalent to the famous method of nuclear norm minimization.
Gunasekar et al. motivated the conjecture with some empirical evidence, as well as mathematical evidence in the form of a proof for a (very) restricted setting.</p>
<p>Given the empirical observation by which adding depth to a matrix factorization can improve results in matrix completion, it would be natural to extend the conjecture of Gunasekar et al., and assert that the implicit regularization with depth $3$ or higher corresponds to minimizing some other norm (or quasi-norm) that approximates rank better than nuclear norm does.
For example, a natural candidate would be a <a href="https://en.wikipedia.org/wiki/Schatten_norm">Schatten-$p$ quasi-norm</a> with some $0 < p < 1$.</p>
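<p>For reference, the Schatten-$p$ quasi-norm of a matrix with singular values $\{ \sigma_r \}_r$ is $\big( \sum_r \sigma_r^p \big)^{1/p}$: $p = 1$ gives the nuclear norm, and the quantity $\sum_r \sigma_r^p$ tends to the rank as $p \to 0$, which is why smaller $p$ approximates rank better. A small NumPy sketch:</p>

```python
import numpy as np

def schatten_p(A, p):
    """Schatten-p (quasi-)norm: (sum_r sigma_r^p)^(1/p).
    A norm for p >= 1 (p = 1 is the nuclear norm),
    a quasi-norm for 0 < p < 1."""
    s = np.linalg.svd(A, compute_uv=False)
    return (s ** p).sum() ** (1.0 / p)

def schatten_p_pow(A, p):
    """sum_r sigma_r^p, which approaches rank(A) as p -> 0."""
    s = np.linalg.svd(A, compute_uv=False)
    return (s ** p).sum()

A = np.diag([10.0, 1.0, 0.1, 0.0])   # rank 3
nuclear = schatten_p(A, 1.0)         # = 10 + 1 + 0.1 = 11.1
near_rank = schatten_p_pow(A, 1e-3)  # close to 3
```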
<p>Our investigation began with this approach, but ultimately, we became skeptical of the entire “implicit regularization as norm minimization” line of reasoning, and in particular of the conjecture by Gunasekar et al.</p>
<blockquote>
<p><strong>Theorem (mathematical evidence against the conjecture):</strong>
In the same restricted setting for which Gunasekar et al. proved their conjecture, nuclear norm is minimized by GD over matrix factorization not only with depth $2$, but with any depth $\geq 3$ as well.</p>
</blockquote>
<p>This theorem disqualifies Schatten quasi-norms as the implicit regularization in deep matrix factorizations, and instead suggests that all depths correspond to nuclear norm.
However, empirically we found a notable difference in performance between different depths, so the conceptual leap from a proof in the restricted setting to a general conjecture, as done by Gunasekar et al., seems questionable.</p>
<p>In the paper we conduct a systematic set of experiments to empirically evaluate the conjecture.
We find that in the regime where nuclear norm minimization is suboptimal (few observed entries), matrix factorizations consistently outperform it (see for example Table 1).
This holds in particular with depth $2$, in contrast to the conjecture’s prediction.
Together, our theory and experiments lead us to believe that it may not be possible to capture the implicit regularization in LNN with a single mathematical norm (or quasi-norm).</p>
<p>Full details behind our results on “implicit regularization as norm minimization” can be found in Section 2 of <a href="https://arxiv.org/abs/1905.13655">the paper</a>.
The trajectory analysis we discussed earlier appears in Section 3 there.</p>
<h2 id="conclusion">Conclusion</h2>
<p>The <a href="http://www.offconvex.org/2019/06/03/trajectories/">conventional view of optimization</a> has been integral to the theory of machine learning.
Our study suggests that the associated vocabulary may not suffice for understanding generalization in deep learning, and one should instead analyze trajectories of optimization, taking into account that speed of convergence does not necessarily correlate with generalization.
We hope this work will motivate development of a new vocabulary for analyzing deep learning.</p>
Wed, 10 Jul 2019 10:00:00 -0700
http://offconvex.github.io/2019/07/10/trajectories-linear-nets/